1. Introduction
Mediterranean hurricanes (hereafter medicanes) are mesoscale cyclones that develop over the Mediterranean Sea and display tropical-like cyclone characteristics [1]. In their mature phase, these characteristics include a warm core extending into the upper troposphere, spiral cloud bands around an eye-like central feature, and a nearly symmetric near-surface wind circulation. Although infrequent, medicanes are relevant for risk management because they can produce heavy precipitation, strong winds, and coastal impacts over densely populated regions. Their relatively small spatial scale, hybrid structure, and rapid evolution make continuous monitoring difficult, especially during genesis and mature phases, and motivate the use of high-temporal-resolution geostationary satellite observations to support timely detection and early-warning activities.
Among the available observing systems, infrared imagery from the Spinning Enhanced Visible and InfraRed Imager (SEVIRI) aboard Meteosat Second Generation (MSG) is particularly suitable for near-real-time monitoring because it provides basin-scale coverage at a high refresh rate. However, turning these observations into automatic medicane-detection and center-localization products remains challenging. The rarity of positive events limits the amount of labeled data, while reference tracks are typically derived from reanalysis rather than from the satellite signal itself.
To address this need, this paper introduces DeMeTrA (Deep Learning Medicane Tracking Algorithm), a framework for medicane detection and rotation-center localization from SEVIRI Airmass RGB composite image sequences. Before detailing the methodology, the relevant scientific literature is reviewed below.
One promising way to exploit these high-frequency satellite observations is through machine-learning methods, especially deep-learning models based on multi-layer neural networks, which are well suited to the high-dimensional spatial and temporal structure of image sequences. The literature most relevant to this study spans non-learning, machine-learning, and end-to-end deep-learning approaches.
Before the adoption of learning-based approaches, tropical-cyclone analysis from satellite observations largely relied on manual interpretation methods, objective diagnostic schemes, and physically based tracking algorithms. A classical example is the Dvorak technique, which estimates tropical-cyclone intensity from satellite imagery through cloud-pattern analysis [2], while the Advanced Dvorak Technique later translated this operational logic into a more objective framework based on geostationary infrared observations [3]. Additional non-learning approaches include passive-microwave products such as MIMIC and ARCHER, the latter explicitly designed to determine the rotational center of tropical cyclones from satellite imagery in an objective way [4,5]. At the dynamical-analysis level, cyclone detection and tracking have also been addressed through algorithmic frameworks such as CycloTRACK and through intercomparison initiatives such as IMILAST, while composite reference datasets have more recently been developed to provide harmonized Mediterranean cyclone tracks [6,7,8]. For the Mediterranean basin specifically, complementary satellite-based observational studies have characterized medicanes in greater detail, from the precipitation structure of Medicane Ianos observed by GPM Core Observatory measurements [9] to passive-microwave analyses of warm-core depth, symmetry, and deep-convection signatures in individual medicanes [10]. This perspective was subsequently extended to a broader set of events over 2000–2021 through a passive-microwave-based diagnostic analysis of medicane structure and evolution [11], while additional satellite work has characterized the associated near-surface wind fields and further clarified the observational signature of these storms [12].
Machine-learning approaches occupy an intermediate position between these objective/manual methods and recent end-to-end deep-learning systems. Rather than learning directly from raw image sequences, these studies typically exploit structured predictors or lower-dimensional latent representations. For example, Olander et al. (2021) showed that machine-learning models driven by Advanced Dvorak Technique analysis parameters can improve tropical-cyclone intensity estimation while preserving operational interpretability [13]. For the Mediterranean basin, Roveri et al. (2025) proposed a Bayesian statistical-learning framework based on latent representations of wind-velocity fields to support cyclone detection and tracking [14]. These studies show that predictive skill can be extracted from compact or engineered representations, but they do not fully exploit the raw spatiotemporal information contained in satellite image sequences.
Deep-learning methods for cyclone analysis have addressed detection, classification, size estimation, intensity estimation, and track prediction, primarily for tropical cyclones and, more recently, for extratropical systems [15,16,17,18]. Much of this literature relies on convolutional neural networks, ConvLSTM architectures, or end-to-end feature learning directly from satellite imagery. These approaches are effective at capturing local spatial structure, but they are less explicit in modeling long-range dependencies across both space and time. Reviews comparing convolutional and Transformer-based vision models highlight the ability of self-attention to better exploit global context and large-scale pretraining [19].
For video analysis, this advantage is particularly relevant because temporal evolution is part of the signal rather than an auxiliary feature. VideoMAE and VideoMAE v2 extend masked autoencoding to spatiotemporal data and learn transferable video representations by reconstructing masked tubes from a sparse subset of visible tokens [20,21]. In particular, VideoMAE v2 further improves scalability through a dual-masking strategy that reduces computational cost while preserving strong downstream performance. This family of models is therefore attractive for meteorological applications, where labeled events are scarce but large archives of unlabeled geostationary imagery are available. For this reason, VideoMAE provides a strong spatiotemporal starting point for our study: it can first adapt to the dataset statistics through self-supervision and can then be fine-tuned for the downstream tasks of detection and tracking.
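The tube-masking idea mentioned above can be illustrated with a minimal Python sketch: one random spatial mask is drawn and repeated over all temporal token slices, so a hidden patch stays hidden for the whole clip. The grid size, the 0.9 mask ratio, and the function name `tube_mask` are illustrative assumptions, not settings taken from VideoMAE or from DeMeTrA.

```python
import random

def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=0):
    """Return mask[t][i] (True = masked) sharing one spatial pattern over time.

    All arguments are illustrative assumptions; this is not the VideoMAE code.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    # Choose which spatial patches are hidden, once per clip.
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    spatial = [i in masked_ids for i in range(num_patches)]
    # Repeat the same spatial pattern for every temporal slice: a "tube".
    return [list(spatial) for _ in range(num_frames)]

mask = tube_mask(num_frames=8, grid_h=14, grid_w=14)
# Number of visible (unmasked) tokens in each temporal slice.
visible_per_frame = [row.count(False) for row in mask]
```

In this toy setting only the visible tokens would be passed to the encoder, which is where the computational saving of masked autoencoding comes from.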
In contrast, the broader tropical-cyclone literature already contains several genuinely spatiotemporal deep-learning paradigms. For example, Ruttgers et al. (2019) proposed a generative adversarial network for typhoon-track prediction from past satellite imagery, showing that adding dynamical fields such as wind can substantially reduce mean track error [22]. Likewise, Dong et al. (2022) formulated 24 h track prediction as a sequence-to-sequence problem using ConvLSTM and spatial attention on GridSat-B1 infrared imagery with IBTrACS labels, predicting future position through density maps rather than direct coordinate regression [23]. More recently, Lagerquist et al. (2024/2025) introduced GeoCenter, an uncertainty-aware deep-learning approach for tropical-cyclone center fixing from high-temporal-resolution geostationary IR imagery, highlighting the growing relevance of calibrated uncertainty for operational center estimation [24]. Additional complementary evidence comes from Tong et al. (2022), who demonstrated strong cyclone-identification skill from infrared cloud imagery using deep convolutional networks [25], and from Manju M. S. et al. (2025), who combined CNN backbones with ConvLSTM to perform cyclone detection, classification, and intensity estimation from satellite image sequences [26]. Related intensity-estimation studies include Devaraj et al. (2021), who proposed a modified CNN pipeline for hurricane-intensity estimation and category-level disaster assessment from satellite-linked best-track data [27], and Maskey et al. (2020), who presented an objective deep-learning estimate of tropical-cyclone intensity from infrared imagery together with a production-oriented visualization portal for end users [28]. Although not strictly video-based, Kumler-Bonfanti et al. (2020) are also relevant because they address both tropical and extratropical cyclone detection, an important point for medicanes, which can emerge through extratropical-to-tropical-like transition [15]. From a methodological perspective, Nibali et al. (2018) introduced the DSNT layer for fully differentiable numerical coordinate regression with good spatial generalization, which is relevant when cyclone-center localization is posed as coordinate prediction [29]. Additional retrieval-focused work includes Wimmers et al. (2024), who used a multibranched U-Net to infer two-dimensional inner-core wind fields from microwave and infrared satellite imagery, and Griffin et al. (2024), who reported that CNN-based multisource models can improve current and short-term tropical-cyclone intensity-change prediction, especially when microwave imagery is included [30,31]. Finally, recent nowcasting work based on diffusion models and 3D U-Net architectures suggests that future forecasting systems may benefit from explicit video prediction of infrared brightness-temperature fields, provided that evaluation is aligned with cyclone-centric meteorological targets such as center error and structural symmetry rather than only pixelwise reconstruction [32].
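As an illustration of the differentiable coordinate regression discussed above, the following minimal Python sketch computes a DSNT-style estimate: a score map is normalized into a probability map and the output is the expected pixel coordinate in a normalized (-1, 1) frame. The softmax normalization and the toy 5x5 score map are assumptions made for this example; it is not the layer as implemented in the cited work or in DeMeTrA.

```python
import math

def dsnt(heatmap):
    """Expected (x, y) coordinate of a 2D score map, normalized to (-1, 1).

    Softmax normalization is an assumption made for this sketch.
    """
    h, w = len(heatmap), len(heatmap[0])
    # Softmax over all pixels turns raw scores into a probability map.
    peak = max(max(row) for row in heatmap)
    exp = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in exp)
    # Pixel-center coordinates, normalized to (-1, 1).
    xs = [(2 * j + 1 - w) / w for j in range(w)]
    ys = [(2 * i + 1 - h) / h for i in range(h)]
    # The output is the expected coordinate under the probability map.
    x = sum(exp[i][j] * xs[j] for i in range(h) for j in range(w)) / total
    y = sum(exp[i][j] * ys[i] for i in range(h) for j in range(w)) / total
    return x, y

# Toy 5x5 score map with a sharp peak at row 1, column 3.
scores = [[0.0] * 5 for _ in range(5)]
scores[1][3] = 20.0
x, y = dsnt(scores)
```

Because the output is a weighted average rather than an argmax, gradients can flow through it to the score map, which is what makes this formulation suitable for end-to-end training.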
Taken together, these studies show the broad value of satellite-based, machine-learning, and deep-learning approaches for cyclone monitoring, but they do not close the specific gap addressed here: the lack of a medicane-focused framework that combines self-supervised spatiotemporal adaptation on infrared sequences with downstream binary detection and center-coordinate regression. Against this background, DeMeTrA is formulated as an end-to-end workflow for medicane detection and rotation-center localization from SEVIRI Airmass RGB image sequences. Its implementation builds on VideoMAE v2 as the pretrained Transformer encoder, which is first specialized on unlabeled Mediterranean satellite clips and then fine-tuned for binary cyclone detection and coordinate regression of the cyclone center. In the resulting workflow, detection acts as a first filtering stage and tracking is applied only to clips classified as cyclonic.
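The detection-then-tracking logic described above can be sketched in a few lines of Python. The function name `run_cascade`, the 0.5 threshold, and the stand-in `detect` and `locate` callables are hypothetical and serve only to show the control flow; they are not the DeMeTrA implementation.

```python
def run_cascade(clips, detect, locate, threshold=0.5):
    """Apply detection first and localize only clips flagged as cyclonic."""
    outputs = []
    for clip in clips:
        prob = detect(clip)  # probability that the clip contains a cyclone
        # The center regressor runs only on positive detections.
        center = locate(clip) if prob >= threshold else None
        outputs.append({"prob": prob, "center": center})
    return outputs

# Stand-in models: each toy clip already carries its "prediction".
clips = [
    {"score": 0.92, "lat_lon": (36.5, 18.0)},
    {"score": 0.10, "lat_lon": (40.0, 5.0)},
]
results = run_cascade(
    clips,
    detect=lambda c: c["score"],
    locate=lambda c: c["lat_lon"],
)
```

One consequence of this design is that a missed detection also suppresses the center estimate, so detection recall bounds the coverage of the tracking stage.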
The main contributions of this work are threefold. First, we construct a medicane-oriented video dataset from more than 7.5 years of MSG observations over the Mediterranean and derive labels from the best-track dataset of Mediterranean cyclones for 1979–2020 [8], complemented by manual temporal refinements to better align the annotations with visible cloud rotation. Second, we develop a three-stage training strategy that combines self-supervised spatiotemporal representation learning with downstream detection and tracking tasks. Third, we evaluate the framework on cyclone-wise train/validation/test splits, including an imbalanced test configuration that better reflects operational conditions. Our objectives are to test whether a VideoMAE-based model can effectively support medicane detection and tracking from SEVIRI IR video sequences, and to define an end-to-end methodology for dataset construction, model specialization, and event-based evaluation that is suitable for near-real-time monitoring.
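A cyclone-wise split, as opposed to a clip-wise one, keeps every clip of a given event in a single partition, so that temporally adjacent clips of the same cyclone cannot leak between training and evaluation. A minimal Python sketch follows; the field name `cyclone_id`, the split fractions, and the toy data are assumptions made for illustration and do not reflect the actual dataset layout.

```python
import random

def split_by_cyclone(clips, fractions=(0.7, 0.15), seed=0):
    """Assign whole cyclones to train/val/test so no event spans two splits."""
    ids = sorted({c["cyclone_id"] for c in clips})
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    # Map each event to exactly one partition; the remainder goes to test.
    group = {}
    for k, cid in enumerate(ids):
        if k < n_train:
            group[cid] = "train"
        elif k < n_train + n_val:
            group[cid] = "val"
        else:
            group[cid] = "test"
    splits = {"train": [], "val": [], "test": []}
    for clip in clips:
        splits[group[clip["cyclone_id"]]].append(clip)
    return splits

# Toy data: 10 hypothetical events with 3 clips each.
clips = [{"cyclone_id": f"C{e:02d}", "clip": i} for e in range(10) for i in range(3)]
splits = split_by_cyclone(clips)
```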
The remainder of this paper is organized as follows. Section 2 describes the study data, the dataset-construction procedure, and the DeMeTrA training pipeline. Section 3 presents the experimental results for detection and tracking, while Section 4 discusses the main findings, limitations, and possible future developments.