Detection and Tracking of Medicanes Through DeMeTrA Self-Supervised Vision Transformer

Daniele D’Armiento; Stefano Sebastianelli; Leo Pio D’Adderio; Paolo Sanò; Daniele Casella; Giulia Panegrossi

doi:10.20944/preprints202605.1494.v1

Submitted: 20 May 2026

Posted: 22 May 2026

Abstract

Medicanes are mesoscale cyclones that develop over the Mediterranean Sea and display tropical-like cyclone characteristics, including a warm core, spiral cloud organization, and deep convection over warm sea surfaces. Since their structure and position can change rapidly on short lead times before coastal impact, robust near-real-time tracking algorithms are essential for timely warning and operational decision support. To advance this research direction, this work introduces the Deep learning Medicane Tracking (DeMeTrA) algorithm, an end-to-end deep-learning framework for medicane detection and rotation-center localization from SEVIRI Rapid Scan Airmass RGB imagery. The proposed methodology is a three-stage pipeline built on the VideoMAE v2 architecture, comprising: (i) self-supervised domain specialization on unlabeled satellite image sequences, (ii) supervised binary classification of cyclone versus non-cyclone events, and (iii) supervised coordinate regression for rotation-center tracking. The training corpus spans several time windows of Meteosat Second Generation observations across the Mediterranean basin, with ground-truth annotations derived from a consensus cyclone-track reference. On event-based splits, cyclone detection reaches 91% balanced accuracy on a balanced validation set and 89% on an unbalanced test set representative of operational conditions. Tracking results show generally low localization errors (mostly below 20 km), with limited outliers in the most complex cases. These findings support the use of Transformer-based video models for operational medicane monitoring and establish a baseline for future developments.

Keywords: medicanes; MSG SEVIRI; Deep learning; Vision Transformer; cyclone detection; cyclone tracking

Subject: Environmental and Earth Sciences - Remote Sensing

1. Introduction

Mediterranean hurricanes (hereafter medicanes) are mesoscale cyclones that develop over the Mediterranean Sea and display tropical-like cyclone characteristics [1]. In their mature phase, these characteristics include a warm core extending into the upper troposphere, spiral cloud bands around an eye-like central feature, and a nearly symmetric near-surface wind circulation. Although infrequent, medicanes are relevant for risk management because they can produce heavy precipitation, strong winds, and coastal impacts over densely populated regions. Their relatively small spatial scale, hybrid structure, and rapid evolution make continuous monitoring difficult, especially during genesis and mature phases, and motivate the use of high-temporal-resolution geostationary satellite observations to support timely detection and early-warning activities.

Among the available observing systems, infrared imagery from the Spinning Enhanced Visible and InfraRed Imager (SEVIRI) aboard Meteosat Second Generation (MSG) is particularly suitable for near-real-time monitoring because it provides basin-scale coverage at high refresh rate. However, turning these observations into automatic medicane-detection and center-localization products remains challenging. The rarity of positive events limits the amount of labelled data, while reference tracks are typically derived from reanalysis rather than from the satellite signal itself.

To address this need, the paper introduces DeMeTrA (Deep Learning Medicane Tracking Algorithm), a framework aimed at medicane detection and rotation-center localization from SEVIRI Airmass RGB composite image sequences. Before detailing the methodology, the relevant scientific literature is introduced below.

One promising way to exploit these high-frequency satellite observations is through machine-learning methods, especially deep-learning models based on multi-layer neural networks, which are well suited to the high-dimensional spatial and temporal structure of image sequences. The literature most relevant to this study spans non-learning, machine-learning, and end-to-end deep-learning approaches.

Before the adoption of learning-based approaches, tropical-cyclone analysis from satellite observations largely relied on manual interpretation methods, objective diagnostic schemes, and physically based tracking algorithms. A classical example is the Dvorak technique, which estimates tropical-cyclone intensity from satellite imagery through cloud-pattern analysis [2], while the Advanced Dvorak Technique later translated this operational logic into a more objective framework based on geostationary infrared observations [3]. Additional non-learning approaches include passive-microwave products such as MIMIC and ARCHER, the latter explicitly designed to determine the rotational center of tropical cyclones from satellite imagery in an objective way [4,5]. At the dynamical-analysis level, cyclone detection and tracking have also been addressed through algorithmic frameworks such as CycloTRACK and through intercomparison initiatives such as IMILAST, while composite reference datasets have more recently been developed to provide harmonized Mediterranean cyclone tracks [6,7,8]. For the Mediterranean basin specifically, complementary satellite-based observational studies have characterized medicanes in greater detail, from the precipitation structure of Medicane Ianos observed by GPM Core Observatory measurements [9] to passive-microwave analyses of warm-core depth, symmetry, and deep-convection signatures in individual medicanes [10]. This perspective was subsequently extended to a broader set of events over 2000–2021 through a passive-microwave-based diagnostic analysis of medicane structure and evolution [11], while additional satellite work has characterized the associated near-surface wind fields and further clarified the observational signature of these storms [12].

Machine-learning approaches occupy an intermediate position between these objective/manual methods and recent end-to-end deep-learning systems. Rather than learning directly from raw image sequences, these studies typically exploit structured predictors or lower-dimensional latent representations. For example, Olander et al. (2021) showed that machine-learning models driven by Advanced Dvorak Technique analysis parameters can improve tropical-cyclone intensity estimation while preserving operational interpretability [13]. For the Mediterranean basin, Roveri et al. (2025) proposed a Bayesian statistical-learning framework based on latent representations of wind-velocity fields to support cyclone detection and tracking [14]. These studies show that predictive skill can be extracted from compact or engineered representations, but they do not fully exploit the raw spatiotemporal information contained in satellite image sequences.

Deep-learning methods for cyclone analysis have addressed detection, classification, size estimation, intensity estimation, and track prediction, primarily for tropical cyclones and, more recently, for extratropical systems [15,16,17,18]. Much of this literature relies on convolutional neural networks, ConvLSTM architectures, or end-to-end feature learning directly from satellite imagery. These approaches are effective at capturing local spatial structure, but they are less explicit in modeling long-range dependencies across both space and time. Reviews comparing convolutional and Transformer-based vision models highlight the ability of self-attention to better exploit global context and large-scale pretraining [19].

For video analysis, this advantage is particularly relevant because temporal evolution is part of the signal rather than an auxiliary feature. VideoMAE and VideoMAE v2 extend masked autoencoding to spatiotemporal data and learn transferable video representations by reconstructing masked tubes from a sparse subset of visible tokens [20,21]. In particular, VideoMAE v2 further improves scalability through a dual-masking strategy that reduces computational cost while preserving strong downstream performance. This family of models is therefore attractive for meteorological applications, where labeled events are scarce but large archives of unlabeled geostationary imagery are available. For this reason, VideoMAE provides a strong spatiotemporal starting point for our study: it can first adapt to the dataset statistics through self-supervision and can then be fine-tuned for the downstream tasks of detection and tracking.

In contrast, the broader tropical-cyclone literature already contains several genuinely spatiotemporal deep-learning paradigms. For example, Ruttgers et al. (2019) proposed a generative adversarial network for typhoon-track prediction from past satellite imagery, showing that adding dynamical fields such as wind can substantially reduce mean track error [22]. Likewise, Dong et al. (2022) formulated 24 h track prediction as a sequence-to-sequence problem using ConvLSTM and spatial attention on GridSat-B1 infrared imagery with IBTrACS labels, predicting future position through density maps rather than direct coordinate regression [23]. More recently, Lagerquist et al. (2024/2025) introduced GeoCenter, an uncertainty-aware deep-learning approach for tropical-cyclone center fixing from high-temporal-resolution geostationary IR imagery, highlighting the growing relevance of calibrated uncertainty for operational center estimation [24]. Additional complementary evidence comes from Tong et al. (2022), who demonstrated strong cyclone-identification skill from infrared cloud imagery using deep convolutional networks [25], and from Manju M. S. et al. (2025), who combined CNN backbones with ConvLSTM to perform cyclone detection, classification, and intensity estimation from satellite image sequences [26]. Related intensity-estimation studies include Devaraj et al. (2021), who proposed a modified CNN pipeline for hurricane-intensity estimation and category-level disaster assessment from satellite-linked best-track data [27], and Maskey et al. (2020), who presented an objective deep-learning estimate of tropical-cyclone intensity from infrared imagery together with a production-oriented visualization portal for end users [28].

Although not strictly video-based, Kumler-Bonfanti et al. (2020) are also relevant because they address both tropical and extratropical cyclone detection, an important point for medicanes, which can emerge through extratropical-to-tropical-like transition [15]. From a methodological perspective, Nibali et al. (2018) introduced the DSNT layer for fully differentiable numerical coordinate regression with good spatial generalization, which is relevant when cyclone-center localization is posed as coordinate prediction [29]. Additional retrieval-focused work includes Wimmers et al. (2024), who used a multibranched U-Net to infer two-dimensional inner-core wind fields from microwave and infrared satellite imagery, and Griffin et al. (2024), who reported that CNN-based multisource models can improve current and short-term tropical-cyclone intensity-change prediction, especially when microwave imagery is included [30,31]. Finally, recent nowcasting work based on diffusion models and 3D U-Net architectures suggests that future forecasting systems may benefit from explicit video prediction of infrared brightness-temperature fields, provided that evaluation is aligned with cyclone-centric meteorological targets such as center error and structural symmetry rather than only pixelwise reconstruction [32].

Taken together, these studies show the broad value of satellite-based, machine-learning, and deep-learning approaches for cyclone monitoring, but they do not close the specific gap addressed here: the lack of a medicane-focused framework that combines self-supervised spatiotemporal adaptation on infrared sequences with downstream binary detection and center-coordinate regression. Against this background, DeMeTrA is formulated as an end-to-end workflow for medicane detection and rotation-center localization from SEVIRI Airmass RGB image sequences. Its implementation builds on VideoMAE v2 as the pretrained Transformer encoder, which is first specialized on unlabeled Mediterranean satellite clips and then fine-tuned for binary cyclone detection and coordinate regression of the cyclone center. In the resulting workflow, detection acts as a first filtering stage and tracking is applied only to clips classified as cyclonic.

The main contributions of this work are threefold. First, we construct a medicane-oriented video dataset from more than 7.5 years of MSG observations over the Mediterranean and derive labels from the best-track dataset of Mediterranean cyclones for 1979–2020 [8], complemented by manual temporal refinements to better align the annotations with visible cloud rotation. Second, we develop a three-stage training strategy that combines self-supervised spatiotemporal representation learning with downstream detection and tracking tasks. Third, we evaluate the framework on cyclone-wise train/validation/test splits, including an imbalanced test configuration that better reflects operational conditions. Our objectives are to test whether a VideoMAE-based model can effectively support medicane detection and tracking from SEVIRI IR video sequences, and to define an end-to-end methodology for dataset construction, model specialization, and event-based evaluation that is suitable for near-real-time monitoring.

The remainder of this paper is organized as follows. Section 2 describes the study data, the dataset-construction procedure, and the DeMeTrA training pipeline. Section 3 presents the experimental results for detection and tracking, while Section 4 discusses the main findings, limitations, and possible future developments.

2. Materials and Methods

2.1. Dataset

The dataset generation strategy was designed to preserve the physical content of the original SEVIRI observations while producing clip-based samples compatible with the downstream VideoMAE framework. Two data levels are distinguished throughout the workflow: a Source Dataset, consisting of preprocessed basin-scale Airmass RGB frames, and a Working Dataset, consisting of tiled and labelled video clips used for detection and tracking experiments.

2.1.1. Input Data Specifications

The primary input is the infrared SEVIRI/MSG Rapid Scan Level 1.5 data, which is suitable for monitoring rapidly evolving mesoscale weather systems because of its high refresh rate and basin-scale spatial coverage. The Rapid Scan Service provides one image every 5 min; for the non-HRV SEVIRI channels considered here, the Level 1.5 sampling distance is defined to be 3 km by 3 km at the sub-satellite point [33], with effective ground sampling increasing away from the projection centre (see Section 2.1.3 for the exact resolution over the Mediterranean domain).

The source imagery is obtained from the Google Cloud BigQuery public archive [34]; the distributed data were prepared by Open Climate Fix using dedicated Python libraries and are stored in Zarr format, in the native geostationary projection.

The labels are derived from the composite Mediterranean cyclone reference dataset of Flaounas et al. [8], which combines the outputs of 10 cyclone detection and tracking methods (CDTMs) applied hourly to ERA5 over the broader Mediterranean domain. In that framework, track points from different methods are considered similar when they occur at the same hour and lie within 300 km of each other. Minimum-overlap criteria of 6, 12, 18, and 24 h were tested to define the “similarity” between two tracks, and the 12 h criterion was selected as a suitable compromise among the thresholds. Composite track points are then defined at the average position of the contributing methods. The published database encodes this agreement through a Confidence Level CLn: the index n denotes the minimum number of CDTMs supporting the composite track point, so higher values correspond to progressively stricter consensus among methods. As a consequence, higher-confidence datasets are nested subsets of lower-confidence ones. This ranking is important for our labelling strategy because it controls the trade-off between completeness and robustness: low-confidence datasets retain a larger number of weaker or shorter-lived systems, whereas high-confidence datasets increasingly emphasize long-lived, deeper, and better-organized cyclones. This behavior is also linked to the dynamical variables used by the underlying CDTMs: across the 10 methods, cyclone centers are diagnosed from local minima of mean sea-level pressure or geopotential height, or from maxima of 850 hPa relative vorticity, while some tracking schemes further constrain the trajectories using sea-level-pressure gradients, lower- and mid-tropospheric thickness, and steering or advection by the 700–850 hPa wind fields.
Accordingly, the lowest confidence levels are more likely to include shallow lows, weak vortices, topographically perturbed disturbances, or fragmented tracks identified only by the most sensitive methods, whereas increasing CL progressively narrows the selection to cyclones with clearer and more coherent dynamical signatures, including longer lifetimes, stronger intensity contrasts between genesis, mature and decay stages, and a stronger tendency to travel over maritime areas rather than over land regions.

For Mediterranean applications, Flaounas et al. recommend confidence levels CL5–CL7 for general climatological use and CL8–CL10 for analyses focused on the most intense systems. In this work, we used a small number of the most intense cyclones, which form a subset of CL10, for reasons explained in Section 2.1.4.

2.1.2. Source Dataset

The distributed archive [34] contains calibrated channels normalized to the range $[0, 1023]$, corresponding to 10-bit values per channel. Physical Brightness Temperature (BT) values are recovered through de-normalization,

$$\mathrm{BT} = v\,(X_{\max} - X_{\min}) + X_{\min} \quad \text{(Kelvin)}, \tag{1}$$

where $v$ is the normalized value in $[0, 1]$, and $X_{\max}$ and $X_{\min}$ are channel-dependent bounds. The values adopted for the channels used in this study are reported in Table 1.

The channels are combined into an Airmass RGB composite following the EUMETSAT RGB Recipes and Best Practices [35,36]. The three components are defined as

$$\begin{aligned} R &= \mathrm{WV}_{062} - \mathrm{WV}_{073} \\ G &= \mathrm{IR}_{097} - \mathrm{IR}_{108} \\ B &= \mathrm{WV}_{062} \end{aligned} \tag{2}$$

This composite highlights air-mass contrasts, moisture structure, and upper-level dynamical features that are relevant for cyclone interpretation.
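As a minimal sketch of Equations (1) and (2), the snippet below de-normalizes a channel value and forms the three raw Airmass components. The class and function names are illustrative, and the bounds used in the usage example are hypothetical placeholders, since the actual channel bounds are those reported in Table 1. Any subsequent rescaling of the components to display ranges is not covered here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ChannelBounds:
    """Channel-dependent bounds X_min and X_max (kelvin) of Eq. (1)."""
    x_min: float
    x_max: float


def denormalize(v: float, bounds: ChannelBounds) -> float:
    """Eq. (1): map a normalized value v in [0, 1] to brightness temperature."""
    return v * (bounds.x_max - bounds.x_min) + bounds.x_min


def airmass_components(wv062: float, wv073: float,
                       ir097: float, ir108: float) -> Tuple[float, float, float]:
    """Eq. (2): raw R, G, B components of the Airmass composite, in kelvin."""
    r = wv062 - wv073
    g = ir097 - ir108
    b = wv062
    return r, g, b


# HYPOTHETICAL bounds for illustration only; real values come from Table 1.
example_bounds = ChannelBounds(x_min=200.0, x_max=260.0)
example_bt = denormalize(0.5, example_bounds)  # midpoint of the assumed range
```

The same two functions can be applied element-wise to full image arrays; scalar inputs are used here only to keep the example dependency-free.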

The study domain is a crop between $30^{\circ}$N and $48^{\circ}$N in latitude and between $-7^{\circ}$ and $46^{\circ}$ in longitude, yielding images of $1290 \times 420$ pixels, as in the three examples shown in Figure 1.

The resulting dataset, called the ’Source Dataset’, is the input for all further processing. It contains approximately 860k Airmass RGB frames, corresponding to about 600 GB of data and covering more than 7.5 years between 2010 and 2023.

2.1.3. Working Dataset

The ’Working Dataset’ is obtained by converting the Source Dataset into labelled video clips suitable for supervised learning. The basin-scale crop is subdivided into partially overlapping tiles of $224 \times 224$ pixels, with tiling offsets defined by a stride of 213 pixels along the horizontal axis x and 196 pixels along the vertical axis y, which provides limited spatial overlap while preserving Mediterranean coverage. This procedure yields 12 tiles that together cover the Mediterranean domain, as shown in Figure 2.
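The tiling geometry can be checked with a short sketch. It assumes that the first tile starts at pixel offset (0, 0) and that only tiles lying entirely inside the $1290 \times 420$ crop are kept; neither detail is stated explicitly in the text, and the function names are illustrative.

```python
from typing import List, Tuple


def tile_offsets(length: int, tile: int, stride: int) -> List[int]:
    """Offsets of all tiles of size `tile` that fit entirely within `length`."""
    return list(range(0, length - tile + 1, stride))


def tile_grid(width: int = 1290, height: int = 420, tile: int = 224,
              stride_x: int = 213, stride_y: int = 196) -> List[Tuple[int, int]]:
    """Top-left (x, y) offsets of every tile, listed row by row."""
    xs = tile_offsets(width, tile, stride_x)
    ys = tile_offsets(height, tile, stride_y)
    return [(x, y) for y in ys for x in xs]


grid = tile_grid()
print(len(grid))  # 6 columns x 2 rows = 12 tiles
```

Under these assumptions, adjacent tiles overlap by 11 pixels horizontally and 28 pixels vertically, and the last column of tiles ends at x = 1289, one pixel short of the crop width.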

Using the selected crop, the domain spans $\Delta\mathrm{lat} = 18^{\circ}$ and $\Delta\mathrm{lon} = 53^{\circ}$, corresponding to an approximate meridional extension of 2004 km and a zonal extension of 4585 km (since $1^{\circ}$ of latitude $= 2\pi R/360 \approx 111.32$ km and $1^{\circ}$ of longitude $\approx 111.32 \cos(39^{\circ}) = 86.5$ km, where $39^{\circ}$ is the mean latitude). This yields mean spatial samplings of about 4.77 km per pixel in latitude (effective range 3.77–7.09 km per pixel) and 3.55 km per pixel in longitude (effective range 3.07–4.67 km per pixel). Each tile therefore covers approximately 1069 km in the north–south direction (range 904–1284 km) and 796 km in the east–west direction (range 687–923 km).
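The mean figures quoted above follow from simple arithmetic, reproduced in the sketch below. The Earth radius of 6378.137 km is an assumption chosen because it matches the 111.32 km per degree used in the text. The effective ranges depend on the geostationary projection and are not reproduced here.

```python
import math

EARTH_RADIUS_KM = 6378.137  # assumption: gives about 111.32 km per degree


def km_per_degree_lat() -> float:
    """Length of one degree of latitude on a sphere of the assumed radius."""
    return 2.0 * math.pi * EARTH_RADIUS_KM / 360.0


def km_per_degree_lon(lat_deg: float) -> float:
    """Length of one degree of longitude at the given latitude."""
    return km_per_degree_lat() * math.cos(math.radians(lat_deg))


def domain_geometry(lat_min=30.0, lat_max=48.0, lon_min=-7.0, lon_max=46.0,
                    width_px=1290, height_px=420, tile_px=224) -> dict:
    """Mean extents and samplings of the crop, evaluated at the mean latitude."""
    mean_lat = 0.5 * (lat_min + lat_max)
    ns_km = (lat_max - lat_min) * km_per_degree_lat()
    ew_km = (lon_max - lon_min) * km_per_degree_lon(mean_lat)
    km_px_ns = ns_km / height_px
    km_px_ew = ew_km / width_px
    return {
        "ns_km": ns_km,
        "ew_km": ew_km,
        "km_per_px_ns": km_px_ns,
        "km_per_px_ew": km_px_ew,
        "tile_ns_km": tile_px * km_px_ns,
        "tile_ew_km": tile_px * km_px_ew,
    }


geometry = domain_geometry()
```

Rounded, the returned values agree with the extents, mean samplings, and mean tile sizes stated in the paragraph.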

Tiles extracted at successive times are stacked into 16-frame clips sampled at the original 5 min rate, so that each sample spans 80 min. Figure 3 illustrates an example of such a video tile. One clip is generated every hour, resulting in a 20 min temporal overlap between consecutive clips, and the final frame of each clip is always aligned with an integer hour in order to match the timing of the cyclone-track labels. This representation is consistent both with the temporal availability of the label source and with the spatial and temporal input constraints of the pretrained VideoMAE backbone.
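The clip timing can be illustrated with a small sketch. It assumes that each frame is counted as a 5 min slot, so that 16 frames correspond to 80 min while the first and last timestamps are 75 min apart; the function names and the example date are arbitrary.

```python
from datetime import datetime, timedelta
from typing import List

FRAME_STEP = timedelta(minutes=5)
FRAMES_PER_CLIP = 16


def clip_frame_times(end_hour: datetime) -> List[datetime]:
    """Timestamps of the 16 frames of the clip whose last frame is `end_hour`."""
    if end_hour.minute or end_hour.second or end_hour.microsecond:
        raise ValueError("the final frame must fall on an integer hour")
    return [end_hour - (FRAMES_PER_CLIP - 1 - i) * FRAME_STEP
            for i in range(FRAMES_PER_CLIP)]


def shared_frames(end_a: datetime, end_b: datetime) -> List[datetime]:
    """Frames that belong to both clips, in chronological order."""
    return sorted(set(clip_frame_times(end_a)) & set(clip_frame_times(end_b)))


end = datetime(2020, 1, 1, 12)  # arbitrary example hour
overlap = shared_frames(end, end + timedelta(hours=1))
print(len(overlap))  # 4 shared frames, i.e. 20 min at 5 min per frame
```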

2.1.4. Dataset Labelling

The dataset labelling process is based on the reference track database [8], which provides hourly cyclone-center positions based on several physical variables and methods, as described in Section 2.1.1. We take these tracks as ground-truth cyclone positions, choosing high-confidence CL7–CL10 tracks, from which a restricted number of cyclones are selected to obtain those most similar to medicanes. A trade-off is needed between enlarging the labelled dataset and excluding more heterogeneous extra-tropical cyclones, which have greater variability and a less precisely defined center, leading to a noisy target signal. For this reason, 18 cyclones were selected, including all available medicanes plus a number of visually inspected extra-tropical cyclones (Table 2). For the most recent medicanes not covered by Flaounas et al. [8], namely Apollo, Blas, Juliette, and Daniel, the reference labels are provided by ERA5 as the location of minimum mean sea-level pressure [37].

Cyclone-center coordinates are transformed from geographic space to image space using the georeferenced latitude/longitude grid associated with the $1290 \times 420$ px crop; the y axis is inverted to preserve consistency between the geographic and image coordinate systems. A tile is labelled as positive if the projected cyclone center lies within the tile in at least 6 of the 16 frames; otherwise it is labelled as negative. This criterion reduces sensitivity to short-lived edge crossings and assigns the positive class only to clips with sustained cyclone presence. Negative clips are usually completely clear-sky or contain cloud systems not associated with cyclone structure. They can also come from tiles adjacent to positive tiles and thus contain spiral rain bands with peripheral rotation, but they never contain the rotation center.
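The labelling rule can be sketched as follows. The snippet takes the per-frame projected centers as given, since the text does not specify how they are obtained from the hourly tracks, and it treats tile bounds as half-open intervals, which is also an assumption. Names are illustrative.

```python
from typing import Optional, Sequence, Tuple

# Projected center (x, y) in crop pixels, or None when no center is available.
Center = Optional[Tuple[float, float]]

TILE_PX = 224
MIN_POSITIVE_FRAMES = 6  # "at least 6 of the 16 frames"


def center_in_tile(center: Center, tile_x: int, tile_y: int,
                   tile: int = TILE_PX) -> bool:
    """True if the center lies inside the tile with top-left (tile_x, tile_y)."""
    if center is None:
        return False
    x, y = center
    return tile_x <= x < tile_x + tile and tile_y <= y < tile_y + tile


def label_clip(centers: Sequence[Center], tile_x: int, tile_y: int) -> int:
    """Return 1 (positive) if the center is inside the tile in >= 6 frames."""
    hits = sum(center_in_tile(c, tile_x, tile_y) for c in centers)
    return int(hits >= MIN_POSITIVE_FRAMES)


# Example: the center is visible inside tile (0, 0) in exactly 6 of 16 frames.
example_centers = [(100.0, 100.0)] * 6 + [None] * 10
example_label = label_clip(example_centers, tile_x=0, tile_y=0)
```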

For each frame, a master dataframe records the image path, tile offset, projected cyclone-center coordinates when available, and the corresponding label. These frame-wise records are grouped by tile position and segmented into contiguous temporal sequences, which define the video clips, and are then exported as train/validation/test CSV manifests.

Detection datasets are constructed with optional class balancing and are always split by cyclone ID in order to avoid event leakage across subsets. For the tracking dataset, only positive clips are retained, since cyclone-center coordinates are undefined for negative samples. Each clip in the tracking dataset is labelled with the cyclone-center position in its final frame.
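A split by cyclone ID can be sketched as below. The clip identifiers, cyclone identifiers, and subset names are hypothetical; the actual assignment of the 18 cyclones is the one listed in Table 2.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_by_cyclone(clips: Iterable[Tuple[str, str]],
                     assignment: Dict[str, str]) -> Dict[str, List[str]]:
    """Group clip IDs by subset, sending every clip of a cyclone to one subset.

    clips: (clip_id, cyclone_id) pairs.
    assignment: cyclone_id -> subset name, e.g. "train", "val", or "test".
    """
    subsets: Dict[str, List[str]] = defaultdict(list)
    for clip_id, cyclone_id in clips:
        subsets[assignment[cyclone_id]].append(clip_id)
    return dict(subsets)


# Hypothetical identifiers, for illustration only.
example_clips = [("c1", "A"), ("c2", "A"), ("c3", "B"), ("c4", "C")]
example_assignment = {"A": "train", "B": "val", "C": "test"}
example_split = split_by_cyclone(example_clips, example_assignment)
```

Because the subset is looked up from the cyclone ID rather than drawn per clip, clips from the same event cannot end up in different subsets.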

The combination of tiled video samples, projected labels, and manifest files constitutes the Working Dataset used in the training processes described below.

Because the reanalysis-based track location in space and time does not always coincide with clearly visible cyclone presence in the satellite imagery, the cyclone time intervals were further refined by manual visual inspection of all Working Dataset samples. These revised windows are stored in dedicated support files and are reported in Table 3. Their purpose is to reduce the mismatch between the reference track and the infrared signature observed in the Airmass RGB sequences. Model training improved markedly after this time-window selection, which suggests that visible cloud rotation is essential for a good representation of cyclones. This is further explained in Section 2.2.3 and investigated in the discussion.

2.2. Model Architecture and Training

2.2.1. Model Overview

This section provides an architectural overview of DeMeTrA together with the training protocol adopted for medicane detection and tracking. VideoMAE v2 is used as a generic spatiotemporal backbone, that is, as a shared feature extractor that learns to represent how cloud patterns evolve in space and time. It is adapted to SEVIRI Airmass RGB sequences through domain-specific post-pretraining. This choice is motivated by the limited number of labelled medicane events, which makes it advantageous to separate representation learning from downstream supervision.

2.2.2. Three-Stage Training Strategy

Training follows a staged transfer-learning scheme with three tasks, designed to start from general visual understanding and proceed to specific pattern detection. In the first stage, the model is specialized on unlabeled satellite image sequences through masked autoencoding. A large portion of the image content (about 80%) is hidden by masking patches, and the network is asked to reconstruct the missing content from the visible context. In practical terms, this stage encourages the model to learn recurring cloud structures, spatial organization, and temporal evolution directly from the large data archive, without requiring manual annotations. The output of this phase is not yet a cyclone detector, but a latent representation that encodes useful information about the appearance and dynamics of Mediterranean cloud systems.
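To make the masking step concrete, the sketch below builds a tube mask in which the same spatial patches are hidden in every temporal slice. The $14 \times 14$ patch grid (224 px tiles with 16 px patches) and the 8 temporal slices (16 frames grouped in pairs) are assumptions based on the common VideoMAE configuration and are not stated in the text. The sketch covers only the encoder-side masking and not the reconstruction itself.

```python
import random
from typing import List


def tube_mask(grid_h: int = 14, grid_w: int = 14, n_time: int = 8,
              mask_ratio: float = 0.8, seed: int = 0) -> List[List[bool]]:
    """Return mask[t][i], where True marks a hidden patch.

    The spatial pattern is sampled once and repeated for every temporal
    slice, so each hidden patch forms a "tube" through time.
    """
    rng = random.Random(seed)
    n_spatial = grid_h * grid_w
    n_hidden = round(mask_ratio * n_spatial)
    hidden = set(rng.sample(range(n_spatial), n_hidden))
    spatial = [i in hidden for i in range(n_spatial)]
    return [list(spatial) for _ in range(n_time)]


mask = tube_mask()
visible_per_slice = mask[0].count(False)  # patches the encoder actually sees
```

With these assumed sizes, 157 of the 196 patches are hidden in each slice, so the encoder processes only the remaining 39 patches per slice.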

In the second stage, this learned representation (i.e., the trained backbone) is fine-tuned with a classification head for clip-level binary detection, that is, deciding whether a video clip contains a medicane signature. In the third stage, the same backbone is coupled with a regression head that predicts the continuous coordinates of the cyclone center associated with the final frame of each clip.

This formulation separates the problem into two sequential objectives: cyclone-presence identification and cyclone-center localization. This decomposition reflects the operational logic of the problem and also simplifies the learning task, because center coordinates are meaningful only for positive clips. Therefore, detection acts as an event filter, whereas tracking is performed only on clips identified as cyclonic. The main hyperparameters used in the three stages are summarized in the appendix.

2.2.3. Training Stages

First self-supervised stage

The first, self-supervised ’specialization’ stage uses a large dataset of approximately 80,000 unlabeled clips for training and 20,000 clips for held-out validation monitoring. Starting from the publicly available pretrained VideoMAE v2 giant checkpoint, the model is further trained for 150 epochs. This stage, also called adaptation training or post-pretraining, is intended to adapt the model's latent representation to the dynamical characteristics of the target domain, the Mediterranean Airmass RGB sequences, before fine-tuning on scarce medicane labels. The corresponding training and validation loss curves are shown in Figure 4 and indicate that the model adapted to the target domain without overfitting. This is the most computationally demanding part of the training, since it takes days on an HPC cluster using 16 GPUs [38].

Second stage: detection fine-tuning

The supervised detection stage is formulated as clip-level event recognition. Positive samples are extracted from the refined time windows described in Table 3, whereas negative samples correspond to clips temporally outside those time windows, or clips inside those time windows but spatially distant and not showing cyclone presence.

This narrower temporal selection is necessary because the reference tracks are derived from reanalysis and do not always coincide with a clear cyclone signature in the infrared imagery. In practice, the observed dataset contains frequent cases in which the available labels indicate cyclone presence while the corresponding frames show little or no cloud coverage, weak organization, or no clearly identifiable cyclonic rotation. Retaining such samples as positive supervision makes the target signal substantially noisier and exposes the model to inconsistent visual evidence, which has a negative impact on learning.

For this reason, the cyclone temporal boundaries were refined through visual inspection so that positive clips are restricted to periods in which the infrared cloud structures clearly exhibit cyclonic rotation, resulting in the narrower time intervals shown in Table 3.

To assess generalization across distinct events rather than across temporally adjacent clips, the train/validation/test split is constructed by assigning all clips extracted from a given medicane to a single subset, rather than distributing them across training, validation, and test sets. This choice prevents the model from being evaluated on frames that are very similar in space, time, and cloud organization to samples already seen during training, which would lead to an overly optimistic estimate of performance. Therefore, the adopted split measures the ability of the model to transfer what it has learned from the training events to cyclones contained in the validation and test sets. The 18 selected case studies are listed with their dataset split in Table 2.

The clip-level partition of the detection dataset is summarized in Table 4. The training and validation subsets are balanced, whereas the test subset is intentionally imbalanced and includes 201 positive and 2199 negative clips. The additional negative test samples are drawn from the same temporal windows as the positive cases, but from tiles outside the cyclone’s spatial extent. This provides a testing environment that more closely resembles operational conditions.
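The effect of this imbalance on scalar scores can be illustrated with a short sketch. It considers a hypothetical classifier that labels every test clip as negative, which is not a result from this study; the function names are illustrative, and the counts follow the usual contingency-table convention of hits, misses, false alarms, and correct negatives.

```python
def accuracy(hits: int, misses: int, false_alarms: int,
             correct_negatives: int) -> float:
    """Fraction of all clips that are classified correctly."""
    total = hits + misses + false_alarms + correct_negatives
    return (hits + correct_negatives) / total


def balanced_accuracy(hits: int, misses: int, false_alarms: int,
                      correct_negatives: int) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    tpr = hits / (hits + misses)
    tnr = correct_negatives / (correct_negatives + false_alarms)
    return 0.5 * (tpr + tnr)


# Hypothetical "always negative" classifier on 201 positive, 2199 negative clips.
trivial = dict(hits=0, misses=201, false_alarms=0, correct_negatives=2199)
trivial_accuracy = accuracy(**trivial)            # about 0.916
trivial_balanced = balanced_accuracy(**trivial)   # 0.5
```

Such a classifier would reach about 92% plain accuracy while detecting no cyclone at all, whereas its balanced accuracy stays at 50%, which is why balanced accuracy is the more informative score on the imbalanced test set.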

The training dynamics of the detection fine-tuning are summarized in Figure 5, which reports the training and validation loss together with the balanced validation accuracy, with the learning-rate schedule overlaid. Low loss values and high balanced accuracy are reached after nearly 200 training epochs, yielding a model checkpoint that generalizes well to unseen validation clips.

Third stage: regression fine-tuning

The third stage addresses cyclone-center localization as a direct coordinate-regression problem. Since a center position is defined only when a cyclone is present, the tracking dataset is built exclusively from positive clips and preserves the same event-wise split used for detection. The resulting partition contains 835 training clips from 12 cyclones, 192 validation clips from 3 cyclones, and 161 test clips from 3 cyclones. The regression target is the cyclone-center position in the final frame of each clip, which is the frame temporally aligned with the hourly reference track. Mean squared error on the center coordinates is minimized during training. In this way, the model is trained to infer the cyclone-center position at the end time of the input video sequence. Figure 6 shows the final frames of some of these video samples, together with their target locations.
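The regression objective and a possible conversion of the error to kilometres can be sketched as follows. The sketch assumes that coordinates are expressed in tile pixels and uses the mean samplings of Section 2.1.3 (3.55 km per pixel in x, 4.77 km per pixel in y) as conversion factors; whether the actual implementation normalizes coordinates or uses a more exact georeferenced conversion is not stated in the text.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

KM_PER_PX_X = 3.55  # assumption: mean east-west sampling from Section 2.1.3
KM_PER_PX_Y = 4.77  # assumption: mean north-south sampling from Section 2.1.3


def mse(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean squared error over all x and y coordinates of all samples."""
    total = 0.0
    count = 0
    for (px, py), (tx, ty) in zip(pred, target):
        total += (px - tx) ** 2 + (py - ty) ** 2
        count += 2
    return total / count


def error_km(pred: Point, target: Point) -> float:
    """Approximate center-localization error in kilometres."""
    dx_km = (pred[0] - target[0]) * KM_PER_PX_X
    dy_km = (pred[1] - target[1]) * KM_PER_PX_Y
    return math.hypot(dx_km, dy_km)
```

Under these assumed conversion factors, an error of 20 km corresponds to a displacement of roughly four to six pixels, depending on its direction.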

The training dynamics of the tracking fine-tuning are shown in Figure 7, which reports the training and validation loss curves for the regression stage and again indicates that good generalization is achieved before 200 epochs.

At inference time, the two trained model stages are applied sequentially. The detection model first identifies candidate cyclonic clips, then the regression model is applied only to those clips classified as positive.
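The sequential use of the two stages can be sketched as follows. The names `detect_then_track`, `detector`, and `regressor`, as well as the 0.5 decision threshold, are hypothetical placeholders and not the interface of the DeMeTrA code; the sketch only illustrates that regression is applied to positively classified clips alone.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Clip = Sequence  # placeholder type for one video clip (e.g. a stack of frames)
Center = Tuple[float, float]  # (x, y) position in the last frame of the clip


def detect_then_track(
    clips: List[Clip],
    detector: Callable[[Clip], float],
    regressor: Callable[[Clip], Center],
    threshold: float = 0.5,  # assumed decision threshold on the cyclone score
) -> List[Optional[Center]]:
    """Run the regressor only on clips the detector classifies as positive."""
    centers: List[Optional[Center]] = []
    for clip in clips:
        if detector(clip) >= threshold:
            centers.append(regressor(clip))
        else:
            centers.append(None)  # no cyclone detected, so no center is estimated
    return centers
```

Returning `None` for negative clips keeps the output aligned with the input list, so each clip can be matched to its result without extra bookkeeping.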

The code implementation of the workflow described above is available in the DeMeTra repository.

3. Results and Discussion

3.1. Validation Workflow

Quantitative verification is performed on two complementary evaluation datasets, the event-wise validation and test sets listed in Table 2, using the refined cyclone-presence windows reported in Table 3 as the reference for clearly visible cyclone presence. For cyclone detection, the first is a balanced validation set, which measures the intrinsic discriminative ability of the classifier under controlled class prevalence. The second is an imbalanced test set, in which non-cyclone clips strongly outnumber positive samples and which therefore provides a closer approximation to the expected operational setting. For cyclone-center tracking, the same event-wise split is used; class balance is not an issue because both evaluation datasets contain only positive samples.

Full-year datasets were also considered for detection evaluation; however, their broader temporal coverage includes extratropical cyclones with morphology and dynamics that differ markedly from those of medicanes, making robust identification more difficult [1,8].

Quantitative evaluation is complemented by qualitative inspection of track overlays, tile-level visualizations, and Mediterranean-scale rendered video sequences. These diagnostics are useful for identifying systematic failure modes that are not fully captured by scalar scores alone, especially in a setting characterized by a limited number of positive events and by substantial morphological variability across medicanes.

The following sections present the classification and regression results.

3.2. Classification Metrics and Detection Results

Because cyclone detection is a rare-event problem, plain accuracy alone is not sufficiently informative. We therefore evaluate the classifier through the confusion matrix and a set of metrics commonly used in meteorological event verification, which are more appropriate under strong class imbalance. In the following formulas, H denotes hits, M misses, F false alarms, and C correct negatives, as summarized in Table 5.

A good classification model should therefore show few off-diagonal counts. The adopted metrics are:

POD = \frac{H}{H + M}

(3)

FAR = \frac{F}{H + F}

(4)

CSI = \frac{H}{H + F + M}

(5)

HSS = \frac{2 (H C - M F)}{(H + M) (M + C) + (H + F) (F + C)}

(6)

BA = 0.5 [\frac{H}{H + M} + \frac{C}{C + F}]

(7)

POD is also commonly called recall, while precision, the other quantity commonly used in classification tasks, corresponds here to $1 - FAR$. Taken together, these measures quantify event-detection capability, the false-alarm ratio, and performance relative to a random baseline model.
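The scores in Equations (3)–(7) can be computed directly from the four counts of Table 5. The following sketch does so; the function name and the example counts are hypothetical and are not results of this study.

```python
def verification_scores(h: int, m: int, f: int, c: int) -> dict:
    """Compute POD, FAR, CSI, HSS and BA from confusion-matrix counts.

    h = hits, m = misses, f = false alarms, c = correct negatives (Table 5).
    Assumes every denominator is non-zero.
    """
    pod = h / (h + m)                      # Eq. (3)
    far = f / (h + f)                      # Eq. (4)
    csi = h / (h + f + m)                  # Eq. (5)
    hss = 2 * (h * c - m * f) / (
        (h + m) * (m + c) + (h + f) * (f + c)
    )                                      # Eq. (6)
    ba = 0.5 * (h / (h + m) + c / (c + f)) # Eq. (7)
    return {"POD": pod, "FAR": far, "CSI": csi, "HSS": hss, "BA": ba}


# Hypothetical counts chosen only to illustrate the arithmetic
scores = verification_scores(h=90, m=10, f=30, c=870)
precision = 1.0 - scores["FAR"]
```

With strongly imbalanced counts such as these, BA stays high as long as both the hit rate and the correct-negative rate are high, whereas FAR and CSI react directly to the number of false alarms relative to hits.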

The best model reaches a balanced accuracy of 91% on the balanced validation set and 89% on the imbalanced test set; the corresponding confusion matrices are shown in Figure 8 and Figure 9, respectively.

Interpretation

These results suggest good overall performance, driven by the model’s high sensitivity to cyclone occurrence (very few misses) under both balanced and operationally imbalanced conditions. However, a large false alarm ratio is evident. This behavior suggests that the classifier is currently more effective at preserving cyclone detections than at suppressing difficult negatives. A likely contributing factor is the present tile-management strategy, which can generate ambiguous samples near the cyclone boundaries or in surrounding cloud structures that appear visually similar to true positives. This issue becomes even more pronounced when the cyclone moves across adjacent tiles: the same cyclonic structure may be labeled as negative while its center lies just outside a given video tile, and then be labeled as positive as soon as the center enters that tile. This introduces a strong discrete transition in the target labels over a visual signal that is otherwise highly similar and continuous.
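The label discontinuity described above can be illustrated with a toy example. It assumes the labeling rule of Figure 2 (a tile is positive when it contains the cyclone center); the tile bounds, the pixel coordinates, and the half-open boundary convention are hypothetical.

```python
def tile_label(center_xy, tile_bounds):
    """Return 1 if the cyclone center lies inside the tile, else 0.

    tile_bounds = (x_min, y_min, x_max, y_max) in pixel coordinates;
    the upper bounds are treated as exclusive (an assumption).
    """
    x, y = center_xy
    x_min, y_min, x_max, y_max = tile_bounds
    return int(x_min <= x < x_max and y_min <= y < y_max)


tile = (0, 0, 224, 224)  # hypothetical tile in pixel coordinates
# The center drifts by 2 pixels per step across the right edge of the tile
labels = [tile_label((x, 100), tile) for x in (221, 223, 225, 227)]
```

The label flips between the second and third positions although the center has moved by only two pixels, so the image content seen by the classifier is nearly unchanged across the transition.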

The interpretation of the scores must nevertheless account for residual label uncertainty. In a subset of clips, the reference rotational-center position derived from reanalysis does not coincide with a clearly identifiable rotating cloud structure in the infrared imagery. This situation occurs most often during genesis and decay, or when cloud coverage is weak and cyclone morphology is ambiguous. Such samples are intrinsically difficult targets for both detection and tracking and contribute to the observed error tail. Although many were removed through the careful selection of the temporal windows reported in Table 3, a potentially non-negligible number of borderline cases remains, reflecting an intrinsic ambiguity that a robust model should be exposed to during training and learn to handle. Restricting these temporal intervals was nevertheless essential to achieving the present results: without this dataset-cleaning step, performance was markedly lower and overall unsatisfactory.

3.3. Tracking Results

The metric used for the tracking task is the localization error, which is originally expressed in pixels and then converted to kilometers. Figure 10 shows the error distributions, which are used to assess the precision of the model on each split. The error is the geodesic distance d, obtained by converting pixel coordinates to latitude/longitude using the georeferenced grid and applying the haversine formula given in Equations (8) and (9), where θ denotes latitude and ϕ longitude:

a = \sin^{2} \left( \frac{\theta_{2} - \theta_{1}}{2} \right) + \cos (\theta_{1}) \cos (\theta_{2}) \sin^{2} \left( \frac{\phi_{2} - \phi_{1}}{2} \right)

(8)

d = 2 R \arcsin \left( \sqrt{a} \right), \quad \text{where } R = 6371.088 \ \text{km}

(9)
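Equations (8) and (9) translate directly into a few lines of code. The sketch below is a minimal version that assumes θ is latitude and ϕ is longitude, both supplied in degrees and converted to radians internally; the function name and the clamping of a to 1 are choices made here for illustration and are not taken from the DeMeTrA code.

```python
import math

EARTH_RADIUS_KM = 6371.088  # value of R given in Eq. (9)


def haversine_km(lat1_deg, lon1_deg, lat2_deg, lon2_deg):
    """Great-circle distance in km between two points given in degrees."""
    theta1 = math.radians(lat1_deg)
    theta2 = math.radians(lat2_deg)
    dphi = math.radians(lon2_deg - lon1_deg)
    # Eq. (8)
    a = (math.sin((theta2 - theta1) / 2) ** 2
         + math.cos(theta1) * math.cos(theta2) * math.sin(dphi / 2) ** 2)
    # Eq. (9); clamp a to 1 to guard against floating-point rounding
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# One degree of latitude along a meridian (hypothetical points)
one_degree_km = haversine_km(35.0, 18.0, 36.0, 18.0)
```

Along a meridian the formula reduces to R times the latitude difference in radians, which gives a quick sanity check on any implementation.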

The three histograms in Figure 10 correspond to the training, validation, and test splits, respectively, as described above.

Across the three splits, the distributions remain concentrated at short distances. As expected, the validation and test sets exhibit slightly higher mean errors than the training set, together with broader upper tails, reflecting the greater difficulty of out-of-sample generalization.

3.4. Interpretation

These tracking errors should be interpreted in light of the accuracy of the reference labels themselves. The composite tracks are derived from ERA5 fields at 0.25° × 0.25° resolution and from an averaging procedure across multiple tracking methods. Their positional accuracy therefore reflects not only the native grid spacing, but also inter-method discrepancies and the smoothing introduced by the averaging process [8]. The resulting coordinates should thus be regarded as partially noisy labels rather than as exact point-wise ground truth, with an effective uncertainty plausibly on the order of tens of kilometers. An additional source of mismatch may arise from the vertical structure of the cyclone itself: the center inferred from high-cloud satellite signatures does not necessarily coincide exactly with the near-surface circulation center represented by the reference tracks. This vertical misalignment is expected to be larger during the development stage, when the system is typically less vertically aligned, and smaller during the mature, more tropical-like phase, when the vortex tends to become more vertically coherent. Within this context, a median localization error of approximately 12 km remains comfortably within the effective precision of the composite-track reference, supporting the interpretation that the predicted cyclone centers are consistent with the positional accuracy available in the target dataset.

4. Conclusions and Future Developments

Results presented in this study show that DeMeTrA provides a coherent end-to-end framework for medicane monitoring, from SEVIRI Airmass RGB preprocessing and event-oriented dataset construction to self-supervised VideoMAE specialization, binary detection, and coordinate-based tracking. The detection scores obtained on cyclone-wise splits, together with the generally low tracking errors for the majority of samples, indicate that the proposed approach captures relevant spatiotemporal signatures of medicanes rather than merely memorizing case-specific patterns. This point is particularly important because the evaluation includes an imbalanced test setting that is closer to the class distribution expected in practical monitoring applications.

Overall, the current results can be regarded as scientifically robust and operationally encouraging. In particular, the combination of high sensitivity in detection and low typical localization error in tracking suggests that the model is a promising candidate for production-oriented or pre-operational monitoring chains, where the primary requirement is timely identification of potentially dangerous cyclonic structures together with an approximate estimate of their rotational center. Although further validation is still necessary before any fully operational deployment, the present evidence supports the feasibility of using Transformer-based video models for near-real-time medicane surveillance.

The main weakness highlighted by the present study is the relatively large number of false alarms. One of the most promising directions for improvement is a revision of the fixed-tile handling, which could make the negative class more consistent and thereby reduce false alarms without sacrificing sensitivity.

A second limitation is the intrinsically small number of positive medicane events available for supervised learning. Even though the self-supervised specialization stage helps compensate for label scarcity, the diversity of medicane morphology represented in the dataset remains limited. Future work should therefore expand the event library, test the framework on longer full-year datasets, and further assess its behavior in the presence of non-medicane events. In summary, the present study establishes a solid methodological baseline and provides encouraging evidence that DeMeTrA can evolve from a research prototype into a production-oriented medicane monitoring tool.

Acknowledgments

The present work is funded by the European Space Agency’s (ESA) Medicanes project (https://medicanes.isac.cnr.it, ESA Contract No. 4000144111/23/I-KE). We acknowledge ISCRA for awarding this project access to the LEONARDO supercomputer, owned by the EuroHPC Joint Undertaking and hosted by CINECA (Italy).

Appendix A

Appendix A.1. Why Transformers Are Suitable for This Task

The use of a Transformer backbone is motivated by the properties of self-attention. Unlike purely local filtering operations, self-attention allows each image or video token to interact directly with all others, facilitating the modeling of long-range spatial and temporal dependencies. This capability is particularly relevant for satellite video analysis, where cyclone signatures depend on both the large-scale organization of cloud structures and their temporal evolution. A concise overview of these advantages is provided by Mauricio et al. [19].

Within this family, VideoMAE is particularly attractive because it combines global spatiotemporal self-attention with self-supervised masked reconstruction, making it well suited to settings in which labelled samples are scarce but large archives of unlabeled satellite imagery are available [20,21].

Appendix A.2. VideoMAE Input Configuration and Architecture

DeMeTrA builds on VideoMAE v2, an open-source self-supervised video Transformer pretrained on large-scale video corpora with an autoencoding objective [20,21]. In its original pretraining setup, the model processes 16-frame clips at a spatial resolution of $224 \times 224$ pixels, and each frame is partitioned into non-overlapping $14 \times 14$ pixel patches. To reuse the pretrained weights without modifying the tokenization scheme, the satellite clips constructed for this study were designed to match these input specifications.

VideoMAE converts the input sequence into tokens by cube embedding, using spatiotemporal cubes of size $2 \times 14 \times 14$, corresponding to two consecutive frames and a spatial extent of $14 \times 14$ pixels [20,21]. The model follows an asymmetric encoder–decoder design. The encoder is a Vision Transformer that operates jointly over space and time and, under the masking strategy adopted here, processes only about 10% of the original tokens. The decoder is intentionally lightweight and reconstructs the full token set with reduced depth and width, thereby lowering the computational burden.
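The token count implied by this input configuration follows from simple arithmetic, sketched below. The function names are hypothetical, and the masking ratio of 0.9 is used purely to illustrate an encoder that sees about 10% of the tokens; the ratio is left as a parameter.

```python
def count_tokens(frames=16, height=224, width=224, tubelet=2, patch=14):
    """Number of spatiotemporal tokens produced by cube embedding."""
    assert frames % tubelet == 0
    assert height % patch == 0 and width % patch == 0
    return (frames // tubelet) * (height // patch) * (width // patch)


def visible_tokens(total, mask_ratio):
    """Tokens left visible to the encoder for a given masking ratio."""
    return total - int(mask_ratio * total)


total = count_tokens()               # 8 * 16 * 16 = 2048 tokens per clip
visible = visible_tokens(total, 0.9) # illustrative ratio, see lead-in
```

Since the cost of self-attention grows quadratically with the number of tokens, restricting the encoder to the visible subset is what makes pretraining on clips of this size tractable.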

VideoMAE v2 combines this architecture with tube masking, in which contiguous spatiotemporal patches are masked along the temporal dimension [21]. Tube masking is particularly important in the video setting because masking individual patches would make the reconstruction task relatively easy owing to temporal redundancy. By masking entire spatiotemporal tubes, the model is forced to infer coherent motion and cloud-evolution patterns. VideoMAE v2 further introduces a dual-masking strategy in which the encoder receives only a small visible subset of the input sequence, while the decoder reconstructs only part of the masked cubes [21]. This design reduces memory usage and training time, supports scaling to very large models, and preserves reconstruction quality. In DeMeTrA, the pretrained backbone is subsequently specialized on Airmass RGB sequences, a much narrower domain than the generic videos used during pretraining.
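A simplified sketch of tube masking is given below. It assumes a token grid of 8 × 16 × 16 (16 frames at 224 × 224 pixels with 2 × 14 × 14 cubes) and uses the `mask_ratio` of 0.75 listed in Table A1; the function name and the pure-Python implementation are illustrative and do not reproduce the VideoMAE v2 code.

```python
import random


def tube_mask(t_tokens=8, h_tokens=16, w_tokens=16, mask_ratio=0.75, seed=None):
    """Return a [t][h*w] grid of booleans; True marks a masked token.

    One spatial mask is sampled and repeated at every temporal index, so
    each masked spatial location forms a tube spanning the whole clip.
    """
    rng = random.Random(seed)
    n_spatial = h_tokens * w_tokens
    n_masked = int(mask_ratio * n_spatial)
    masked_positions = set(rng.sample(range(n_spatial), n_masked))
    frame_mask = [i in masked_positions for i in range(n_spatial)]
    return [list(frame_mask) for _ in range(t_tokens)]


mask = tube_mask(seed=0)
```

Because every temporal slice shares the same spatial mask, a masked location cannot be recovered by copying the same patch from a neighboring time step, which is the property the paragraph above relies on.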

Appendix A.3. VideoMAE Training Details

The configuration that produced the results reported in the main text relied on distributed training over 16 GPUs.

Appendix A.3.3.5. Unsupervised specialization stage.

Table A1 summarizes the main hyperparameters used during the post-pretraining specialization stage.

Table A1. Main hyperparameters used during the unsupervised specialization stage.

Parameter	Value
`mask_ratio`	0.75
`decoder_mask_ratio`	0.5
`decoder_depth`	4
`patch_size`	14
`sampling_rate`	1
`batch_size`	$6 \cdot$ num. GPU
`learning_rate`	$10^{- 3}$ ^†
`warmup_epochs`	10

^† Linearly scaled with total batch size / 256.

Training was stopped after 150 epochs, with no evident signs of overfitting.
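The footnote to Table A1 states that the learning rate is scaled linearly with the total batch size divided by 256. A minimal sketch of this rule is given below, assuming the 16 GPUs mentioned in Appendix A.3 and the Table A1 values for the base learning rate and per-GPU batch size; the function name is hypothetical.

```python
def scaled_lr(base_lr, batch_per_gpu, num_gpus, reference_batch=256):
    """Linear scaling rule: base_lr * total_batch / reference_batch."""
    total_batch = batch_per_gpu * num_gpus
    return base_lr * total_batch / reference_batch


# Table A1: base learning rate 1e-3, batch size 6 per GPU; 16 GPUs assumed
lr = scaled_lr(base_lr=1e-3, batch_per_gpu=6, num_gpus=16)
```

Under these assumptions the total batch size is 96, so the effective learning rate is the base value multiplied by 96/256.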

Appendix A.3.3.6. Classification and Regression fine-tuning stages.

Table A2 reports the main hyperparameters used for classification fine-tuning. The number of training epochs depends on the learning rate: lower learning rates gave more stable and lower losses, at the cost of a larger number of epochs.

Table A2. Main hyperparameters used during the classification fine-tuning stage.

Parameter	Value per GPU
`mask_ratio`	0.8
`nb_classes`	2
`batch_size`	2
`learning_rate`	$10^{- 6}$ ^†
`warmup_epochs`	100
`min_lr`	$5 \times 10^{- 8}$ ^†
`warmup_lr`	$10^{- 8}$ ^†

^† Linearly scaled with total batch size / 256.

All hyperparameters for the regression fine-tuning were inherited from the classification stage except for the learning-rate settings, which are:

\begin{matrix} \mathtt{learning\_rate} & = 10^{-4}, \\ \mathtt{warmup\_learning\_rate} & = 10^{-6}, \\ \mathtt{min\_learning\_rate} & = 9 \times 10^{-5}. \end{matrix}

(A1)

References

Miglietta, M.M.; Flaounas, E.; Gonzalez-Aleman, J.J.; Panegrossi, G.; et al. Defining Medicanes: Bridging the Knowledge Gap between Tropical and Extratropical Cyclones in the Mediterranean. Bull. Am. Meteorol. Soc. 2025, 106, E1955–E1971.
Dvorak, V.F. Tropical Cyclone Intensity Analysis and Forecasting from Satellite Imagery. Mon. Weather Rev. 1975, 103, 420–430.
Olander, T.L.; Velden, C.S. The Advanced Dvorak Technique: Continued Development of an Objective Scheme to Estimate Tropical Cyclone Intensity Using Geostationary Infrared Satellite Imagery. Weather Forecast. 2007, 22, 287–298.
Wimmers, A.J.; Velden, C.S. MIMIC: A New Approach to Visualizing Satellite Microwave Imagery of Tropical Cyclones. Bull. Am. Meteorol. Soc. 2007, 88, 1187–1196.
Wimmers, A.J.; Velden, C.S. Objectively Determining the Rotational Center of Tropical Cyclones in Passive Microwave Satellite Imagery. J. Appl. Meteorol. Climatol. 2010, 49, 2013–2034.
Flaounas, E.; Kotroni, V.; Lagouvardos, K.; Flaounas, I. CycloTRACK (v1.0)—Tracking Winter Extratropical Cyclones Based on Relative Vorticity: Sensitivity to Data Filtering and Other Relevant Parameters. Geosci. Model Dev. 2014, 7, 1841–1853.
Neu, U.; Akperov, M.G.; Bellenbaum, N.; Benestad, R.; Blender, R.; Caballero, R.; Cocozza, A.; Dacre, H.F.; Feng, Y.; Fraedrich, K.; et al. IMILAST: A Community Effort to Intercompare Extratropical Cyclone Detection and Tracking Algorithms. Bull. Am. Meteorol. Soc. 2013, 94, 529–547.
Flaounas, E.; et al. A composite approach to produce reference datasets for extratropical cyclone tracks: application to Mediterranean cyclones. Weather Clim. Dyn. 2023, 4, 639–661.
D’Adderio, L.P.; Prat, A.C.; Casella, D.; Sanò, P.; Panegrossi, G. GPM-CO Observations of Medicane Ianos: Comparative Analysis of Precipitation Structure between Development and Mature Phase. Atmos. Res. 2022, 277, 106174.
Panegrossi, G.; Prat, A.C.; D’Adderio, L.P.; Casella, D.; Sanò, P. Warm Core and Deep Convection in Medicanes: A Passive Microwave-Based Investigation. Remote Sens. 2023, 15, 2838.
Di Francesca, V.; D’Adderio, L.P.; Sanò, P.; Rysman, J.F.; Casella, D.; Panegrossi, G. Passive Microwave-Based Diagnostics of Medicanes over the Period 2000–2021. Atmos. Res. 2025, 321, 107922.
Sebastianelli, S.; D’Adderio, L.P.; Sanò, P.; Casella, D.; Panegrossi, G. Near-surface Wind Field Characterization of Medicanes Using Satellite Observations. Atmos. Res. 2026, 334, 108734.
Olander, T.; Wimmers, A.; Velden, C.; Kossin, J.P. Investigation of Machine Learning Using Satellite-Based Advanced Dvorak Technique Analysis Parameters to Estimate Tropical Cyclone Intensity. Weather Forecast. 2021, 36, 2161–2186.
Roveri, L.; Fery, L.; Cavicchia, L.; Grotto, F. A Statistical Learning Approach to Mediterranean Cyclones. arXiv 2025, arXiv:2501.15694.
Kumler-Bonfanti, C.; Stewart, J.; Hall, D.; Govett, M. Tropical and Extratropical Cyclone Detection Using Deep Learning. J. Appl. Meteorol. Climatol. 2020.
Shi, M.; He, P.; Shi, Y.; et al. Detecting Extratropical Cyclones of the Northern Hemisphere with Single Shot Detector. Remote Sens. 2022, 14, 254.
Xu, J.; Wang, X.; Wang, H.; Zhao, C.; Wang, H.; Zhu, J. Tropical Cyclone Size Estimation Based on Deep Learning Using Infrared and Microwave Satellite Data. Front. Mar. Sci. 2023, 9, 1077901.
Martinez-Amaya, J.; Nieves, V.; Munoz-Mari, J. A Comprehensive AI Approach for Monitoring and Forecasting Medicanes Development. Climate 2024, 12, 220.
Mauricio, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Appl. Sci. 2023, 13, 5521.
Tong, Z.; Song, Y.; Wang, J.; Wang, L. VideoMAE: Masked Autoencoders Are Data-Efficient Learners for Self-Supervised Video Pre-Training. In Proceedings of the Advances in Neural Information Processing Systems, 2022.
Wang, L.; Huang, B.; Zhao, Z.; Tong, Z.; He, Y.; Wang, Y.; Wang, Y.; Qiao, Y. VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023; pp. 14549–14560.
Ruttgers, M.; et al. Prediction of a typhoon track using a generative adversarial network and satellite images. Sci. Rep. 2019, 9, 1–15.
Dong, P.; et al. Tropical Cyclone Track Prediction with an Encoding-to-Forecasting Deep Learning Model. Weather and Forecasting, 2022.
Lagerquist, R.; et al. Center-fixing of tropical cyclones using uncertainty-aware deep learning applied to high-temporal-resolution geostationary satellite imagery. arXiv 2024, arXiv:2409.16507.
Tong, B.; et al. Identification of tropical cyclones via deep convolutional neural network based on satellite cloud images. Atmos. Meas. Tech. 2022, 15, 1829–1847.
Manju, M.S.; et al. ConvLSTM-based tropical cyclone intensity estimation and classification using satellite imagery over the North Indian Ocean. PLoS ONE 2025.
Devaraj; Ganesan, S.; Elavarasan, R.M.; Subramaniam, U. A Novel Deep Learning Based Model for Tropical Intensity Estimation and Post-Disaster Management of Hurricanes. Appl. Sci. 2021, 11, 4129.
Maskey, M.; Ramachandran, R.; Ramasubramanian, M.; Gurung, I.; et al. Deepti: Deep-Learning-Based Tropical Cyclone Intensity Estimation System. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2020.
Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. Numerical Coordinate Regression with Convolutional Neural Networks. arXiv 2018, arXiv:1801.07372.
Wimmers, A.J.; Griffin, S.; Velden, C. A U-Net Retrieval of Tropical Cyclone Inner-Core Wind Fields from Microwave and Infrared Satellite Imagery. Artificial Intelligence for the Earth Systems, 2024.
Griffin, S.; Wimmers, A.J.; Herndon, D.C.; Velden, C.S. Predicting Current and Short-Term Intensity Change in Tropical Cyclones Using a Convolutional Neural Network. 36th Conference on Hurricanes and Tropical Meteorology abstract, 2024; Available online: https://ams.confex.com/ams/36Hurricanes/webprogram/Paper441150.html.
Gorooh, V.A.; et al. Deterministic nowcasting of geostationary satellite infrared brightness temperatures with diffusion and 3D U-Net. Scientific Reports, 2026.
EUMETSAT. MSG Level 1.5 Image Data Format Description; EUM/MSG/ICD/105, v8 e-signed, 2017. Available online: https://user.eumetsat.int/s3/eup-strapi-media/pdf_ten_05105_msg_img_data_e7c8b315e6.pdf.
Google Cloud. EUMETSAT SEVIRI RSS in BigQuery Public Datasets. 2026. Available online: https://console.cloud.google.com/marketplace/product/bigquery-public-data/eumetsat-seviri-rss (accessed on 3 April 2026).
EUMETRAIN. RGB Recipes. 2020. Available online: https://eumetrain.org/sites/default/files/2020-05/RGB_recipes.pdf.
EUMETSAT. Using RGB Images and Best Practices. 2020. Available online: https://www-cdn.eumetsat.int/files/2020-04/pdf_using_rgb_best_practices.pdf.
ECMWF. ERA5 Hourly Data on Single Levels Provided by the Copernicus Climate Data Store. 2026. Available online: https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels (accessed on 4 May 2026).
CINECA Supercomputing Centre; SuperComputing Applications and Innovation Department. LEONARDO: A Pan-European Pre-Exascale Supercomputer for HPC and AI applications. J. Large-Scale Res. Facil. 2024, 8, A186.

Figure 1. Examples of Airmass RGB images covering the Mediterranean basin at different times: (a) Extra-tropical cyclone; (b) Medicane Ianos in its premature phase; and (c) Medicane Ianos in its mature phase.

Figure 2. The Mediterranean area is split into 12 partially overlapping tiles, shown as gray squares. The tile highlighted in green contains a cyclone center (the white dot) and is therefore labelled as positive (class 1). All other tiles are labelled as negative (class 0).

Figure 3. Video tile creation by stacking tile frames.

Figure 4. Training and validation loss curves for the self-supervised specialization stage of the VideoMAE ‘giant’ backbone on the large unlabelled dataset.

Figure 5. Training and validation loss curves for the detection fine-tuning stage, together with the balanced validation accuracy. The learning-rate schedule used during training is shown in black, with its values on the rightmost axis; although different learning-rate schedules were tested, the best final performance was found to be only weakly sensitive to this choice. The best model checkpoint is chosen right before overfitting occurs, at nearly epoch 200.

Figure 6. Example end-frames from the tracking video dataset. The red dot represents the cyclone-center position track label, always corresponding to the last-frame time. Since position labels are available only at integer hours, every clip ends at an integer hour, e.g., HH:00.

Figure 7. Training and validation loss curves for the tracking fine-tuning stage. The best model checkpoint is chosen right before overfitting occurs, shortly before epoch 200.

Figure 8. Confusion matrix for cyclone detection calculated on the balanced validation set.

Figure 9. Confusion matrix for cyclone detection calculated on the imbalanced test set.

Figure 10. Tracking-error distributions (km) for the training, validation, and test splits. The training split comprises 12 cyclones and 832 videoclip tiles, the validation split 3 cyclones and 192 videoclip tiles, and the test split 3 cyclones and 160 videoclip tiles. Across all three splits, the median error is approximately 12 km, while the validation and test distributions exhibit slightly broader upper tails than the training distribution.

Table 1. Maximum and minimum values used for the channel values de-normalization.

SEVIRI channel	$X_{min}$ (K)	$X_{max}$ (K)
IR_097	2.84	317.87
IR_108	199.10	313.28
WV_062	199.57	249.92
WV_073	198.95	286.96

Table 2. Event-wise train/validation/test split of the case studies used for both detection and tracking training. Their IDs are taken from TRACKS_CL7 [8].

Split	ID	Name
Training	1283	Unnamed
Training	1328	Rolf
Training	1358	Unnamed
Training	1421	Ruven
Training	1461	Qendresa
Training	1466	Unnamed
Training	1500	Unnamed
Training	1521	Unnamed
Training	1542	Trixie
Training	1575	Numa
Training	1674	Unnamed
Training	1702	Ianos
Validation	1715	Unnamed
Validation	1716	Inge
Validation	N/A	Apollo
Test	N/A	Blas
Test	N/A	Daniel
Test	N/A	Juliette

Table 3. Summary of the 18 case studies included in the dataset. The table reports the original start and end times together with the manually refined interval adopted for the detection of a clear cloud-rotation phase. Values marked with * are taken from Flaounas et al. [8]; when necessary, these intervals were adjusted to account for the time at which the cyclone enters or exits the Mediterranean crop domain, defined in Section 2.1.2.

Track ID	Name	Start time*	End time*	Refined start	Refined end
N/A	Apollo	2021-10-24 16:00	2021-10-31 21:00	2021-10-28 10:35	2021-10-30 13:00
N/A	Daniel	2023-09-04 20:00	2023-09-12 11:00	2023-09-05 03:20	2023-09-10 02:00
N/A	Blas	2021-11-07 17:00	2021-11-14 18:00	2021-11-08 10:20	2021-11-09 06:35
N/A	Juliette	2023-02-26 19:00	2023-03-03 08:00	2023-02-27 19:00	2023-03-02 13:30
1283	Unnamed	2010-10-11 17:00	2010-10-13 20:00	2010-10-12 12:55	2010-10-13 13:25
1328	Rolf	2011-11-05 05:00	2011-11-09 14:00	2011-11-05 22:55	2011-11-08 08:05
1358	Unnamed	2012-04-10 19:00	2012-04-14 21:00	2012-04-13 04:15	2012-04-14 21:00
1421	Ruven	2013-11-18 10:00	2013-11-20 22:00	2013-11-18 22:00	2013-11-20 09:00
1461	Qendresa	2014-11-05 23:00	2014-11-10 10:00	2014-11-07 00:10	2014-11-08 14:00
1466	Unnamed	2014-11-29 19:00	2014-12-03 20:00	2014-11-30 06:25	2014-12-03 06:30
1500	Unnamed	2015-09-29 18:00	2015-10-03 03:00	2015-09-30 21:25	2015-10-01 21:00
1521	Unnamed	2016-02-27 16:00	2016-03-02 10:00	2016-02-28 23:20	2016-03-01 08:05
1542	Trixie	2016-10-27 13:00	2016-11-02 01:00	2016-10-28 06:30	2016-10-31 01:50
1575	Numa	2017-11-14 23:00	2017-11-19 19:00	2017-11-16 14:45	2017-11-18 12:40
1674	Unnamed	2019-11-09 20:00	2019-11-11 21:00	2019-11-10 09:30	2019-11-11 17:30
1702	Ianos	2020-09-12 05:00	2020-09-20 18:00	2020-09-14 23:50	2020-09-18 19:40
1715	Unnamed	2020-12-12 17:00	2020-12-17 17:00	2020-12-13 08:20	2020-12-16 13:45
1716	Inge	2020-12-24 09:00	2020-12-28 01:00	2020-12-25 10:35	2020-12-27 10:50

Table 4. Clip-level partition of the detection dataset.

Split	Num. of cyclones	Number of videoclips / class balance
Training set	12	1238, balanced
Validation set	3	354, balanced
Test set	3	2400, 201 positives / 2199 negatives

Table 5. Confusion-matrix notation used for cyclone-detection verification metrics.

	Predicted: Event Yes	Predicted: Event No
True: Event Yes	H (hits)	M (misses)
True: Event No	F (false alarms)	C (correct negatives)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.