Submitted:
19 November 2025
Posted:
20 November 2025
1. Introduction: The Imperative for High-Resolution Climate Projections and the Rise of Machine Learning
1.1. Positioning This Review in the Literature
- Creating a novel taxonomy that explicitly maps different classes of ML models—from CNNs and GANs to Transformers and Diffusion Models—to the specific downscaling challenges they are best suited to address.
- Conducting a critical analysis of the “performance paradox,” where high statistical skill on historical data often fails to translate to robust performance under the non-stationary conditions of future climate change.
- Proposing a practical evaluation protocol and charting clear, targeted research priorities to guide the community towards developing more physically consistent, trustworthy, and operationally viable models.
1.2. Overview of the Review’s Scope and Objectives
- RQ1: Evolution of Methodologies: How have ML approaches for climate downscaling evolved from classical algorithms to the current deep learning architectures, and what are the primary capabilities and intended applications of each major model class?
- RQ2: Persistent Challenges: What are the critical, cross-cutting challenges that limit the operational reliability of contemporary ML downscaling models, particularly regarding their physical consistency, generalization under non-stationary climate conditions, and overall trustworthiness?
- RQ3: Emerging Solutions and Future Trajectories: What methodological frontiers, including physics-informed machine learning (PIML), robust uncertainty quantification (UQ), and explainable AI (XAI), hold the most promise for addressing these key challenges and guiding future research?

2. Scope and Approach
3. Background: The Downscaling Problem
3.1. The Scale Gap in Climate Modeling and the Need for Downscaling
3.2. Limitations of Traditional Downscaling Methods
3.2.1. Dynamical Downscaling (DD)
3.2.2. Statistical Downscaling (SD)
3.3. Emergence and Promise of ML in Transforming Statistical Downscaling
4. The Evolution of Machine Learning Approaches in Climate Downscaling
4.1. Early Applications and Classical ML Benchmarks
4.2. The Deep Learning Paradigm Shift
4.2.1. Pioneering Work with Convolutional Neural Networks (CNNs)
4.2.2. Architectural Innovations
U-Nets
Residual Networks (ResNets)
Generative Adversarial Networks (GANs)
- Strengths:
- GANs have shown promise for generating outputs with improved perceptual quality, sharp gradients, and in some cases better representation of fine-scale variability and heavy-tailed statistics compared to models trained solely with pixel-wise losses like Mean Squared Error (MSE) [32]. StyleGAN-family architectures achieve low Fréchet Inception Distance (FID) scores across large-scale benchmarks, highlighting their strength for perceptually realistic textures [56,57]. Conditional GANs (CGANs), such as MSG-GAN-SD by Accarino et al. [58], demonstrate direct applicability to downscaling by conditioning generation on low-resolution input. While some studies report more realistic precipitation or temperature fields using CGANs [59], consistent advantages in reproducing extremes remain preliminary and context-dependent.
- Limitations:
- GANs are notoriously challenging to train due to issues like mode collapse (where the generator produces limited varieties of samples) and training instability [32]. Evaluating GAN performance can also be difficult, as traditional pixel-wise metrics may not fully capture perceptual quality. Moreover, while GANs can produce visually appealing results, some studies suggest they may not accurately capture the full statistical distribution of the high-resolution data, which is critical for scientific applications [19]. The extrapolation of GANs for downscaling precipitation extremes in warmer, future climates remains an active area of research and concern.
- Notable application:
- A notable application of GAN-based frameworks is the Super-Resolution for Renewable Energy Resource Data with Climate Change Impacts (Sup3rCC) model developed by the National Renewable Energy Laboratory (NREL) [60]. Sup3rCC leverages GANs to downscale Global Climate Model (GCM) data to 4-km hourly resolution fields for variables crucial to the energy sector (wind, solar irradiance, temperature, humidity, and pressure) over the contiguous United States under various climate change scenarios. The model learns realistic spatial and temporal attributes by training on NREL's historical high-resolution datasets (e.g., the National Solar Radiation Database and the Wind Integration National Dataset Toolkit) and then injects this learned small-scale information into coarse GCM inputs. This methodology is computationally efficient compared to traditional dynamical downscaling while providing physically realistic high-resolution data tailored for studying climate change impacts on energy systems, renewable energy generation, and electricity demand. Notably, Sup3rCC is designed to represent historical climate and future climate scenarios, rather than to reproduce specific historical weather events [60].
Diffusion Models
- Latent Diffusion Models (LDMs): To mitigate the high computational cost of operating in pixel space, LDMs, such as those explored by Tomasi et al. [18], perform the diffusion process in a compressed latent space [63]. This significantly reduces training and sampling costs. For downscaling, LDMs have demonstrated the ability to mimic kilometer-scale dynamical model outputs (e.g., COSMO-CLM simulations) with remarkable fidelity for variables like 2m temperature and 10m wind speed, outperforming U-Net and GAN baselines in spatial error, frequency distributions, and power spectra [18].
- Spatio-Temporal and Video Diffusion: Recognizing the temporal nature of climate data, models like Spatio-Temporal Video Diffusion (STVD) extend video generation techniques to precipitation downscaling [19]. These frameworks often use a two-step process: a deterministic module (e.g., a U-Net) provides an initial coarse prediction, and a conditional diffusion model learns to add the high-frequency residual details. In initial experiments, STVD was reported to outperform GANs in capturing accurate statistical distributions and fine-grained precipitation structures, particularly those influenced by topography.
- Hybrid Dynamical-Generative Downscaling: A state-of-the-art paradigm combines the strengths of physical models and generative AI. As proposed by Lopez-Gomez et al. [9], this approach uses a computationally cheap RCM to dynamically downscale ESM output to an intermediate resolution. A generative diffusion model then refines this output to the final target resolution. This hybrid method leverages the physical consistency and generalizability of the RCM and the sampling efficiency and textural fidelity of the diffusion model. This approach not only reduces computational costs by over 97% compared to full dynamical downscaling but also produces more accurate uncertainty bounds and better captures spectra and multivariate correlations than traditional statistical methods.
- Distributional Correction: To better capture extreme events, recent work has focused on aligning the generated distribution with the target distribution, particularly in the tails. Liu et al. [20] introduced a Wasserstein penalty into a score-based diffusion model to improve the representation of extreme precipitation, demonstrating more reliable calibration across intensities.
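The distributional-correction idea can be illustrated with a toy training objective: a minimal numpy sketch (not Liu et al.'s actual implementation, which operates inside a score-based diffusion model) that augments a pixel-wise MSE with an empirical 1-D Wasserstein penalty between the predicted and target samples. The weighting parameter `lam` is a hypothetical hyperparameter for illustration.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples.
    For sorted samples it reduces to the mean absolute difference of
    corresponding order statistics."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def penalized_loss(pred, target, lam=0.1):
    """Pixel-wise MSE plus a distributional (Wasserstein) penalty that
    rewards matching the target distribution, including its tails."""
    mse = np.mean((pred - target) ** 2)
    return mse + lam * wasserstein_1d(pred.ravel(), target.ravel())
```

Because the penalty compares sorted values rather than co-located pixels, it pushes the model toward the correct intensity distribution even when spatial placement differs.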
Spatiotemporal Models (LSTMs, ConvLSTMs, Transformers)
- LSTMs/ConvLSTMs: Long Short-Term Memory (LSTM) networks [67], a type of Recurrent Neural Network (RNN), are designed to capture long-range temporal dependencies in sequential data. Convolutional LSTMs (ConvLSTMs) [68] extend LSTMs by replacing fully connected operations with convolutional operations, enabling them to process spatio-temporal data where inputs and states are 2D or 3D grids [68,69]. These models are particularly relevant for downscaling precipitation sequences or forecasting river runoff from atmospheric forcing.
- Strengths: They explicitly model temporal sequences and dependencies, which is crucial for variables with memory effects. Hybrid CNN-LSTM models can combine the spatial feature extraction of CNNs with the temporal modeling strengths of LSTMs, often outperforming standalone models [69].
- Limitations: Standard LSTMs can struggle with very high-dimensional spatial inputs unless effectively combined with convolutional structures, and training these complex recurrent architectures can be demanding. While ConvLSTMs are better suited to spatio-temporal data, their ability to capture very long-range spatial dependencies is limited compared to architectures like Transformers.
- Transformers: Originally developed for natural language processing [70], Transformer architectures, particularly Vision Transformers (ViTs) [71] and their variants, are increasingly being adopted for climate science applications, including downscaling [22]. Their core mechanism, self-attention, allows the model to weigh the importance of all other locations in the input when making a prediction for a single location. This enables the modeling of global context and long-range spatial dependencies (i.e., teleconnections), a critical advantage over the local receptive fields of CNNs. Key innovations and applications in downscaling include:
- Architectural Adaptations: Models like SwinIR (Swin Transformer for Image Restoration) and Uformer have been adapted from computer vision for downscaling temperature and wind speed, demonstrating superior performance over CNN baselines like U-Net [72]. For precipitation, PrecipFormer utilizes a window-based self-attention mechanism and multi-level processing to significantly reduce computational overhead while effectively capturing the localized and dynamic nature of rainfall [24].
- Resolution-Agnostic and Zero-Shot Downscaling: A significant frontier is the development of models that generalize across resolutions without retraining. Curran et al. [21] demonstrated that a pretrained Earth Vision Transformer (EarthViT) could be trained to downscale from 50 km to 25 km and then successfully applied to a 3 km resolution task in a zero-shot setting (i.e., without any fine-tuning on the new resolution). This capability is crucial for operational efficiency, as it avoids the costly process of retraining models for every new GCM or grid configuration [21]. Research comparing various architectures found that a Swin-Transformer-based approach combined with interpolation surprisingly outperformed neural operators in zero-shot downscaling tasks in terms of average error metrics [73].
- Foundation Models: The power and scalability of the Transformer architecture have made it the backbone for emerging foundation models in weather and climate science. Models like FourCastNet [22], Prithvi-WxC [23], and ORBIT-2 [74] are pre-trained on massive climate datasets (e.g., decades of ERA5 reanalysis). While primarily designed for forecasting, their learned representations of Earth system dynamics make them promising candidates for downscaling via fine-tuning. This paradigm shifts the task from training a specialized model from scratch to adapting a large, pre-trained model, which may enhance transferability and reduce data requirements for specific downscaling tasks, though this remains an active area of research [23,75].
Strengths: Transformers excel at modeling long-range spatial and temporal dependencies, a key physical aspect of the climate system. They show strong potential for transfer learning and zero-shot generalization, which could dramatically reduce the computational burden of downscaling large, multi-model ensembles. Recent benchmarks indicate that Transformer architectures can achieve competitive or superior zero-shot generalization across resolutions compared to some neural operator approaches [21,73]; while promising, these are early results rather than a settled state of the art, and broader validation across datasets and variables will be necessary. ViTs and adaptations like PrecipFormer [24] (which uses window-based self-attention and multi-level processing for efficiency) and EarthViT [21] have shown promise in capturing complex spatio-temporal patterns, and they exhibit good transferability, especially when combined with CNNs in hybrid architectures [28]. Foundation models built on Transformers, such as Prithvi-WxC [23] and ORBIT-2 [74], are being developed for multi-task downscaling across variables and geographies, while FourCastNet [22], another Transformer-based model, is a weather emulator designed to resolve and forecast high-resolution variables like surface wind speed and precipitation.

Limitations: The primary challenge is the quadratic computational complexity of the self-attention mechanism (O(N²), where N is the number of input patches), which can be prohibitive for very high-resolution data. However, innovations like window-based attention (Swin, PrecipFormer) and other efficient attention mechanisms are actively addressing this bottleneck [24]. Practically, their data-hungry nature means they benefit most from large-scale pre-training, making foundation models a key pathway for their effective use.
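To make the quadratic cost concrete, here is a minimal numpy sketch of single-head scaled dot-product self-attention over N patch embeddings. It uses identity query/key/value projections for brevity (real ViTs use learned projection matrices), so it is an illustration of the mechanism, not a production layer.

```python
import numpy as np

def self_attention(x):
    """x: (N, d) patch embeddings. Returns (contextualized patches, attention).
    The explicit (N, N) attention matrix is what makes self-attention
    O(N^2) in the number of patches N."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                 # (N, N) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # softmax numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # each row sums to 1
    return attn @ x, attn                         # weighted mix of all patches
```

Window-based attention (as in Swin or PrecipFormer) restricts the (N, N) matrix to local windows, which is how the quadratic bottleneck is tamed in practice.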
5. The Physical Frontier: Hybrid and Physics-Informed Downscaling
5.1. The Imperative for Physical Consistency
5.2. Architectural Integration of Physical Laws: PIML
- Soft Constraints
- This is the most common approach, where the standard data-fidelity loss term ($\mathcal{L}_{\text{data}}$) is augmented with a physics-based penalty term ($\mathcal{L}_{\text{physics}}$) [78]. The total loss becomes $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \mathcal{L}_{\text{physics}}$, where $\lambda$ is a weighting hyperparameter. $\mathcal{L}_{\text{physics}}$ is formulated as the residual of a governing differential equation (e.g., the continuity equation for mass conservation). By minimizing this residual across the domain, the network is encouraged, but not guaranteed, to find a physically consistent solution. This method is flexible and has been used to penalize violations of conservation laws [25] and to solve complex PDEs [26]. A common example is enforcing mass conservation in precipitation downscaling. If $x$ is the value of a single coarse-resolution input pixel and $\hat{y}_1, \dots, \hat{y}_n$ are the $n$ corresponding high-resolution output pixels from the neural network, a soft constraint can be added to the loss function to penalize deviations from conservation of mass: the high-resolution pixels should aggregate to the value of the corresponding coarse pixel. The total loss, $\mathcal{L}_{\text{total}}$, becomes a weighted sum of the data-fidelity term (e.g., Mean Squared Error, $\mathcal{L}_{\text{MSE}}$) and a physics penalty term: $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}} + \lambda \left( \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i - x \right)^2$, where $\lambda$ is a hyperparameter that controls the strength of the physical penalty. Minimizing this loss encourages, but does not guarantee, that the mean of the high-resolution patch matches the coarse-resolution value.
- Hard Constraints (Constrained Architectures)
This approach modifies the neural network architecture itself to strictly enforce physical laws by design. For example, Harder et al. [27] introduced specialized output layers that guarantee mass conservation by ensuring that the high-resolution output pixels aggregate exactly to the corresponding coarse-resolution input value. Such methods provide an absolute guarantee of physical consistency for the constrained property, which can improve both performance and generalization. While more difficult to design and potentially less flexible than soft constraints, they represent a more robust method for embedding inviolable physical principles [27]. In contrast to a soft constraint, a hard constraint enforces the physical law by design, often through a specialized, non-trainable output layer. Continuing the mass-conservation example, let $\tilde{y}_1, \dots, \tilde{y}_n$ be the raw, unconstrained outputs from the final hidden layer of the network. A multiplicative constraint layer can be designed to produce final, constrained outputs that are guaranteed to conserve mass: $\hat{y}_i = \tilde{y}_i \cdot \frac{n\,x}{\sum_{j=1}^{n} \tilde{y}_j}$. This layer rescales the raw outputs such that their mean is precisely equal to $x$, thereby strictly enforcing the conservation law at every forward pass, without the need for a penalty term in the loss function.
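The two formulations can be sketched in a few lines of numpy. This is a minimal illustration of the soft loss and the multiplicative rescaling layer described above, not Harder et al.'s exact implementation; the n high-resolution pixels are assumed to average to the coarse value, and the raw outputs are assumed positive so the rescaling is well defined.

```python
import numpy as np

def soft_constrained_loss(y_pred, y_true, x_coarse, lam=1.0):
    """Soft constraint: MSE plus a penalty on the mass-conservation residual.
    y_pred, y_true: (n,) high-resolution pixels for one coarse cell."""
    l_mse = np.mean((y_pred - y_true) ** 2)
    l_phys = (np.mean(y_pred) - x_coarse) ** 2  # conservation residual
    return l_mse + lam * l_phys

def hard_constraint_layer(y_raw, x_coarse):
    """Hard constraint: multiplicatively rescale raw (positive) outputs so
    that their mean is exactly the coarse-cell value, at every forward pass."""
    n = y_raw.size
    return y_raw * (n * x_coarse / np.sum(y_raw))
```

The soft loss merely shrinks the residual during training; the hard layer makes the residual identically zero by construction, which is the distinction the text draws.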
5.3. Hybrid Frameworks: Merging Dynamical and Statistical Strengths
- An initial, computationally inexpensive dynamical downscaling step using an RCM to bring coarse ESM output to an intermediate resolution (e.g., from 100km to 45km). This step grounds the output in a physically consistent dynamical state.
- A subsequent generative ML step, using a conditional diffusion model, to perform the final super-resolution to the target scale (e.g., from 45km to 9km). The diffusion model learns to add realistic, high-frequency spatial details.
5.4. Enforcing Physical Realism in Practice
5.4.1. The Frontier of Physics-Informed Machine Learning (PIML)
The Promise of Physics-ML Integration
- Ensuring Conservation Laws: Models can be designed or constrained to conserve fundamental quantities like mass and energy [8].
- Maintaining Thermodynamic Consistency: Predictions can be guided to adhere to known thermodynamic relationships (e.g., between temperature, humidity, and precipitation).
- Reducing Data Requirements: By embedding prior physical knowledge, PIML models may require less training data to achieve good performance compared to purely data-driven approaches, as the physical laws provide strong regularization [26].
- Improving Extrapolation: Models that respect physical principles are hypothesized to extrapolate more reliably to unseen conditions, as these principles are expected to hold even when statistical relationships change.
Implementation Approaches for PIML
- Hard Constraints: This approach involves modifying the neural network architecture or adding specific constraint layers at the output to strictly guarantee that certain physical laws are satisfied [27]. For example, a constraint layer could ensure that the total precipitation over a downscaled region matches the coarse-grid precipitation input, thereby enforcing water mass conservation. Advantages: guarantees physical consistency for the enforced laws. Disadvantages: can be more challenging to design and may limit the model's flexibility if the constraints are too restrictive or incorrectly formulated.
- Soft Constraints via Loss Functions: This is the more common approach, where penalty terms representing deviations from physical laws are added to the overall loss function that the model minimizes during training [26]. Advantages: more flexible than hard constraints, can incorporate multiple physical principles simultaneously, and is easier to implement for complex, non-linear PDEs. Disadvantages: does not strictly guarantee constraint satisfaction, only encourages it; the choice of weighting for the physics-based loss term ($\lambda$) can be critical and may require careful tuning.
- Hybrid Statistical–Dynamical Models: As discussed previously, these models combine ML with components of traditional dynamical models [8]. ML can be used to emulate specific, computationally expensive parameterizations within an RCM, or to learn corrective terms for RCM biases. This approach inherently leverages the physical basis of the dynamical model components.

Case Studies and Results
6. Data, Variables, and Preprocessing Strategies in ML-Based Downscaling
6.1. Common Predictor Datasets (Low-Resolution Inputs)
- ERA5 Reanalysis:
- The fifth generation ECMWF atmospheric reanalysis, ERA5, is extensively used as a source of predictor variables, particularly for training models in a "perfect-prognosis" framework [83,84]. ERA5 provides a globally complete and consistent, high-resolution (relative to GCMs, typically 31 km or 0.25°) gridded dataset of many atmospheric, land-surface, and oceanic variables from 1940 onwards, assimilating a vast amount of historical observations. Its physical consistency and observational constraint make it an ideal training ground for ML models to learn relationships between large-scale atmospheric states and local climate variables. Often, models trained on ERA5 are subsequently applied to downscale GCM projections.
- CMIP5/CMIP6 GCM Outputs:
- Outputs from the Coupled Model Intercomparison Project Phase 5 (CMIP5) and Phase 6 (CMIP6) GCMs are indispensable when the objective is to downscale future climate projections under various emission scenarios (e.g., Representative Concentration Pathways - RCPs, or Shared Socioeconomic Pathways - SSPs). These GCMs provide the large-scale atmospheric forcing necessary for projecting future climate change. However, their coarse resolution and inherent biases necessitate downscaling and often bias correction before their outputs can be used for regional impact studies [10,84].
- CORDEX RCM Outputs:
- Data from the Coordinated Regional Climate Downscaling Experiment (CORDEX) are also utilized, particularly when ML techniques are employed for further statistical refinement of RCM outputs, as RCM emulators, or in hybrid downscaling approaches. CORDEX provides dynamically downscaled climate projections over various global domains, offering higher resolution than GCMs and incorporating regional climate dynamics. However, these outputs may still require further downscaling for very local applications or may possess biases that ML can help correct.
6.2. High-Resolution Reference Datasets (Target Data)
- Gridded Observational Datasets:
- Products like PRISM (Parameter-elevation Regressions on Independent Slopes Model) for North America [8,85], Iberia01 for the Iberian Peninsula [86], E-OBS for Europe [87], and regional datasets like REKIS [88] are commonly used [8]. PRISM, for example, provides high-resolution (e.g., 800m or 4km) daily temperature and precipitation data across the conterminous United States, incorporating physiographic influences like elevation and coastal proximity into its interpolation [85]. These datasets are invaluable for training models in a perfect-prognosis setup, where historical observations are used as the target.
- Satellite-Derived Products:
- Satellite observations offer global or near-global coverage and are increasingly used as reference data. Notable examples include the Global Precipitation Measurement (GPM) mission’s Integrated Multi-satellitE Retrievals for GPM (IMERG) products for precipitation [89] and the Soil Moisture Active Passive (SMAP) mission for soil moisture [90]. GPM IMERG, for instance, provides precipitation estimates at resolutions like 0.1° and 30-minute intervals, with various products (Early, Late, and Final Run) catering to different latency and accuracy requirements [89].
- Regional Reanalyses or High-Resolution Simulations:
- In some cases, outputs from high-resolution regional reanalyses or dedicated RCM simulations (sometimes run specifically for the purpose of generating training data) are used as the "truth" data, especially when high-quality gridded observations are scarce [29].
- FluxNet:
- For variables related to land surface processes and evapotranspiration, data from the FluxNet network of eddy covariance towers provide valuable site-level observational data for model validation [91]. These towers measure exchanges of carbon dioxide, water vapor, and energy between ecosystems and the atmosphere.
6.3. Key Downscaled Variables
- Daily Precipitation and 2-meter Temperature: These are the most commonly downscaled variables due to their direct relevance for impact studies (e.g., agriculture, hydrology, health). This includes mean, minimum, and maximum temperatures.
- Multivariate Downscaling: There is a growing trend towards downscaling multiple climate variables simultaneously (e.g., temperature, precipitation, wind speed, solar radiation, humidity). This is important for ensuring physical consistency among the downscaled variables.
- Spatial/Temporal Scales: Typical downscaling efforts aim to increase resolution from GCM/Reanalysis scales of 25-100 km to target resolutions of 1-10 km, predominantly at a daily temporal resolution.
6.4. Feature Engineering and Selection
- Static Predictors:
- High-resolution static geographical features such as topography (including elevation, slope, and aspect), land cover type, soil properties, and climatological averages are frequently incorporated as additional predictor variables. These features provide crucial local context that is often unresolved in coarse-scale GCM or reanalysis outputs. For instance, orography heavily influences local precipitation patterns and temperature lapse rates, while land cover affects surface energy balance and evapotranspiration [44,85]. The inclusion of these static predictors allows ML models to learn how large-scale atmospheric conditions interact with local surface characteristics to produce fine-scale climate variations.
- Dynamic Predictors:
- For specific variables like soil moisture, dynamic predictors such as Land Surface Temperature (LST) and Vegetation Indices (e.g., NDVI, EVI) derived from satellite remote sensing are often used, as these variables capture short-term fluctuations related to surface energy and water balance [92].
- Dimensionality Reduction and Collinearity:
- When dealing with a large number of potential predictors, dimensionality reduction techniques like Principal Component Analysis (PCA) are sometimes employed to reduce the number of input features while retaining most of the variance. This can help to mitigate issues related to collinearity among predictors and reduce computational load. Regularization techniques (e.g., L1 or L2 regularization) embedded within many ML models also implicitly handle collinearity by penalizing large model weights.
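As a minimal illustration of PCA-based predictor reduction, the following SVD sketch projects a generic predictor matrix onto its leading components; it is not tied to any specific dataset, and the variable names are illustrative.

```python
import numpy as np

def pca_reduce(X, k):
    """Project an (n_samples, n_features) predictor matrix onto its first k
    principal components. Returns the component scores and the
    explained-variance ratio of the retained components."""
    Xc = X - X.mean(axis=0)                        # center each predictor
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                         # (n_samples, k) reduced inputs
    evr = s**2 / np.sum(s**2)                      # variance fraction per component
    return scores, evr[:k]
```

Because the principal components are orthogonal, the reduced inputs are free of the collinearity present in the original predictor set.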
6.5. Data Preprocessing Challenges
- Data-Scarce Areas: A significant hurdle is the availability of sufficient high-quality, high-resolution reference data for training and validation, especially in many parts of the developing world or in regions with complex terrain where observational networks are sparse [93].
- Imbalanced Data for Extreme Events: Extreme climatic events (e.g., heavy precipitation, heatwaves) are, by definition, rare. This leads to imbalanced datasets where extreme values are underrepresented, potentially biasing ML models (trained with standard loss functions like MSE) to perform well on common conditions but poorly on these critical, high-impact events. This issue often hinders models from learning the specific characteristics of extremes.
- Ensuring Domain Consistency: Predictor variables derived from GCM simulations may exhibit different statistical properties (e.g., means, variances, distributions) and systematic biases compared to the reanalysis data (such as ERA5) often used for model training, even over historical periods. This mismatch, known as domain or covariate shift, violates the assumption that training and application data are drawn from the same distribution and can substantially degrade model performance, making it a critical preprocessing consideration. Techniques such as bias correction of GCM predictors, working with anomalies (removing climatological means from both predictor and predictand data to focus on changes), or more advanced domain adaptation methods are employed to mitigate this issue and enhance consistency [94].
- Quality Control and Gap-Filling: Observational and satellite-derived datasets frequently require substantial preprocessing steps, including quality control to remove erroneous data, and gap-filling techniques (e.g., interpolation) to handle missing values due to sensor malfunction or environmental conditions (like cloud cover for satellite imagery) [95].
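One of the bias-correction techniques mentioned above, empirical quantile mapping, can be sketched as follows. This is a minimal version for illustration; operational implementations treat distribution tails, seasonality, and wet-day frequency more carefully.

```python
import numpy as np

def quantile_map(gcm_values, gcm_hist, obs_hist, n_quantiles=101):
    """Map GCM values through the historical GCM quantiles onto the observed
    quantiles, correcting systematic distributional biases in the predictors."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    gcm_q = np.quantile(gcm_hist, q)   # historical GCM quantiles
    obs_q = np.quantile(obs_hist, q)   # observed quantiles at the same levels
    return np.interp(gcm_values, gcm_q, obs_q)
```

Each GCM value is located on the historical GCM CDF and replaced by the observed value at the same cumulative probability, so the corrected series inherits the observed distribution.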
7. A Prescriptive Protocol for Model Evaluation
7.1. Protocol for Precipitation Downscaling
- Root Mean Squared Error (RMSE): Report as a baseline metric for average error, but acknowledge its limitations in penalizing realistic high-frequency variability.
- Fraction Skill Score (FSS): This is the primary metric [96] for spatial accuracy. FSS should be reported for multiple intensity thresholds and spatial scales to assess performance across different event types. Based on common practice in forecast verification, we recommend thresholds relevant to hydrological impacts, chosen according to the typical precipitation severity and return periods of the study area (for instance, 1, 5, and 20 mm/day). The analysis should show FSS as a function of neighborhood size, with spatial scales appropriate to the domain (for example, 10, 20, 40, and 80 km); these scales should be chosen carefully to identify the scale at which the downscaled field becomes skillful.
- High-Quantile Error: To specifically evaluate performance on extremes, report the bias or absolute error for a high quantile of the daily precipitation distribution, such as the 99th or 99.5th percentile. This directly measures the model’s ability to capture the magnitude of rare, intense events.
- Power Spectral Density (PSD): Plot the 1D radially-averaged power spectrum of the downscaled precipitation fields against the reference data. This is a critical diagnostic for spatial realism. An overly steep slope indicates excessive smoothing, while a shallow slope or bumps at high frequencies can indicate unrealistic noise or GAN-induced artifacts.
- Continuous Ranked Probability Score (CRPS): For probabilistic models (e.g., GAN or Diffusion ensembles), the CRPS [97] is the gold-standard metric for overall skill, as it evaluates the entire predictive distribution. It should be reported as the primary probabilistic skill score.
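Minimal numpy sketches of three of the metrics above (FSS, radially averaged PSD, and ensemble CRPS) follow. They implement the standard textbook definitions and are intended as reference illustrations, not production verification code.

```python
import numpy as np

def fss(forecast, observed, threshold, n):
    """Fraction Skill Score: compare exceedance fractions over n x n
    neighborhoods ('valid' windows, via a summed-area table)."""
    def fractions(field):
        b = (field >= threshold).astype(float)
        S = np.pad(b, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
        win = S[n:, n:] - S[:-n, n:] - S[n:, :-n] + S[:-n, :-n]
        return win / (n * n)
    pf, po = fractions(forecast), fractions(observed)
    ref = np.mean(pf**2) + np.mean(po**2)
    return 1.0 - np.mean((pf - po) ** 2) / ref if ref > 0 else np.nan

def radial_psd(field):
    """1-D radially averaged power spectrum of a 2-D field."""
    psd2 = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    h, w = field.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h // 2, x - w // 2).astype(int)  # integer radius per pixel
    sums = np.bincount(r.ravel(), weights=psd2.ravel())
    counts = np.maximum(np.bincount(r.ravel()), 1)    # guard empty annuli
    return sums / counts

def crps_ensemble(ens, obs):
    """CRPS for a finite ensemble, energy form: E|X - y| - 0.5 E|X - X'|."""
    ens = np.asarray(ens, dtype=float)
    return (np.mean(np.abs(ens - obs))
            - 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :])))
```

A perfect forecast yields FSS = 1 at every scale; an overly smooth field shows a steeply falling radial PSD; and CRPS reduces to the absolute error when the ensemble collapses to a single member.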
7.2. Protocol for Temperature Downscaling
- RMSE and Bias: Report the overall Root Mean Squared Error and Mean Bias (downscaled minus reference) as standard metrics of accuracy and systematic error.
- Power Spectral Density (PSD): As with precipitation, the PSD is crucial for ensuring that the downscaled temperature fields contain realistic spatial variability and are not overly smoothed by the model.
- Distributional Metrics (e.g., Wasserstein Distance): Compare the full probability distributions of downscaled and reference temperatures using a robust metric like the Wasserstein Distance. This provides a more complete picture of performance than just comparing means and variances, capturing shifts in the shape and tails of the distribution.
- Reliability Diagram (for probabilistic models): If the model produces probabilistic forecasts (e.g., ensembles), a reliability diagram is essential. It plots the observed frequency of an event against the forecast probability, providing a direct visual assessment of calibration. A well-calibrated model should lie along the 1:1 diagonal line.
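The last two diagnostics can be sketched as follows: a quantile-based 1-D Wasserstein distance and a binned reliability curve. Both are minimal illustrations of the standard definitions; bin counts and quantile resolution are illustrative choices.

```python
import numpy as np

def wasserstein_1d(a, b, n_q=1001):
    """1-D Wasserstein-1 distance via the quantile-function representation
    (handles samples of unequal size)."""
    q = np.linspace(0.0, 1.0, n_q)
    return np.mean(np.abs(np.quantile(a, q) - np.quantile(b, q)))

def reliability_curve(probs, outcomes, n_bins=10):
    """Mean forecast probability and observed event frequency per bin.
    A well-calibrated model lies near the 1:1 diagonal."""
    probs = np.asarray(probs, float)
    outcomes = np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    mean_fc, obs_freq = [], []
    for i in range(n_bins):
        m = idx == i
        mean_fc.append(probs[m].mean() if m.any() else np.nan)
        obs_freq.append(outcomes[m].mean() if m.any() else np.nan)
    return np.array(mean_fc), np.array(obs_freq)
```

Plotting `obs_freq` against `mean_fc` gives the reliability diagram described above; deviations from the diagonal expose over- or under-confidence.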
7.3. Comparative Analysis and State-of-the-Art
- For spatial structure and deterministic accuracy, U-Net and ResNet-based CNNs remain strong contenders, particularly for smoother variables like temperature. Their inductive bias for local patterns is highly effective for learning topographically-induced climate variations [8].
- For probabilistic outputs and UQ, Diffusion models are emerging as the state-of-the-art due to their stable training and ability to generate high-fidelity, diverse ensembles [9,18]. They often outperform GANs on distributional metrics. As a simple, strong baseline for epistemic uncertainty, report deep ensembles [99] with CRPS and reliability diagnostics.
- For transferability and zero-shot generalization, Transformer-based foundation models represent the cutting edge. Their ability to learn from vast, diverse datasets enables generalization to new resolutions and regions with minimal fine-tuning, a critical capability for operational scalability [21].
7.4. Validation Under Non-Stationarity
7.4.1. Pseudo-Global Warming (PGW) Experiments
7.4.2. Transfer Learning and Domain Adaptation
- Models might be pre-trained on large, diverse datasets (e.g., multiple GCMs, long historical records) to learn general, invariant features of atmospheric processes.
- These pre-trained models can then be fine-tuned on smaller, target-specific datasets (e.g., data for a particular region, a specific future period, or a new GCM) [28]. This approach can lead to better generalization and reduce the amount of target-specific data needed for training. However, careful validation is crucial to ensure that the transferred knowledge is beneficial and does not introduce biases from the source domain. Prasad et al. [28] demonstrated that pre-training on diverse climate datasets can enhance zero-shot transferability for some downscaling tasks, but fine-tuning often remains necessary for optimal performance on distinct target domains like different GCM outputs.
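The pre-train/fine-tune workflow can be illustrated with a deliberately minimal linear "model" trained by gradient descent. The source/target domains and all numbers are synthetic assumptions for illustration, not results from [28]; the point is only that weights pre-trained on an abundant source domain adapt faster on scarce target data than weights trained from scratch.

```python
import numpy as np

def train(X, y, w0, lr=0.1, steps=100):
    """Plain gradient descent on MSE for a linear map y ~ X @ w."""
    w = w0.copy()
    for _ in range(steps):
        w -= lr * 2.0 * X.T @ (X @ w - y) / len(y)
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(42)

# "Source" domain: abundant data with one predictor-predictand relationship
w_src = np.array([1.0, -0.5, 0.3])
X_src = rng.normal(size=(2000, 3))
y_src = X_src @ w_src + 0.1 * rng.normal(size=2000)

# "Target" domain: scarce data, slightly shifted relationship (concept drift)
w_tgt = np.array([1.1, -0.4, 0.3])
X_tgt = rng.normal(size=(50, 3))
y_tgt = X_tgt @ w_tgt + 0.1 * rng.normal(size=50)

# Pre-train on the source, then fine-tune the same weights on the target
w_pre = train(X_src, y_src, np.zeros(3), steps=300)
w_ft = train(X_tgt, y_tgt, w_pre, steps=8)

# Baseline: train from scratch on the scarce target data with the same budget
w_scratch = train(X_tgt, y_tgt, np.zeros(3), steps=8)

print("fine-tuned target MSE  :", mse(w_ft, X_tgt, y_tgt))
print("from-scratch target MSE:", mse(w_scratch, X_tgt, y_tgt))
```

The same logic carries over to deep networks, where fine-tuning typically updates only the later layers while earlier, more general feature extractors are frozen.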
7.4.3. Process-Informed Architectures and Predictor Selection
- Encoding known physical relationships into the network architecture: This might involve designing specific layers or connections that mimic physical processes or constraints.
- Using physically-motivated predictor variables: Selecting input variables that have a clear and robust physical link to the predictand (e.g., thermodynamic variables like potential temperature, specific humidity, or large-scale circulation indices known to influence local weather) rather than relying on a large set of potentially collinear or causally weak predictors.
7.4.4. Validation Strategies for Non-Stationary Conditions
- Perfect Model Framework (Pseudo-Reality Experiments): In this setup, output from a high-resolution GCM or RCM simulation is treated as the “perfect” truth [80]. Coarsened versions of this output are used to train the ML downscaling model, which then attempts to reconstruct the original high-resolution “truth”. This framework allows for testing the ML model’s ability to downscale under different climate states (e.g., historical vs. future periods from the same GCM/RCM), as the “truth” is known for all periods. This is crucial for evaluating extrapolation capabilities.
- Cross-GCM Validation: Models are trained on a subset of available GCMs and then tested on GCMs that were not included in the training set. This assesses the model’s ability to generalize to climate model outputs with different structural characteristics and biases.
- Temporal Extrapolation (Out-of-Sample Testing): Using the most recent portion of the historical record or specific periods with distinct climatic characteristics (e.g., the warmest historical years as proxies for future conditions) exclusively for testing, after training on earlier data [8]. This provides a more stringent test of generalization than random cross-validation.
- Process-Based Evaluation: Beyond statistical metrics, evaluating whether the downscaled outputs maintain plausible physical relationships between variables (e.g., temperature–precipitation scaling, wind–pressure relationships) and accurately represent key climate processes (e.g., diurnal cycles, seasonal transitions, extreme event characteristics) under different climate conditions. XAI techniques can play a role here in verifying if the model is relying on physically sound mechanisms.
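The temporal-extrapolation split described above, holding out the warmest years as proxies for future conditions, might look like the following sketch; `warm_tail_split` is a hypothetical helper and the temperature record is synthetic.

```python
import numpy as np

def warm_tail_split(annual_mean_temp, test_frac=0.2):
    """Hold out the warmest years as a proxy for future conditions.

    Returns boolean (train, test) masks; the test set contains the
    `test_frac` warmest years, which the model never sees in training.
    """
    t = np.asarray(annual_mean_temp)
    n_test = max(1, int(round(test_frac * len(t))))
    test = np.zeros(len(t), dtype=bool)
    test[np.argsort(t)[-n_test:]] = True   # indices of the warmest years
    return ~test, test

# Synthetic 1980-2019 record with a warming trend plus interannual noise
rng = np.random.default_rng(1)
years = np.arange(1980, 2020)
temps = 14.0 + 0.02 * (years - 1980) + rng.normal(0.0, 0.3, len(years))
train_mask, test_mask = warm_tail_split(temps)
print("held-out (warmest) years:", years[test_mask])
```

Because the held-out years are systematically warmer than anything in training, skill on this split is a stricter indicator of extrapolation ability than random cross-validation.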
7.5. A Multi-Faceted Toolkit for Model Evaluation
Uncertainty Baselines
7.6. Operational Relevance: Beyond Statistical Skill
- Computational Cost: Dynamical downscaling is exceptionally expensive, limiting its use for large ensembles. ML offers a computationally cheaper alternative by orders of magnitude [9,112]. However, costs vary within ML: inference with CNNs is fast, while the iterative sampling of diffusion models is slower. Training large foundation models requires massive computational resources, but once trained, fine-tuning and inference can be efficient [23]. The hybrid dynamical-generative approach offers a compelling trade-off, drastically cutting the cost of the most expensive part of the physical simulation pipeline [9].
- Interpretability: As discussed in Section 9.2.2, the "black-box" nature of deep learning is a major barrier to operational trust. The ability to use XAI tools to verify that a model is learning physically meaningful relationships, rather than spurious "shortcuts," is crucial for deployment in high-stakes applications.
- Robustness and Generalization: The single most important factor for operational relevance is a model’s ability to generalize to out-of-distribution (OOD) data, namely future climate scenarios. As detailed in Section 9.1, models that fail under covariate or concept drift are not operationally viable for climate projection. Therefore, rigorous OOD evaluation using techniques like cross-GCM validation and Pseudo-Global Warming (PGW) experiments is a prerequisite for deployment.
- Baselines: Always include strong classical comparators (e.g., BCSD/quantile-mapping and LOCA) as default references alongside modern DL models; these remain common operational choices in hydrologic and climate-services pipelines [33,34]. Formal assessments and national products continue to operationalize statistical interfaces between GCMs and impacts—bias adjustment and empirical/statistical downscaling (e.g., LOCA2, STAR-ESDM)—as default pathways, which underscores why ML downscalers must demonstrate clear, application-relevant added value [113,114].
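As one concrete classical comparator, empirical quantile mapping (the statistical core of BCSD-style bias correction) can be sketched in a few lines. This is a simplified illustration on synthetic data, not the LOCA/BCSD production algorithm.

```python
import numpy as np

def quantile_map(model_hist, obs_hist, model_values):
    """Empirical quantile mapping: replace each model value with the
    observed value at the same quantile of the historical model CDF."""
    ranks = np.searchsorted(np.sort(model_hist), model_values) / len(model_hist)
    return np.quantile(obs_hist, np.clip(ranks, 0.0, 1.0))

# Synthetic check: a model that is a biased, rescaled copy of observations
rng = np.random.default_rng(2)
obs = rng.gamma(2.0, 2.0, 5000)        # "observed" historical precipitation
model = 0.7 * obs + 1.0                # systematically biased "model"
corrected = quantile_map(model, obs, model)
print("raw mean:", model.mean(), " corrected mean:", corrected.mean(),
      " obs mean:", obs.mean())
```

Any proposed ML downscaler should demonstrate added value over this kind of baseline on the application-relevant metrics, not just on aggregate error scores.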
8. Critical Investigation of Model Performance and Rationale
8.1. Rationale for Model Choices
- CNNs/U-Nets for Spatial Patterns:
- These architectures are predominantly chosen for their proficiency in learning hierarchical spatial features from gridded data. Convolutional layers are adept at identifying local patterns, while pooling layers capture broader contextual information. U-Nets, with their encoder-decoder structure and skip connections, are particularly favored for tasks requiring precise spatial localization and preservation of fine details, making them well-suited for downscaling variables like temperature and precipitation where spatial structure is paramount [8].
- LSTMs/ConvLSTMs for Temporal Dependencies:
- When the temporal evolution of climate variables and their sequential dependencies are critical (e.g., for daily precipitation sequences or hydrological runoff forecasting), LSTMs and ConvLSTMs are preferred due to their recurrent nature and ability to capture long-range temporal patterns.
- GANs/Diffusion Models for Realistic Outputs and Extremes:
- These generative models are selected when the objective is to produce downscaled fields that are not only statistically accurate but also perceptually realistic, with sharp gradients and a better representation of the full statistical distribution, including extreme events [8].
- Transformers for Long-Range Dependencies:
8.2. Factors Contributing to Model Success
- Appropriate Architectural Design: Matching the model architecture to the inherent characteristics of the data and the downscaling task is paramount. For instance, CNNs are well-suited for gridded spatial data, while LSTMs excel with time series. The incorporation of architectural enhancements like residual connections and the skip connections characteristic of U-Nets have proven crucial for training deeper models and preserving fine-grained spatial detail.
- Effective Feature Engineering: The performance of ML models is significantly boosted by the inclusion of relevant predictor variables. In particular, incorporating high-resolution static geographical features such as topography, land cover, and soil type provides essential local context that coarse-resolution GCMs or reanalysis products inherently lack. This allows the model to learn how large-scale atmospheric conditions are modulated by local surface characteristics.
- Quality and Representativeness of Training Data: The availability of sufficient, high-quality, and representative training data is fundamental. Data augmentation techniques, such as rotation or flipping of input fields, can expand the training set and improve model generalization, especially for underrepresented phenomena like extreme events [14,115].
- Appropriate Loss Functions: The choice of loss function used during model training significantly influences the characteristics of the downscaled output. While standard loss functions like MSE are common, they can lead to overly smooth predictions and poor representation of extremes. Tailoring loss functions to the specific task—for example, using quantile loss, Bernoulli-Gamma loss for precipitation (which models occurrence and intensity separately), Dice loss for imbalanced data, or the adversarial loss in GANs for perceptual quality—can lead to substantial improvements in capturing critical aspects of the climate variable’s distribution [8]. Studies show that L1 and L2 loss functions perform differently depending on data balance, with L2 often being better for imbalanced data like precipitation [116].
- Rigorous Validation Frameworks: The use of robust validation strategies, including out-of-sample testing and standardized evaluation metrics beyond simple error scores (e.g., the VALUE framework [117]), is crucial for assessing true model skill and generalizability.
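The Bernoulli-Gamma negative log-likelihood mentioned above can be written down directly. This sketch assumes per-sample predictions of occurrence probability and Gamma shape/rate; it is a generic formulation, not code from any specific cited model.

```python
import numpy as np
from scipy.special import gammaln

def bernoulli_gamma_nll(p, alpha, beta, y, eps=1e-8):
    """Mean negative log-likelihood of a Bernoulli-Gamma model.

    p           : predicted rain probability (occurrence branch)
    alpha, beta : predicted Gamma shape and rate (intensity branch)
    y           : observed precipitation, with y == 0 on dry days
    """
    wet = y > 0
    nll = np.empty_like(y, dtype=float)
    nll[~wet] = -np.log(1.0 - p[~wet] + eps)                   # dry days
    yw, pw, aw, bw = y[wet], p[wet], alpha[wet], beta[wet]
    # log of [p * Gamma(alpha, rate=beta) density] on wet days
    nll[wet] = -(np.log(pw + eps) + aw * np.log(bw)
                 + (aw - 1.0) * np.log(yw) - bw * yw - gammaln(aw))
    return float(nll.mean())
```

In practice a network head would output (p, alpha, beta) per pixel, so minimizing this loss trains occurrence and intensity jointly rather than smoothing both into a single regression target.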
8.3. Factors Hindering Model Learning
- Overfitting: Models may learn noise or spurious correlations present in the specific training dataset, leading to excellent performance on seen data but poor generalization to unseen data. This is a common issue, especially with highly flexible DL models and limited or non-diverse training data.
- Poor Generalization (the "Transferability Crisis"), Covariate Shift, Concept Drift, and Shortcut Learning: A major and persistent challenge is the failure of models to extrapolate reliably to conditions significantly different from those encountered during training. This "transferability crisis" is the core of the "performance paradox" and is rooted in the violation of the stationarity assumption. It can be rigorously framed using established machine learning concepts:
  - Covariate Shift: This occurs when the distribution of input data, P(X), changes between training and deployment, while the underlying relationship P(Y|X) remains the same [118]. In downscaling, this is guaranteed when applying a model trained on historical reanalysis (e.g., ERA5) to the outputs of a GCM, which has its own systematic biases and statistical properties. It also occurs when projecting into a future climate where the statistical distributions of atmospheric predictors (e.g., mean temperature, storm frequency) have shifted.
  - Concept Drift: This is a more fundamental challenge in which the relationship between predictors and the target variable, P(Y|X), itself changes [118]. Under climate change, the physical processes linking large-scale drivers to local outcomes might be altered (e.g., changes in atmospheric stability could alter lapse rates). A mapping learned from historical data may therefore become invalid.
  - Shortcut Learning: This phenomenon provides a mechanism to explain why models are so vulnerable to these shifts [119]. Models often learn "shortcuts": simple, non-robust decision rules that exploit spurious correlations in the training data instead of the true underlying physical mechanisms [119]. For example, a model might learn to associate a specific GCM’s known regional cold bias with a certain type of downscaled precipitation pattern. This shortcut works perfectly for that GCM but fails completely when applied to a different, unbiased GCM or to the real world, leading to poor OOD performance. The finding by González-Abad et al. [77] that models may rely on spurious teleconnections is a prime example of shortcut learning in this domain.
The difficulty in generalizing to these OOD conditions is therefore a core impediment. High performance on historical, in-distribution test data provides no guarantee of reliability for future projections, necessitating strategies focused on robustness, physical understanding, and OOD detection.
- Lack of Physical Constraints: Purely data-driven ML models, optimized solely for statistical accuracy, can produce outputs that are physically implausible or inconsistent (e.g., violating conservation laws). This lack of physical grounding can severely limit the trustworthiness and utility of downscaled projections.
- Data Limitations: Insufficient training data, particularly for rare or extreme events, remains a significant bottleneck. Data scarcity in certain geographical regions also poses a challenge for developing globally applicable models. Furthermore, the lack of training data that adequately represents the full range of potential future climate states can hinder a model’s ability to project future changes accurately.
- Inappropriate Model Complexity: Choosing an inappropriate level of model complexity can be detrimental. Models that are too simple may underfit the data, failing to capture complex relationships. Conversely, overly complex models are prone to overfitting, may be more difficult to train, and can be computationally prohibitive.
- Training Difficulties (e.g., Vanishing/Exploding Gradients): In very deep neural networks, especially plain CNNs without architectural aids like residual connections, the gradients used for updating model weights can become infinitesimally small (vanishing) or excessively large (exploding), hindering the learning process.
- Input Data Biases and Inconsistencies: Systematic biases present in GCM outputs, or inconsistencies between the statistical characteristics of training data (e.g., reanalysis) and application data (e.g., GCM outputs from a different model or future period), representing a significant covariate shift as discussed previously, can significantly degrade downscaling performance. Preprocessing steps, such as bias correction of predictors or working with anomalies by removing climatology, are often crucial for mitigating these issues [80].
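Working with standardized anomalies, as suggested above, can be sketched as follows: standardizing each dataset against its own climatology aligns the first two moments of reanalysis-trained inputs and GCM inputs. The datasets below are hypothetical stand-ins.

```python
import numpy as np

def standardize_by_own_climatology(reanalysis, gcm_field):
    """Convert predictors to standardized anomalies of their own climatology.

    Removing each dataset's own mean and dividing by its own standard
    deviation aligns the first two moments of training (reanalysis) and
    application (GCM) inputs, mitigating covariate shift from GCM biases.
    """
    z_re = (reanalysis - reanalysis.mean(axis=0)) / reanalysis.std(axis=0)
    z_gcm = (gcm_field - gcm_field.mean(axis=0)) / gcm_field.std(axis=0)
    return z_re, z_gcm

# Hypothetical predictors: unbiased reanalysis vs. a warm, over-variable GCM
rng = np.random.default_rng(7)
reanalysis = rng.normal(0.0, 1.0, size=(500, 4))
gcm = rng.normal(2.0, 1.5, size=(500, 4))   # +2 bias, inflated variance
z_re, z_gcm = standardize_by_own_climatology(reanalysis, gcm)
```

Note that this only corrects the marginal moments; higher-order or multivariate discrepancies between reanalysis and GCM predictors can still degrade performance.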
8.4. Comparative Analysis of ML Approaches
9. Overarching Challenges in ML-Based Climate Downscaling
9.1. Transferability and Domain Adaptation: The Achilles’ Heel
- Extrapolation to Future Climates: Models trained exclusively on historical climate data often struggle to perform reliably when applied to future climate scenarios characterized by significantly different mean states, altered atmospheric dynamics, or novel patterns of variability. Studies by Hernanz et al. [121] demonstrated catastrophic drops in CNN performance when applied to future projections or GCMs not included in the training set. The models may learn statistical relationships that are valid for the historical period but do not hold under substantial climate change.
- Cross-GCM/RCM Transfer: Due to inherent differences in model physics, parameterizations, resolutions, and systematic biases, ML models trained to downscale the output of one GCM or RCM often exhibit degraded performance when applied to outputs from other climate models. This limits the ability to readily apply a single trained downscaling model across a multi-model ensemble.
- Spatial Transferability: A model developed and trained for a specific geographical region may not transfer effectively to other regions with different climatological characteristics, topographic complexities, or land cover types. Local adaptations are often necessary, which can be data-intensive.
- Domain Adaptation Techniques: These methods aim to explicitly adapt a model trained on a "source" domain (e.g., historical data from one GCM) to perform well on a "target" domain (e.g., future data from a different GCM) where labeled high-resolution data may be scarce or unavailable [8].
- Training on Diverse Data: A common strategy is to pre-train ML models on a wide array of data encompassing multiple GCMs, varied historical periods, and diverse geographical regions. The hypothesis is that exposure to greater variability will help the model learn more robust and invariant features that generalize better. For instance, Prasad et al. [28] found that training on diverse datasets (ERA5, MERRA2, NOAA CFSR) led to good zero-shot transferability for some tasks, though fine-tuning was still necessary for others, such as the two-simulation transfer involving NorESM data.
- Pseudo-Global Warming (PGW) Experiments: This approach involves training or evaluating models using historical data that has been perturbed to mimic certain aspects of future climate conditions (e.g., by adding a GCM-projected warming signal). This allows for a more systematic assessment of a model’s extrapolation capabilities under changed climatic states.
- Causal Machine Learning: There is growing interest in developing ML approaches that aim to learn underlying causal physical processes rather than just statistical correlations. Such models are hypothesized to be inherently more robust to distributional shifts.
The challenge of transferability implies that simply achieving high accuracy on a historical test set is insufficient. For ML-downscaled projections to be credible for future climate impact assessments, models must demonstrate robustness across different climate states and sources of climate model data.
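A minimal PGW-style perturbation, adding a month-dependent, GCM-derived warming delta to a historical series, could look like the sketch below; all numbers are synthetic.

```python
import numpy as np

def pseudo_global_warming(historical, monthly_delta, month_index):
    """Add a month-dependent, GCM-derived warming signal to a historical
    series, producing a pseudo-future record for stress-testing a model."""
    return historical + monthly_delta[month_index]

# Ten years of synthetic monthly temperatures with a seasonal cycle
rng = np.random.default_rng(3)
months = np.tile(np.arange(12), 10)
hist = (10.0 + 8.0 * np.sin(2.0 * np.pi * months / 12.0)
        + rng.normal(0.0, 1.0, months.size))
delta = np.linspace(2.0, 3.0, 12)    # hypothetical seasonally varying warming
pgw = pseudo_global_warming(hist, delta, months)
print("mean imposed warming:", (pgw - hist).mean())
```

Evaluating a trained downscaler on `pgw` inputs then probes how its outputs respond to a shifted mean state while day-to-day variability structure is preserved.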
Quantitative Case Studies
- Cross-model transfer (temperature UNet emulator). In a pseudo-reality experiment, daily RMSE for a UNet emulator rose from ∼0.9°C when evaluated on the same driving model used for training (UPRCM) to ∼2–2.5°C when applied to unseen ESMs; for warm extremes (99th percentile) under future climate, biases were mostly within C but reached up to C in some locations, and were larger than a linear baseline [121].
- GAN downscaling artifacts (near-surface winds). Deterministic GAN super-resolution exhibited systematic low-variance (low-power) bias at fine scales and, under some partial frequency-separation settings, isolated high-power spikes at intermediate wavenumbers; allowing the adversarial loss to act across all frequencies restored fine-scale variance, but it also raised pixelwise errors via the double-penalty effect [98].
- Classical SD variability & bias pitfalls (VALUE intercomparison). In a 50+ method cross-validation over Europe, several linear-regression SD variants showed very large precipitation biases—sometimes worse than raw model outputs—while some MOS techniques systematically under- or over-estimated variability (e.g., ISI-MIP under, DBS over), underscoring that method class alone does not guarantee robustness [5].
9.2. Physical Consistency and Interpretability
9.2.1. Ensuring Physically Plausible Outputs

- Physics-Informed Neural Networks (PINNs) and Constrained Learning:
  - Soft Constraints: This approach involves incorporating penalty terms into the model’s loss function that discourage violations of known physical laws. The total loss becomes a weighted sum of a data-fidelity term and a physics-based regularization term (e.g., L_total = L_data + λ · L_physics). Physics-informed loss functions have been explored to guide models towards more physically realistic solutions. While soft constraints can reduce the frequency and magnitude of physical violations, they may not eliminate them entirely and can introduce a trade-off between statistical accuracy and physical consistency [25].
  - Hard Constraints: These methods aim to strictly enforce physical laws by design, either by modifying the neural network architecture itself or by adding specialized output layers that ensure the predictions satisfy the constraints. Harder et al. [27] introduced additive, multiplicative, and softmax-based constraint layers that can guarantee, for example, mass conservation between low-resolution inputs and high-resolution outputs. Such hard-constrained approaches have been shown to not only ensure physical consistency but also, in some cases, improve predictive performance and generalization [27]. The rationale for PINNs includes reducing the dependency on large datasets and enhancing model robustness by ensuring physical consistency, especially in data-sparse regions or for out-of-sample predictions [26]. Recent work explores Attention-Enhanced Quantum PINNs (AQ-PINNs) for climate modeling applications like fluid dynamics, aiming for improved accuracy and computational efficiency [128].
- Hybrid Dynamical-Statistical Models: Another avenue is to combine the strengths of ML with traditional physics-based dynamical models (RCMs). This can involve using ML to emulate computationally expensive components of RCMs, to statistically post-process RCM outputs (e.g., for bias correction or further downscaling), or to develop hybrid frameworks where ML and dynamical components interact [8,29]. For example, "dynamical-generative downscaling" approaches combine an initial stage of dynamical downscaling with an RCM to an intermediate resolution, followed by a generative AI model (like a diffusion model) to further refine the resolution to the target scale. This leverages the physical consistency of RCMs and the efficiency and generative capabilities of AI [9]. Such hybrid models aim to achieve a balance between computational feasibility, physical realism, and statistical skill.
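A hard constraint in the spirit of the multiplicative layer of Harder et al. [27] can be sketched as a rescaling that forces each coarse cell's mean to be conserved in the super-resolved output. This is a simplified stand-in for the in-network layer, assuming a non-negative field such as precipitation.

```python
import numpy as np

def multiplicative_constraint(sr_patch, lr_value, eps=1e-12):
    """Rescale a super-resolved patch so its mean equals the coarse-cell
    value, guaranteeing conservation of a non-negative quantity."""
    return sr_patch * (lr_value / (sr_patch.mean() + eps))

def enforce_conservation(sr_field, lr_field, factor):
    """Apply the constraint to every coarse cell of a 2-D field.

    sr_field : (H*factor, W*factor) raw network output (non-negative)
    lr_field : (H, W) coarse input whose cell means must be conserved
    """
    out = sr_field.astype(float).copy()
    H, W = lr_field.shape
    for i in range(H):
        for j in range(W):
            sl = np.s_[i*factor:(i+1)*factor, j*factor:(j+1)*factor]
            out[sl] = multiplicative_constraint(out[sl], lr_field[i, j])
    return out

# Toy example: a 2x2 coarse precipitation field, 4x super-resolution
rng = np.random.default_rng(8)
factor = 4
lr = rng.gamma(2.0, 1.0, size=(2, 2))
sr_raw = rng.gamma(2.0, 1.0, size=(8, 8))
sr = enforce_conservation(sr_raw, lr, factor)
```

Placed as a differentiable output layer, this construction makes mass conservation hold by design rather than being merely encouraged by a penalty term.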
9.2.2. Explainable AI (XAI): Unmasking the "Black Box"
9.2.2.1. The Need for Interpretability
- Model Validation and Debugging: Understanding which input features a model relies on can help identify whether it has learned scientifically meaningful relationships or is exploiting spurious correlations or artifacts in the training data, a phenomenon of shortcut learning in which models may appear "right for the wrong reasons".
- Scientific Discovery: XAI can potentially reveal novel insights into climate processes by highlighting unexpected relationships learned by the model.
- Building Trust: Transparent models whose decision-making processes align with physical understanding are more likely to be trusted by domain scientists and policymakers.
- Identifying Biases: XAI can help uncover hidden biases in the model or the data it was trained on.
9.2.2.2. Common XAI Techniques Applied to Downscaling
- Saliency Maps and Feature Attribution: Methods like Integrated Gradients, DeepLIFT, and Layer-Wise Relevance Propagation (LRP) aim to attribute the model’s output (e.g., a high-resolution pixel value) back to the input features (e.g., coarse-resolution predictor fields), highlighting which parts of the input were most influential [8]. González-Abad et al. [77] introduced aggregated saliency maps for CNN-based downscaling, revealing that models might rely on spurious teleconnections or ignore important physical predictors. LRP has been adapted for semantic segmentation tasks in climate science, like detecting tropical cyclones and atmospheric rivers, to investigate whether CNNs use physically plausible input patterns [129].
- Gradient-weighted Class Activation Mapping (Grad-CAM): This technique produces a coarse localization map highlighting the important regions in the input image for predicting a specific class (or, adapted for regression, a specific output value) [130]. While useful for visualization, Grad-CAM may not differentiate well between input variables [129].
- SHAP (SHapley Additive exPlanations): Based on cooperative game theory, SHAP values explain the prediction of an instance by computing the contribution of each feature to the prediction [131]. SHAP has been noted for its ability to reveal features that degrade forecast accuracy, though it may inaccurately rank significant features in some contexts [132].
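Integrated Gradients itself is compact enough to sketch for a toy differentiable predictor with a known analytic gradient; the toy function and variable names are illustrative assumptions. The completeness property, attributions summing to f(x) - f(baseline), falls out directly.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=100):
    """Attribute f(x) - f(baseline) to each input feature by integrating
    the gradient along the straight-line path from baseline to x."""
    total = np.zeros_like(x, dtype=float)
    for a in np.linspace(0.0, 1.0, steps):
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy differentiable "predictor": f(x) = x0^2 + 3*x1, with x2 irrelevant
f = lambda x: x[0] ** 2 + 3.0 * x[1]
grad_f = lambda x: np.array([2.0 * x[0], 3.0, 0.0])

x = np.array([2.0, 1.0, 5.0])
attributions = integrated_gradients(grad_f, x, np.zeros(3))
print("attributions:", attributions)
```

The irrelevant third feature receives zero attribution, which is exactly the kind of sanity check one would hope to replicate when probing whether a downscaling CNN relies on physically meaningful predictors.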
9.2.2.3. Challenges in XAI for Climate Downscaling
- Faithfulness and Plausibility: Ensuring that explanations truly reflect the model’s internal decision-making process (faithfulness) and are consistent with physical understanding (plausibility) is challenging [133]. Different XAI methods can yield different, sometimes conflicting, explanations for the same prediction [134].
- Relating Attributions to Physical Processes: While methods like integrated gradients are mathematically sound, the resulting attribution maps can be difficult to directly relate to specific, understandable physical processes or mechanisms.
- Standardization: Methodologies and reporting standards for XAI in climate downscaling remain inconsistent, making comparisons across studies difficult. The lack of consensus on benchmark metrics further hinders systematic evaluation [133].
- Beyond Post Hoc Explanations: Current XAI often provides post hoc explanations. There is a growing call to move towards building inherently interpretable models or to integrate interpretability considerations into the model design process itself, drawing lessons from how dynamical climate models are understood at a component level. This involves striving for "component-level understanding" where model behaviors can be attributed to specific architectural components or learned representations.
9.3. Representation of Extreme Events
The Challenge
- Data Imbalance: Extreme events are rare by definition, leading to their under-representation in training datasets—an issue long recognized in extreme value analysis [108]. Models optimized to minimize average error across all data points may thus prioritize fitting common, non-extreme values, effectively “smoothing over” or underestimating extremes. In precipitation downscaling, tail-aware training (e.g., quantile losses) has been used precisely to counter this tendency [135]; empirical studies also note that standard DL architectures can underestimate heavy precipitation and smooth spatial variability in extremes [44,121].
- Loss Function Bias: MSE loss, for example, penalizes large errors quadratically, which might seem beneficial for extremes. However, because extremes are infrequent, their contribution to the total loss can be small, and the model may learn to predict values closer to the mean to minimize overall MSE, thereby underpredicting the magnitude of extremes. This regression-to-the-mean behavior under quadratic criteria is well documented in hydrologic error decompositions [136]; tail-focused alternatives such as quantile (pinball) losses offer a direct mitigation [135].
- Failure to Capture Compound Extremes: Models may also struggle to capture the co-occurrence of multiple extreme conditions (e.g., concurrent heat and drought), which requires learning cross-variable dependence structures. Reviews of compound events highlight the prevalence and impacts of such co-occurrences and the difficulty for standard single-target pipelines to reproduce them [137,138]; see also evidence on changing risks of concurrent heat–drought in the U.S. [139].
Specialized Approaches for Extremes
- Tailored Loss Functions: Using loss functions that give more weight to extreme values or are specifically designed for tail distributions. Examples include:
  - Weighted Loss Functions: Assigning higher weights to errors associated with extreme events (e.g., the weighting term in Eq. 1 of [140]).
  - Quantile Regression: Quantile Regression (QR) offers a powerful approach by directly modeling specific quantiles of a variable’s conditional distribution, which inherently allows a detailed focus on the distribution’s tails and thus on extreme values. For instance, Quantile Regression Neural Networks (QRNNs), as implemented by Cannon [40], provide a flexible, nonparametric, and nonlinear method that avoids restrictive assumptions about the data’s underlying distribution shape, a significant advantage for complex climate variables like precipitation where parametric forms are often inadequate. A key feature of the QRNN is its ability to handle mixed discrete-continuous variables, such as precipitation amounts (zero values alongside a skewed distribution of positive amounts), through censored quantile regression; this makes the model adept at representing both the occurrence and varying intensities of precipitation, including extremes. Cannon [40] notes this was the first implementation of a censored quantile regression model that is nonlinear in its parameters. Furthermore, the full predictive probability density function can be derived from the set of modeled quantiles, enabling more comprehensive probabilistic assessments: estimating arbitrary prediction intervals, calculating exceedance probabilities for critical thresholds (i.e., performing extreme value analysis), and evaluating risks associated with different outcomes. To enhance robustness and mitigate overfitting, especially where data for extremes are sparse, Cannon [40] incorporates weight penalty regularization and bootstrap aggregation (bagging). The practical relevance to downscaling is demonstrated through an application to a precipitation downscaling task, where the QRNN showed improved skill over linear quantile regression and climatological forecasts. Importantly, the paper also suggests that QRNNs could be a "viable alternative to parametric ANN models for nonstationary extremes", a crucial consideration for climate change impact studies where the characteristics of extreme events are expected to evolve. Relatedly, the Quantile-Regression-Ensemble (QRE) algorithm trains members on distinct subsets of precipitation observations corresponding to specific intensity levels, showing improved accuracy for extreme precipitation [141].
  - Bernoulli-Gamma or Tweedie Distributions: For precipitation, which has a mixed discrete-continuous distribution (zero vs. non-zero amounts, and varying intensity), loss functions based on these distributions (e.g., minimizing the Negative Log-Likelihood, NLL) can better model both occurrence and intensity, including extremes [141].
  - Dice Loss and Focal Loss: Explored for handling sample imbalance in heavy precipitation forecasts, with Dice Loss showing similarity to threat scores and effectively suppressing false alarms while improving hits for heavy precipitation [140].
- Generative Models (GANs and Diffusion Models): These models, by learning the underlying data distribution, can be better at generating realistic extreme events compared to deterministic regression models [32]. Diffusion models, in particular, have shown promise in capturing fine spatial features of extreme precipitation and reproducing intensity distributions more accurately than GANs or CNNs [142].
- Data Augmentation: Techniques to artificially increase the representation of extreme events in the training data, as used in the SRDRN model [14].
- Architectural Modifications: Designing model architectures or components specifically to handle extremes, such as the gradient-guided attention model for discontinuous precipitation by Xiang et al. [81] or multi-scale gradient processing in GANs. Beyond tailored loss functions and data augmentation, the architectural choices within generative frameworks and other advanced models are also pivotal for addressing the severe class imbalance inherent in extreme events and for capturing their unique characteristics. For instance, some GAN variants, such as evtGAN, integrate Extreme Value Theory to better model the tails of distributions associated with rare events. Other architectural improvements, like the use of multi-scale gradients in MSG-GAN-SD, aim for more stable training dynamics, which is a general challenge in GANs [58,123]. Diffusion models, while noted for their stable training and ability to capture fine spatial details of extremes such as precipitation [29], might inherently be better at representing multimodal distributions and capturing tail behavior due to their iterative refinement process. This could make them less prone to the averaging effects that often cause simpler architectures to underestimate extremes. Similarly, attention mechanisms in Transformers, if appropriately designed, could learn to focus on subtle precursors or localized features indicative of rare, high-impact events, thereby complementing specialized loss functions in a synergistic manner. Effectively tackling extreme events thus necessitates a holistic approach where the model architecture itself is capable of learning and representing the complex, often subtle, features that characterize these rare phenomena, rather than relying solely on adjustments to the loss function or data handling.
- Extreme Value Theory (EVT) Integration: Combining ML with EVT provides a statistical framework for modeling the tails of distributions. For instance, evtGAN [123] combines GANs with EVT to model spatial dependencies in temperature and precipitation extremes. Models using the Generalized Pareto Distribution (GPD) for tails can incorporate covariates from climate models to improve estimates [108].
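As an illustration of the EVT side of such hybrids, the sketch below is a minimal numpy-only example, not any cited model's implementation: it fits a GPD to threshold exceedances via method-of-moments estimators and computes a return level. The synthetic exponential "precipitation" data and the function names are assumptions for illustration.

```python
import numpy as np

def gpd_fit_mom(exceedances):
    """Method-of-moments estimates of GPD shape (xi) and scale (sigma)
    from threshold exceedances (valid for xi < 1/2)."""
    m = exceedances.mean()
    v = exceedances.var(ddof=1)
    xi = 0.5 * (1.0 - m * m / v)
    sigma = 0.5 * m * (m * m / v + 1.0)
    return xi, sigma

def return_level(u, xi, sigma, zeta_u, n_obs):
    """n_obs-observation return level for a GPD tail above threshold u,
    where zeta_u is the empirical exceedance probability P(X > u)."""
    if abs(xi) < 1e-6:  # xi -> 0 limit (exponential tail)
        return u + sigma * np.log(n_obs * zeta_u)
    return u + (sigma / xi) * ((n_obs * zeta_u) ** xi - 1.0)

rng = np.random.default_rng(0)
x = rng.exponential(scale=4.0, size=200_000)   # synthetic "daily precip"
u = np.quantile(x, 0.95)                       # tail threshold
exc = x[x > u] - u
xi, sigma = gpd_fit_mom(exc)                   # expect xi ~ 0, sigma ~ 4
rl = return_level(u, xi, sigma, (x > u).mean(), n_obs=365 * 100)
```

In practice the shape and scale could be regressed on large-scale covariates from a GCM, as in the covariate-dependent GPD approaches cited above.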
9.4. Uncertainty Quantification (UQ)
Sources of Uncertainty
- Aleatoric Uncertainty: Represents inherent randomness or noise in the data and the process being modeled (e.g., unpredictable small-scale atmospheric fluctuations).
- Epistemic Uncertainty: Arises from limitations in model knowledge, including model structure, parameter choices, and limited training data. This uncertainty is, in principle, reducible with more data or better models.
- Scenario Uncertainty: Uncertainty in future greenhouse gas emissions and other anthropogenic forcings.
- GCM Uncertainty: Structural differences among GCMs lead to a spread in projections even for the same scenario.
- Downscaling Model Uncertainty: The statistical downscaling model itself introduces uncertainty.
UQ Approaches in ML Downscaling
- Ensemble Methods:
  - Deep Ensembles: Training multiple instances of the same DL model with different random initializations (and potentially variations in training data via bootstrap sampling) and then combining their predictions to estimate both the mean and the spread (uncertainty) [79,143]. DeepESD [10] is an example of a CNN ensemble framework that quantifies inter-model spread from multiple GCM inputs and internal model variability. Deep ensembles can improve UQ, especially for future periods, by providing confidence intervals [79]. The optimal ensemble size for improving both the mean and UQ is often found to be around 3-6 models [79].
  - Multi-Model Ensembles (MMEs): Applying a downscaling model to outputs from multiple GCMs to capture inter-GCM uncertainty.
- Bayesian Neural Networks (BNNs): These models learn a probability distribution over their weights, rather than point estimates. By sampling from this posterior distribution, BNNs can provide probabilistic predictions that inherently quantify both aleatoric and epistemic uncertainty [144]. Techniques like Monte Carlo dropout are often used as a practical approximation to Bayesian inference in deep networks [144]. The Bayesian AIG-Transformer and the Precipitation CNN (PCNN) are examples of models incorporating these techniques for downscaling wind and precipitation [143,145]. Strengths: they provide a principled way to decompose uncertainty into aleatoric and epistemic components. Weaknesses: they can be computationally more expensive to train and sample from than deterministic models or simple ensembles.
- Generative Models for Probabilistic Output: GANs and Diffusion Models can, in principle, learn the conditional probability distribution and generate multiple plausible high-resolution realizations for a given low-resolution input, thus providing a form of ensemble for UQ. Diffusion models, in particular, are noted for their ability to model complex distributions effectively [32].
- Quantile Regression: As mentioned for extremes, models that predict quantiles of the distribution (e.g., Quantile Regression Neural Networks [40]) directly provide information about the range of possible outcomes.
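The deep-ensemble idea above can be sketched in a few lines. The example below is a toy regression with a minimal numpy network; all names, data, and hyperparameters are illustrative assumptions, not from any cited framework. Several members are trained from different random initializations and their predictions aggregated into an ensemble mean and spread.

```python
import numpy as np

def train_mlp(x, y, hidden=16, lr=0.05, steps=2000, seed=0):
    """Tiny one-hidden-layer tanh net trained by full-batch gradient
    descent; the seed controls the random initialization."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0, 0.5, (1, hidden)); b1 = np.zeros(hidden)
    w2 = rng.normal(0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        h = np.tanh(x @ w1 + b1)
        pred = h @ w2 + b2
        err = pred - y                       # gradient of 0.5*MSE w.r.t. pred
        gw2 = h.T @ err / len(x); gb2 = err.mean(0)
        dh = (err @ w2.T) * (1 - h ** 2)     # backprop through tanh
        gw1 = x.T @ dh / len(x); gb1 = dh.mean(0)
        w1 -= lr * gw1; b1 -= lr * gb1; w2 -= lr * gw2; b2 -= lr * gb2
    return lambda xq: np.tanh(xq @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, (256, 1))
y = np.sin(x) + rng.normal(0, 0.1, x.shape)   # noisy toy "predictand"

members = [train_mlp(x, y, seed=s) for s in range(5)]  # 5 random inits
xq = np.linspace(-3, 3, 101).reshape(-1, 1)
preds = np.stack([m(xq) for m in members])             # (5, 101, 1)
mean, spread = preds.mean(0), preds.std(0)             # ensemble mean + UQ
```

The per-point `spread` is the ensemble's uncertainty estimate; in the downscaling setting each member would be a full DL model and `xq` a grid of coarse predictors.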
Challenges in UQ
- Computational Cost: Probabilistic methods like BNNs and large ensembles can be computationally intensive.
- Validation of Uncertainty: Validating the reliability of uncertainty estimates, especially for future projections where ground truth is unavailable, is a significant challenge. Pseudo-reality experiments are often used for this [79].
- Communication of Uncertainty: Effectively communicating complex, multi-faceted uncertainty information to end-users and policymakers is crucial but non-trivial.
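One simple way to probe the "validation of uncertainty" challenge, where reference data do exist, is an empirical coverage check: compare the nominal level of central prediction intervals against the fraction of observations they actually cover. The numpy sketch below contrasts a calibrated with an overconfident synthetic model; both distributions are assumptions chosen purely for illustration.

```python
import numpy as np

def interval_coverage(samples, obs, level=0.9):
    """Empirical coverage of central prediction intervals built from an
    ensemble/posterior sample matrix of shape (n_members, n_cases).
    A reliable model should cover ~`level` of the observations."""
    lo = np.quantile(samples, (1 - level) / 2, axis=0)
    hi = np.quantile(samples, (1 + level) / 2, axis=0)
    return float(np.mean((obs >= lo) & (obs <= hi)))

rng = np.random.default_rng(1)
obs = rng.normal(0, 1, 5000)
calibrated    = rng.normal(0, 1.0, (200, 5000))  # spread matches the truth
overconfident = rng.normal(0, 0.5, (200, 5000))  # spread too narrow

cov_ok  = interval_coverage(calibrated, obs)     # close to 0.90
cov_bad = interval_coverage(overconfident, obs)  # well below 0.90
```

Systematic under-coverage of the overconfident model is exactly the failure mode that pseudo-reality experiments aim to detect for future periods.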
9.5. Reproducibility, Data Handling, and Methodological Rigor
- Reproducibility: Ensuring that research findings can be independently verified is a cornerstone of scientific progress. In ML-based downscaling, this involves:
  - Public Code and Data: Sharing model code, training data (or clear pointers to standard datasets), and pre-trained model weights [8].
  - Containerization and Deterministic Environments: Using tools like Docker to create reproducible software environments and ensuring deterministic operations in model training and inference where possible [146].
  - Well-Defined Train/Test Splits and Evaluation Protocols: Clearly documenting how data is split for training, validation, and testing, and using standardized evaluation protocols (like VALUE [117]) to facilitate fair comparisons across studies.
    - Baselines: The seven-method intercomparison by Vandal et al. [1] justifies using strong linear/bias-correction baselines (BCSD, Elastic-Net, hybrid BC+ML) alongside modern DL.
    - Spectral/structure metrics: Following Harris et al. [100] and Annau et al. [98], include power spectra/structure functions, fraction skill scores, and spatial-coherence diagnostics to detect texture hallucinations and scale mismatch.
    - Uncertainty metrics: For probabilistic models (GANs/VAEs/diffusion), report CRPS, reliability diagrams/PIT, and quantile/interval coverage (as in [100,40]).
    - Tail-aware metrics: Report quantile-oriented scores (e.g., QVSS), return-level/return-period consistency, and extreme-event FSS where relevant (cf. [124]).
    - Warming/OOD tests: Explicitly include warming or out-of-distribution tests (e.g., pseudo-global-warming or future-slice validation). Rampal et al. [30] show that intensity-aware losses and residual two-stage designs can improve robustness for extremes under warming.
  - Active Frontiers: As noted in recent papers (e.g., Quesada-Chacón et al. [8]), while reproducibility advances are being made through such efforts, consistent adoption of best practices across the community is still needed to ensure the robustness and verifiability of research findings.
- Data Handling Issues:
  - Collinearity: High correlation among predictor variables can make it difficult to interpret model feature importance and can sometimes lead to unstable model training. This is addressed through feature selection techniques (e.g., PCA), regularization methods inherent in many ML models, or careful predictor selection based on domain knowledge [132].
  - Feature Evaluation: Systematically evaluating the importance of different input features for downscaling performance is crucial for model understanding and simplification. XAI techniques can aid in this, but ablation studies (removing features and observing performance changes) are also common [132].
  - Random Initialization: The performance of DL models can be sensitive to the random initialization of model weights. Reporting results averaged over multiple runs with different initializations is good practice for robustness (Quesada-Chacón et al. [8]; Baño-Medina et al. [11]). In addition to seed averaging, two complementary practices help reduce sensitivity and convey uncertainty: (i) training independent replicas and aggregating them as a deep ensemble to capture variability due to different initializations (González-Abad and Baño-Medina [79]; Lakshminarayanan et al. [99]); and (ii) using approximate Bayesian methods such as dropout-as-Bayesian at test time to reflect parameter uncertainty (Gal and Ghahramani [144]).
  - Suppressor Variables: These are variables that, when included, improve the predictive power of other variables, even if they themselves are not strongly correlated with the predictand. Identifying and understanding their role can be complex but important for model performance [147].
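The collinearity and PCA points above can be made concrete. The numpy sketch below uses hypothetical predictor names (`z500`, `t850`, `q700`) and synthetic data: it flags a nearly collinear predictor pair via the correlation matrix, then decorrelates the inputs with an SVD-based PCA retaining 95% of the variance.

```python
import numpy as np

# Hypothetical predictor matrix: three atmospheric fields, two of which
# are nearly collinear (a common situation for thermodynamic predictors).
rng = np.random.default_rng(0)
n = 1000
z500 = rng.normal(size=n)
t850 = 0.95 * z500 + 0.05 * rng.normal(size=n)   # nearly collinear with z500
q700 = rng.normal(size=n)
X = np.column_stack([z500, t850, q700])

corr = np.corrcoef(X, rowvar=False)   # inspect |r| ~ 1 off-diagonal pairs

# PCA via SVD of the standardized predictors: keep the components that
# explain 95% of the variance, yielding decorrelated inputs for training.
Xs = (X - X.mean(0)) / X.std(0)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
X_pca = Xs @ Vt[:k].T                 # (n, k) decorrelated features
```

Here the near-duplicate pair collapses into a single leading component, so two components suffice; with real predictor stacks the retained dimension is a tuning choice.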
- Methodological Rigor in Evaluation:
  - Beyond Standard Metrics: While RMSE is a common metric, it may not capture all relevant aspects of downscaling performance, especially for spatial patterns, temporal coherence, or extreme events. A broader suite of metrics is needed, including:
    - Spatial correlation and the structural similarity index (SSIM) [55].
    - Metrics for extremes (e.g., precision, recall, and critical success index for precipitation thresholds; metrics from Extreme Value Theory such as GPD parameters or return levels) [8].
    - Metrics for distributional similarity (e.g., Earth Mover's Distance, Kullback-Leibler divergence) [148].
    - Metrics for temporal coherence and spatial consistency (e.g., spectral analysis, variogram analysis, or specific metrics like Restoration Rate and Consistency Degree from TemDeep [7]). The Modified Kling-Gupta Efficiency (KGE) decomposes performance into correlation, bias, and variability components [136,149].
  - Out-of-Sample Validation: Crucially, models must be validated on data that is truly independent of the training set. This is particularly challenging for spatio-temporal climate data, which is inherently non-independent and identically distributed (non-IID) due to strong spatial and temporal autocorrelation. Standard k-fold cross-validation, which randomly splits data, often violates the independence assumption. Spatial autocorrelation means that nearby data points are more similar than distant points, so random splits can lead to data leakage, where information from the validation set is implicitly present in the training set due to spatial proximity, resulting in overly optimistic performance estimates [117]. Similarly, temporal dependencies in climate time series mean that standard cross-validation can inadvertently train on future data to predict the past, which is unrealistic for prognostic applications [80]. The failure to use appropriate validation for non-IID data contributes significantly to the "performance paradox", where models appear to perform well under flawed validation schemes but fail when evaluated more rigorously or deployed on truly independent OOD data. Robust OOD validation, using specialized cross-validation techniques, is therefore essential to assess true generalization and avoid misleading performance metrics. Such techniques include:
    - Spatial k-fold (or blocked) cross-validation: Data is split into spatially contiguous blocks to ensure greater independence between training and validation sets.
    - Leave-Location-Out (LLO) cross-validation: Entire regions or distinct geographical locations are held out for testing, providing a stringent test of spatial generalization [93].
    - Buffered cross-validation: A buffer zone is created around test points, and data within this buffer is excluded from training to minimize leakage due to spatial proximity [150].
    - Temporal (blocked) cross-validation / forward chaining: For time-series aspects, data is split chronologically, ensuring the model is always trained on past data and tested on future data, mimicking operational forecasting. Beyond these, "warm-test" periods (pseudo-future), such as those created through pseudo-global-warming (PGW) experiments, are also used for extrapolation assessment [151]. Adopting these robust validation strategies is a prerequisite for accurately assessing generalization and building trust in reported model performance.
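A minimal version of spatial blocked cross-validation can be written as follows. This numpy sketch (the function name and block counts are illustrative) tiles station coordinates into contiguous latitude-longitude blocks and yields leave-one-block-out train/test splits; buffered variants would additionally drop training points within some distance of the test block.

```python
import numpy as np

def spatial_block_folds(lat, lon, n_blocks_lat=3, n_blocks_lon=3):
    """Assign each station/grid point to a spatially contiguous block
    (quantile-based tiling) and yield (train_idx, test_idx) pairs so that
    each fold holds out one whole block."""
    lat_edges = np.quantile(lat, np.linspace(0, 1, n_blocks_lat + 1))
    lon_edges = np.quantile(lon, np.linspace(0, 1, n_blocks_lon + 1))
    bi = np.clip(np.searchsorted(lat_edges, lat, side="right") - 1,
                 0, n_blocks_lat - 1)
    bj = np.clip(np.searchsorted(lon_edges, lon, side="right") - 1,
                 0, n_blocks_lon - 1)
    block = bi * n_blocks_lon + bj
    for b in np.unique(block):
        yield np.where(block != b)[0], np.where(block == b)[0]

rng = np.random.default_rng(0)
lat = rng.uniform(35, 45, 500)    # synthetic station coordinates
lon = rng.uniform(-10, 5, 500)
folds = list(spatial_block_folds(lat, lon))   # 9 leave-one-block-out folds
```

Holding out entire blocks (rather than random points) is what prevents the spatial-leakage optimism described above.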
9.5.1. Challenges in Benchmarking and Inter-comparison
10. Future Trajectories: Grand Challenges and Open Questions
10.1. Grand Challenge 1: Overcoming Non-Stationarity
- Foundation Models: Large, pretrained backbones learned from massive, diverse earth-system data (e.g., multiple GCMs or reanalyses) that provide broad, reusable priors; usable zero-/few-shot or with fine-tuning [23].
- Domain Adaptation and Transfer Learning: Methods to adapt models from a source to a target distribution (e.g., historical→future, reanalysis→GCM, region A→region B), including fine-tuning FMs or smaller models and explicit shift-handling techniques [28].
- Rigorous OOD Testing: Systematically using Pseudo-Global Warming (PGW) experiments and holding out entire GCMs or future time periods for validation to stress-test and quantify extrapolation capabilities [30].
- How can we formally verify that a model has learned a causal physical process rather than a spurious shortcut?
- What are the theoretical limits of generalization for a given model architecture and training data diversity?
- Can online learning systems be developed to allow models to adapt continuously as the climate evolves, mitigating concept drift in near-real-time applications?
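A pseudo-global-warming-style stress test can be mocked up in a few lines. The sketch below is a deliberately simple synthetic example, not a real downscaling model: it fits a linear emulator on "historical" data and measures how its error grows when the same (nonlinear) relationship is evaluated under a uniform +4 K shift, illustrating the extrapolation gap such OOD tests are designed to expose.

```python
import numpy as np

rng = np.random.default_rng(0)
t_hist = rng.normal(15.0, 5.0, 2000)                  # coarse "temperature"
y_hist = 0.5 * t_hist + 0.02 * t_hist ** 2 \
         + rng.normal(0, 0.5, 2000)                   # true nonlinear target

A = np.column_stack([np.ones_like(t_hist), t_hist])   # linear emulator
coef, *_ = np.linalg.lstsq(A, y_hist, rcond=None)

def predict(t):
    return coef[0] + coef[1] * t

def rmse(t, y):
    return float(np.sqrt(np.mean((predict(t) - y) ** 2)))

t_pgw = t_hist + 4.0                                  # uniform +4 K shift
y_pgw = 0.5 * t_pgw + 0.02 * t_pgw ** 2 + rng.normal(0, 0.5, 2000)

err_in, err_ood = rmse(t_hist, y_hist), rmse(t_pgw, y_pgw)
# expect err_ood > err_in: the emulator missed the curvature it must extrapolate
```

Reporting the ratio `err_ood / err_in` (here for a toy model) is one simple way to quantify the extrapolation penalty under a warming shift.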
10.2. Grand Challenge 2: Achieving Verifiable Physical Consistency
- Hybrid Dynamical-Statistical Models: Frameworks like dynamical-generative downscaling leverage a physical model to provide a consistent foundation, which an ML model then refines. This approach strategically outsources the enforcement of complex physics to a trusted dynamical core [9].
- How can we design computationally tractable physical loss terms for complex, non-differentiable processes like cloud microphysics or radiative transfer?
- What is the optimal trade-off between the flexibility of soft constraints and the guarantees of hard constraints for multi-variable downscaling?
- Can we develop methods to automatically discover relevant physical constraints from data, rather than relying solely on pre-defined equations?
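As a concrete instance of a soft physical constraint of the kind discussed above, the numpy sketch below penalizes violations of coarse-grid conservation: the high-resolution prediction is block-averaged back to the input grid and any deviation from the low-resolution field is added to the data loss. The function names and the weight `lam` are illustrative assumptions, not a specific published formulation.

```python
import numpy as np

def conservation_penalty(pred_hr, lr, factor):
    """Soft physical constraint: block-average the high-resolution
    prediction back to the coarse grid and penalize deviation from
    the low-resolution input (mean/mass consistency)."""
    h, w = lr.shape
    coarse = pred_hr.reshape(h, factor, w, factor).mean(axis=(1, 3))
    return float(np.mean((coarse - lr) ** 2))

def total_loss(pred_hr, target_hr, lr, factor, lam=1.0):
    """Data term (MSE to the HR target) plus a weighted conservation term."""
    data = float(np.mean((pred_hr - target_hr) ** 2))
    return data + lam * conservation_penalty(pred_hr, lr, factor)

# A prediction that exactly preserves each coarse-cell mean pays no penalty.
lr = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = np.kron(lr, np.ones((4, 4)))   # upsampling that preserves block means
pen = conservation_penalty(pred, lr, factor=4)   # 0.0
```

In a differentiable framework the same penalty would be written with the framework's pooling operation so its gradient flows into training.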
10.3. Grand Challenge 3: Reliable and Interpretable Uncertainty Quantification (UQ)
- How can we effectively validate UQ for far-future projections where no ground truth exists?
- How can we decompose total uncertainty into its constituent sources in a computationally efficient manner for deep learning models?
- How can we best communicate complex, multi-dimensional uncertainty information to non-expert stakeholders to support robust decision-making?
10.4. Grand Challenge 4: Skillful Prediction of Climate Extremes
- Integration with Extreme Value Theory (EVT): Hybrid models that combine ML with statistical EVT offer a principled way to model the extreme tails of climate distributions [123].
- How do we ensure that generative models produce extremes that are not only statistically realistic but also physically plausible in their spatio-temporal evolution?
- How will the statistics of compound extremes (e.g., concurrent heat and drought) change, and can ML models capture their evolving joint probabilities?
- Can we develop models that explicitly predict changes in the parameters of EVT distributions (e.g., GPD parameters) as a function of large-scale climate drivers?
10.5. Current State Assessment
- Performance Paradox: ML models, particularly deep learning architectures like CNNs, U-Nets, and GANs, often demonstrate excellent performance on in-sample test data or when downscaling historical reanalysis products. They excel at learning complex spatial patterns and non-linear relationships, leading to visually compelling high-resolution outputs. However, this strong in-sample performance frequently does not translate to robust extrapolation on out-of-distribution data (e.g., future climate scenarios from different GCMs or entirely new regions)—a critical limitation given that downscaling is intended to inform future projections.
- Trust Deficit: The limited transparency of many deep learning models, together with historically sparse uncertainty quantification, constrains end-user confidence and practical uptake. Without transparent explanations of model behavior and robust uncertainty estimates, the utility of ML-downscaled products for decision-making remains limited.
- Physical Inconsistency: Many current ML downscaling methods do not inherently enforce fundamental physical laws (e.g., conservation of mass/energy, thermodynamic constraints). Resulting fields can be statistically plausible yet physically unrealistic, undermining scientific interpretability and downstream use.
- Challenges with Extreme Events: Accurately capturing the frequency, intensity, and spatial characteristics of extremes remains difficult. Class imbalance and commonly used loss functions tend to underestimate magnitudes and misplace patterns of high-impact events; specialized targets, data curation, and evaluation for extremes are required.
- Data Limitations and Methodological Gaps: Scarcity of high-quality, high-resolution reference data in many regions, together with inconsistent metrics, validation protocols, and reporting standards, impedes apples-to-apples comparison and cumulative progress. Recent work emphasizes that computational repeatability is essential for building trust and enabling rigorous comparison across methods [8].
Answers to the Research Questions
10.6. Priority Research Directions
- Robust Extrapolation and Generalization Frameworks (Addressing RQ2):
- Systematic Evaluation Protocols: Develop and adopt standardized protocols and benchmark datasets specifically designed to test model transferability across different climate states (historical, near-future, far-future), different GCMs/RCMs, and diverse geographical regions. This includes rigorous out-of-sample testing beyond simple hold-out validation.
- Metrics for Generalization: Establish and utilize metrics that explicitly quantify generalization and extrapolation capability, rather than relying solely on traditional skill scores computed on in-distribution test data.
- Understanding Failure Modes: Conduct systematic analyses of why and when ML models fail to extrapolate, linking failures to model architecture, training data characteristics, or violations of physical assumptions.
- Physics-ML Integration and Hybrid Modeling Standards (Addressing RQ2):
- Standardized PIML Approaches: Develop and disseminate standardized methods and libraries for incorporating physical constraints (both hard and soft) into common ML architectures used for downscaling. This includes guidance on formulating physics-based loss terms and designing constraint-aware layers.
- Validation Suites for Physical Consistency: Create benchmark validation suites that explicitly test for adherence to key physical laws (e.g., conservation principles, thermodynamic consistency, realistic spatial gradients and inter-variable relationships).
- Advancing Hybrid Models: Foster research into hybrid models that effectively combine the strengths of process-based dynamical models with the efficiency and pattern-recognition capabilities of ML, including RCM emulators and generative AI approaches for refining RCM outputs (Tomasi et al. [18]).
- Operational Uncertainty Quantification (Addressing RQ3):
- Beyond Point Estimates: Shift the focus from deterministic (single-value) predictions to probabilistic projections that provide a comprehensive assessment of uncertainty.
- Efficient UQ Methods: Develop and promote computationally efficient UQ methods suitable for high-dimensional DL models, such as scalable deep ensembles, practical Bayesian deep learning techniques (e.g., with improved variational inference or MC dropout strategies), and generative models capable of producing reliable ensembles [19,79,144].
- Decomposition and Attribution of Uncertainty: Advance methods to decompose total uncertainty into its constituent sources (e.g., GCM uncertainty, downscaling model uncertainty, internal variability) and attribute uncertainty to specific model components or assumptions.
- User-Oriented Uncertainty Communication: Develop effective tools and protocols for communicating complex uncertainty information to diverse end-users in an understandable and actionable manner.
- Explainable and Interpretable Climate AI (Addressing RQ3):
- Domain-Specific XAI Metrics: Establish XAI metrics and methodologies that are specifically relevant to climate science, moving beyond generic XAI techniques to those that can provide physically meaningful insights.
- Linking ML Decisions to Physical Processes: Develop XAI techniques that can causally link ML model decisions and internal representations to known climate processes and drivers, rather than just highlighting input feature importance [133].
- Standards for Model Documentation and Interpretation: Promote standards for documenting ML model architectures, training procedures, and the results of interpretability analyses to enhance transparency and facilitate critical assessment by the broader scientific community [133].
- Community Infrastructure and Benchmarking (Addressing all RQs):
- Shared Evaluation Frameworks: Expand and support the community-driven evaluation frameworks (e.g., extending the VALUE initiative [117]) to facilitate systematic intercomparison of ML downscaling methods using standardized datasets and metrics.
- Reproducible Benchmark Datasets: Curate and maintain open, high-quality benchmark datasets specifically designed for training and evaluating ML downscaling models across various regions, variables, and climate conditions. These should include data for testing transferability and extreme event representation.
- Open-Source Implementations: Encourage and support the development and dissemination of open-source software implementations of key ML downscaling methods and PIML components to lower the barrier to entry and promote reproducibility.
- Collaborative Platforms: Foster collaborative platforms and initiatives (e.g., CORDEX Task Forces on ML [76]) for sharing knowledge, best practices, model components, and downscaled datasets.
11. Ethical Considerations, Responsible Development, and Governance in ML-Based Climate Downscaling
11.1. Algorithmic Bias, Fairness, and Equity
Why This Matters for Downscaling (Tied to Prior Sections)

Practitioner Checklist
- Report data coverage maps and per-region sample counts used in training and validation (ties to Section 6.5); stratify metrics by data-rich vs. data-scarce subregions.
- Use shift-robust training/evaluation (Section 9.1): e.g., held-out regions, time-split OOD tests, and stress tests on atypical synoptic regimes.
- Track extreme-aware metrics (Section 9.3): CSI/POD at multi-thresholds, tail-MAE, quantile errors, and bias of return levels.
- Quantify epistemic uncertainty (Section 9.4) and suppress overconfident deployment in regions with low training support; communicate abstentions or wide intervals as a feature, not a bug.
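The threshold-based extreme scores in the checklist (POD, FAR, CSI) can be computed from a simple contingency table. The numpy sketch below uses small hypothetical precipitation arrays purely for illustration.

```python
import numpy as np

def extreme_event_scores(pred, obs, threshold):
    """POD (hit rate), FAR, and CSI for exceedances of a threshold,
    computed from paired prediction/observation arrays."""
    hit  = np.sum((pred >= threshold) & (obs >= threshold))
    miss = np.sum((pred <  threshold) & (obs >= threshold))
    fa   = np.sum((pred >= threshold) & (obs <  threshold))
    pod = hit / (hit + miss) if hit + miss else np.nan
    far = fa / (hit + fa) if hit + fa else np.nan
    csi = hit / (hit + miss + fa) if hit + miss + fa else np.nan
    return pod, far, csi

obs  = np.array([0.0, 12.0, 3.0, 25.0, 8.0, 40.0])   # mm/day, toy values
pred = np.array([1.0, 15.0, 2.0,  9.0, 11.0, 35.0])
pod, far, csi = extreme_event_scores(pred, obs, threshold=10.0)
```

Stratifying these scores across multiple thresholds, and by data-rich versus data-scarce subregions as recommended above, reveals where a model's skill for extremes actually comes from.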
Mini-Case (Sparse Data → Biased Extremes → Policy Risk)
11.2. Transparency, Accountability, and Liability
Make Transparency Operational
11.3. Accessibility, Inclusivity, and the Digital Divide
Tie to Data-Scarce Regions and Model Bias
Actionable Steps
- Publish downscaling baselines and trained weights under permissive licenses; provide lightweight inference paths for low-resource agencies.
- Release region-stratified diagnostics and data coverage artifacts so local stakeholders can judge fitness-for-use.
- Prioritize augmentation/sampling schemes that upweight underrepresented regimes and seasons, with ablation evidence (links to Section 7). Any augmentation must, of course, respect the physical invariances of climate fields, so that transformed samples remain physically valid.
11.4. Misinterpretation, Misuse, and Communication of Uncertainty
Communicating Uncertainty for Decisions
11.5. Data Governance, Privacy, and Ownership
Downscaling-Specific Governance
11.6. The Need for Governance Frameworks and Best Practices
A Minimal, Testable Governance Bundle (Linked to Prior Sections)
12. Future Outlook: The Next Decade of ML in Climate Downscaling
12.1. Emerging Paradigms
12.1.1. Foundation Models for Climate Downscaling
- Potential Benefits:
- Enhanced Transfer Learning: These models could provide powerful, pre-trained representations of atmospheric and Earth system dynamics, enabling effective transfer learning to specific downscaling tasks across various regions, variables, and GCMs with significantly reduced data requirements for fine-tuning [75].
- Multi-Task Capabilities: Foundation models can be designed for multiple downstream tasks, including forecasting, downscaling, and parameterization learning, offering a versatile tool for climate modeling.
- Implicit Physical Knowledge: Through pre-training on vast datasets governed by physical laws, these models might implicitly learn and encode some degree of physical consistency, although explicit PIML techniques will likely still be necessary to guarantee it.
- Challenges:
- Developing and training these massive models require substantial computational resources and curated large-scale datasets. Ensuring their generalizability and avoiding the propagation of biases learned during pre-training are also critical research areas.
12.1.2. Hybrid Hierarchical and Multi-Scale Approaches
- Global ML models or foundation models providing coarse, bias-corrected boundary conditions.
- Regional physics-informed ML models or RCM emulators operating at intermediate scales, incorporating more detailed regional physics.
- Local stochastic generators or specialized ML models (e.g., for extreme events or specific microclimates) providing the final layer of high-resolution detail and variability.
12.1.3. Online Learning and Continuous Model Adaptation
- Benefits:
- This could help mitigate the stationarity assumption by allowing models to learn changing relationships over time and improve their performance for ongoing or near-real-time downscaling applications.
- Challenges:
- Ensuring model stability, avoiding catastrophic forgetting (where learning new data degrades performance on old data), and managing the computational demands of continuous retraining are significant hurdles.
12.1.4. Deep Integration of Causal Inference and Process Understanding
12.2. Critical Success Factors
- Interdisciplinary Collaboration: Sustained and deep collaboration between climate scientists, ML researchers, statisticians, computational scientists, and domain experts from impact sectors is essential. Climate scientists bring crucial domain knowledge about physical processes and data characteristics, while ML experts provide algorithmic innovation.
- Open Science Practices: The continued adoption of open science principles—including the sharing of code, datasets, model weights, and standardized evaluation frameworks—is vital for reproducibility, transparency, and accelerating collective progress [8]. Initiatives like CORDEX and CMIP6, which foster data sharing and model intercomparison, provide valuable models for the ML downscaling community [170,171].
- Deep Stakeholder Engagement and Co-production Throughout the Lifecycle: While listed as a critical success factor, stakeholder engagement and co-design deserve elevated emphasis, framed not merely as a desirable component but as an essential element integrated throughout the entire research, development, and deployment lifecycle of ML-based downscaling tools and climate services. Moving beyond consultation, true co-production involves iterative, sustained processes of relationship building, shared understanding, and joint output development with end-users and affected communities. Actively involving end-users from diverse sectors (e.g., agriculture, water resource management, urban planning, public health, indigenous communities) from the very outset of ML downscaling projects offers profound benefits [165]:
- Ensuring Relevance and Actionability: Co-production helps ensure that ML downscaling efforts are targeted towards producing genuinely useful, context-specific, and actionable information that meets the actual needs of decision-makers rather than being solely technology-driven.
- Defining User-Relevant Evaluation Metrics: Collaboration with users can help define evaluation metrics and performance targets that reflect their specific decision contexts and thresholds of concern, moving beyond purely statistical measures to those that indicate practical utility.
- Building Trust and Facilitating Uptake: A transparent, demand-driven, and participatory development process fosters trust in the ML models and their outputs. When users are part of the creation process, they gain a better understanding of the model’s capabilities and limitations, which facilitates the responsible uptake and integration of ML-derived products into their decision-making frameworks.
- Addressing the "Trust Deficit": By fostering a collaborative environment, co-production directly addresses the "trust deficit". It allows for a two-way dialogue where the complexities, uncertainties, and assumptions inherent in ML downscaling are openly discussed and understood by both developers and users, leading to more realistic expectations and appropriate applications.
- Incorporating Local and Indigenous Knowledge: Participatory approaches can facilitate the integration of valuable local and indigenous knowledge systems with scientific data, leading to more holistic and effective adaptation strategies [172].
This deep engagement transforms the development of ML downscaling from a purely technical exercise into a collaborative endeavor aimed at producing societal value and supporting equitable climate resilience [165].
13. Conclusions
- Robustly Generalizable: Capable of extrapolating reliably to unseen climate model outputs, future climate scenarios, and diverse geographical regions. This necessitates rigorous validation frameworks that explicitly test for out-of-distribution performance and the development of models that learn more fundamental, transferable relationships.
- Physically Consistent: Adhering to known physical laws and principles. The integration of physical knowledge, through physics-informed neural networks (hard or soft constraints) or hybrid modeling approaches, is crucial for enhancing the scientific credibility and realism of downscaled projections.
- Interpretable and Explainable: Providing transparent insights into how predictions are made. Advancements in domain-specific XAI are needed to move beyond simple feature attribution to a deeper understanding of whether models are learning scientifically meaningful processes.
- Uncertainty-Aware: Providing comprehensive and reliable quantification of the various sources of uncertainty inherent in climate projections. This involves moving beyond deterministic predictions to probabilistic outputs that can effectively inform risk assessment and decision-making.
- Proficient with Extremes: Specifically designed and validated to capture the changing characteristics of high-impact extreme weather and climate events, which are often the most critical aspects for adaptation.
References
- Vandal, T.; Kodra, E.; Ganguly, A.R. Intercomparison of machine learning methods for statistical downscaling: The case of daily and extreme precipitation. Theoretical and Applied Climatology 2019, 137, 557–576.
- Rampal, N.; Gibson, P.B.; Sood, A.; Stuart, S.; Fauchereau, N.C.; Brandolino, C.; Noll, B.; Meyers, T. High-resolution downscaling with interpretable deep learning: Rainfall extremes over New Zealand. Weather and Climate Extremes 2022, 38, 100525.
- Rampal, N.; Hobeichi, S.; Gibson, P.B.; Baño-Medina, J.; Abramowitz, G.; Beucler, T.; González-Abad, J.; Chapman, W.; Harder, P.; Gutiérrez, J.M. Enhancing regional climate downscaling through advances in machine learning. Artificial Intelligence for the Earth Systems 2024, 3, 230066.
- Maraun, D.; Wetterhall, F.; Ireson, A.M.; Chandler, R.E.; Kendon, E.J.; Widmann, M.; Brienen, S.; Rust, H.W.; Sauter, T.; Themeßl, M.; et al. Precipitation downscaling under climate change: Recent developments to bridge the gap between dynamical models and the end user. Reviews of Geophysics 2010, 48.
- Gutiérrez, J.M.; Maraun, D.; Widmann, M.; Huth, R.; Hertig, E.; Benestad, R.; Roessler, O.; Wibig, J.; Wilcke, R.; Kotlarski, S.; et al. An intercomparison of a large ensemble of statistical downscaling methods over Europe: Results from the VALUE perfect predictor cross-validation experiment. International Journal of Climatology 2019, 39, 3750–3785.
- Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat, F. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204.
- Wang, L.; Li, Q.; Lv, Q.; Peng, X.; You, W. TemDeep: a self-supervised framework for temporal downscaling of atmospheric fields at arbitrary time resolutions. Geoscientific Model Development 2025, 18, 2427–2442.
- Quesada-Chacón, D.; Stöger, J.; Güntner, A.; Bernhofer, C. Repeatable high-resolution statistical downscaling through deep learning. Geoscientific Model Development 2022, 15, 7217–7244.
- Lopez-Gomez, I.; Wan, Z.Y.; Zepeda-Núñez, L.; Schneider, T.; Anderson, J.; Sha, F. Dynamical-generative downscaling of climate model ensembles. Proceedings of the National Academy of Sciences 2025, 122, e2420288122.
- Baño Medina, J.; Manzanas, R.; Cimadevilla, E.; Fernández, J.; González-Abad, J.; Cofiño, A.S.; Gutiérrez, J.M. Downscaling multi-model climate projection ensembles with deep learning (DeepESD): contribution to CORDEX EUR-44. Geoscientific Model Development 2022, 15, 6747–6758.
- Baño-Medina, J.; Manzanas, R.; Gutiérrez, J.M. Configuration and intercomparison of deep learning neural models for statistical downscaling. Geoscientific Model Development 2019, 12, 4411–4426.
- Vandal, T.; Kodra, E.; Gosh, S.; Ganguly, A.R. DeepSD: Generating High-Resolution Climate Change Projections Through Single Image Super-Resolution. arXiv preprint arXiv:1703.03126, 2017.
- Baño-Medina, J.; Manzanas, R.; Gutiérrez, J.M. Configuration and intercomparison of deep learning neural models for statistical downscaling. Geoscientific Model Development 2020, 13, 2109–2124.
- Wang, F.; Lu, S.; Hu, P.; Li, J.; Zhao, L.; Lehrter, J.C. Deep Learning for Daily Precipitation and Temperature Downscaling. Water Resources Research 2021, 57, e2020WR028699.
- Quesada-Chacón, D.; Barfus, K.; Bernhofer, C. Downscaling CORDEX through deep learning to daily 1 km multivariate ensemble in complex terrain. Earth’s Future 2023, 11, e2023EF003531.
- Leinonen, J.; Nerini, D.; Berne, A. Stochastic Super-Resolution for Downscaling Time-Evolving Atmospheric Fields with a GAN. In Proceedings of the ECML/PKDD Workshop on ClimAI (PMLR), 2020.
- Price, I.; Rasp, S. Increasing the Accuracy and Resolution of Precipitation Forecasts Using Deep Generative Models. In Proceedings of AISTATS (PMLR), 2022.
- Tomasi, E.; Franch, G.; Cristoforetti, M. Can AI be enabled to perform dynamical downscaling? A latent diffusion model to mimic kilometer-scale COSMO5.0_CLM9 simulations. Geoscientific Model Development 2025, 18, 2051–2078.
- Srivastava, P.; El Helou, A.; Vilalta, R.; Li, H.W.; Kumar, V.; Mandt, S. Precipitation Downscaling with Spatiotemporal Video Diffusion. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024); 2024; pp. 19327–19340.
- Liu, Y.; Doss-Gollin, J.; Balakrishnan, G.; Veeraraghavan, A. Generative Precipitation Downscaling using Score-based Diffusion with Wasserstein Regularization. arXiv preprint arXiv:2410.00381, 2024.
- Curran, D.; Saleem, H.; Hobeichi, S.; Salim, F.D. Resolution-Agnostic Transformer-based Climate Downscaling. arXiv preprint arXiv:2411.14774, 2024.
- Pathak, J.; Subramanian, S.; Harrington, P.; Raja, S.; Chattopadhyay, A.; Mardani, M.; Kurth, T.; Hall, D.; Li, Z.; Azizzadenesheli, K.; et al. FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators. arXiv preprint arXiv:2202.11214, 2022.
- Schmude, J.; Roy, S.; Trojak, W.; Jakubik, J.; Civitarese, D.S.; Singh, S.; Kuehnert, J.; Ankur, K.; Gupta, A.; Phillips, C.E.; et al. Prithvi WxC: Foundation model for weather and climate. arXiv preprint arXiv:2409.13598, 2024.
- Kumar, R.; Sharma, T.; Vaghela, V.; Jha, S.K.; Agarwal, A. PrecipFormer: Efficient Transformer for Precipitation Downscaling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), February 2025, pp. 489–497.
- Beucler, T.; Rasp, S.; Pritchard, M.; Gentine, P. Achieving conservation of energy in neural network emulators for climate modeling. arXiv preprint arXiv:1906.06622, 2019.
- Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 2019, 378, 686–707.
- Harder, P.; Jha, S.; Rolnick, D. Hard-Constrained Deep Learning for Climate Downscaling. Journal of Machine Learning Research 2022, 23, 1–38.
- Prasad, A.; Harder, P.; Yang, Q.; Sattegeri, P.; Szwarcman, D.; Watson, C.; Rolnick, D. Evaluating the transferability potential of deep learning models for climate downscaling. arXiv preprint arXiv:2407.12517, 2024.
- Legasa, M.N.; Manzanas, R.; Gutiérrez, J.M. Assessing Three Perfect Prognosis Methods for Statistical Downscaling of Climate Change Precipitation Scenarios. Geophysical Research Letters 2023, 50, e2022GL102267.
- Rampal, N.; Gibson, P.B.; Sherwood, S.; Abramowitz, G. On the extrapolation of generative adversarial networks for downscaling precipitation extremes in warmer climates. Geophysical Research Letters 2024, 51, e2024GL112492.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 2020, 33, 6840–6851.
- Vandal, T.; Kodra, E.; Ganguly, A.R. Intercomparison of machine learning methods for statistical downscaling: the case of daily and extreme precipitation. Theoretical and Applied Climatology 2019, 137, 607–629.
- Wood, A.W.; Leung, L.R.; Sridhar, V.; Lettenmaier, D.P. Hydrologic implications of dynamical and statistical approaches to downscaling climate model outputs. Climatic Change 2004, 62, 189–216.
- Pierce, D.W.; Cayan, D.R.; Thrasher, B.L. Statistical downscaling using localized constructed analogs (LOCA). Journal of Hydrometeorology 2014, 15, 2558–2585.
- Maurer, E.P.; Ficklin, D.L.; Wang, W. The impact of spatial scale in bias correction of climate model output for hydrologic impact studies. Hydrology and Earth System Sciences 2016, 20, 685–696.
- Tripathi, S.; Srinivas, V.V.; Nanjundiah, R.S. Downscaling of precipitation for climate change scenarios: A support vector machine approach. Journal of Hydrology 2006, 330, 621–640.
- He, X.; Chaney, N.W.; Schleiss, M.; Sheffield, J. Spatial downscaling of precipitation using adaptable random forests. Water Resources Research 2016, 52, 8217–8237.
- Legasa, M.; Manzanas, R.; Calviño, A.; Gutiérrez, J.M. A posteriori random forests for stochastic downscaling of precipitation by predicting probability distributions. Water Resources Research 2022, 58, e2021WR030272.
- Ghosh, S. SVM-PGSL coupled approach for statistical downscaling to predict rainfall from GCM output. Journal of Geophysical Research: Atmospheres 2010, 115.
- Cannon, A.J. Quantile regression neural networks: Implementation in R and application to precipitation downscaling. Computers & Geosciences 2011, 37, 1277–1284.
- Misra, S.; Sarkar, S.; Mitra, P. Statistical downscaling of precipitation using long short-term memory recurrent neural networks. Theoretical and Applied Climatology 2018, 134, 1179–1196.
- Miao, Q.; Pan, B.; Wang, H.; Hsu, K.; Sorooshian, S. Improving monsoon precipitation prediction using combined CNN-LSTM. Water 2019, 11, 2519.
- Anh, D.T.; Bae, D.J.; Jung, K. Downscaling rainfall using deep learning LSTM and feedforward neural networks. International Journal of Climatology 2019, 39, 2502–2518.
- Vaughan, A.; Adamson, H.; Tak-Chu, L.; Turner, R.E. Convolutional conditional neural processes for local climate downscaling. arXiv preprint arXiv:2101.07857, 2021.
- Vandal, T.; Kodra, E.; Ganguly, S.; Michaelis, A.; Nemani, R.R.; Ganguly, A.R. DeepSD: Generating High Resolution Climate Change Projections through Single Image Super-Resolution. In Proceedings of IJCAI, 2018.
- Baño-Medina, J.; Manzanas, R.; Gutiérrez, J.M. On the suitability of deep convolutional neural networks for continental-wide downscaling of climate change projections. Climate Dynamics 2021, 57, 2941–2951.
- Wang, F.; Tian, D.; Lowe, L.J.; Kalin, L.; Lehrter, J. Deep Learning for Daily Precipitation and Temperature Downscaling. Water Resources Research 2021, 57, e2020WR029308.
- Soares, P.M.M.; Johannsen, F.; Lima, D.C.A.; Lemos, G.; Bento, V.A.; Bushenkova, A. High-resolution downscaling of CMIP6 Earth system and global climate models using deep learning for Iberia. Geoscientific Model Development 2024, 17, 229–257.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015; Springer International Publishing, 2015; pp. 234–241.
- Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (DLMIA 2018); Springer, 2018; pp. 3–11.
- Liu, J.; Shi, C.; Ge, L.; Tie, R.; Chen, X.; Zhou, T.; Gu, X.; Shen, Z. Enhanced Wind Field Spatial Downscaling Method Using UNET Architecture and Dual Cross-Attention Mechanism. Remote Sensing 2024, 16, 1867.
- Pasula, A.; Subramani, D.N. Global Climate Model Bias Correction Using Deep Learning. arXiv preprint arXiv:2504.19145, 2025.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
- Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1132–1140.
- Kang, M.; Shin, J.; Park, J. StudioGAN: a taxonomy and benchmark of GANs for image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023, 45, 15725–15742.
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8110–8119.
- Accarino, G.; De Rubeis, T.D.; Falcucci, G.; Ubaldi, E.; Aloisio, G. MSG-GAN-SD: A Multi-Scale Gradients GAN for Statistical Downscaling of 2-Meter Temperature over the EURO-CORDEX Domain. AI 2021, 2, 603–619.
- Iotti, M.; Davini, P.; von Hardenberg, J.; Zappa, G. RainScaleGAN: a Conditional Generative Adversarial Network for Rainfall Downscaling. Artificial Intelligence for the Earth Systems 2025.
- National Renewable Energy Laboratory. Sup3rCC: Super-Resolution for Renewable Energy Resource Data With Climate Change Impacts. https://www.nrel.gov/analysis/sup3rcc, n.d. Accessed 27 May 2025.
- Stengel, K.A.; Glaws, A.; Hettinger, D.; King, R.N. Adversarial super-resolution of climatological wind and solar data. Proceedings of the National Academy of Sciences 2020, 117, 16805–16815.
- Glawion, L.; Polz, J.; Kunstmann, H.; Fersch, B.; Chwala, C. Global spatio-temporal ERA5 precipitation downscaling to km and sub-hourly scale using generative AI. npj Climate and Atmospheric Science 2025, 8, 219.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695.
- Berry, L.; Brando, A.; Meger, D. Shedding light on large generative networks: Estimating epistemic uncertainty in diffusion models. In Proceedings of the 40th Conference on Uncertainty in Artificial Intelligence, 2024.
- Pérez, A.; Santa Cruz, M.; San Martín, D.; Gutiérrez, J.M. Transformer-based super-resolution downscaling for regional reanalysis: Full domain vs tiling approaches. arXiv preprint arXiv:2410.12728, 2024.
- Yang, F.; Ye, Q.; Wang, K.; Sun, L. Successful Precipitation Downscaling Through an Innovative Transformer-Based Model. Remote Sensing 2024, 16, 4292.
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Computation 1997, 9, 1735–1780.
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Advances in Neural Information Processing Systems 28 (NIPS 2015); 2015; pp. 802–810.
- Miao, Q.; Liu, Y.; Liu, T.; Sorooshian, S. Improving Monsoon Precipitation Prediction Using Combined Convolutional and Long Short Term Memory Neural Network. Water 2019, 11, 717.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Guyon, I.; Luxburg, U.V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; Garnett, R., Eds.; Curran Associates, Inc., 2017; pp. 5998–6008.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929, 2020.
- Zhong, X.; Du, F.; Chen, L.; Wang, Z.; Li, H. Investigating transformer-based models for spatial downscaling and correcting biases of near-surface temperature and wind-speed forecasts. Quarterly Journal of the Royal Meteorological Society 2024, 150, 275–289.
- Sinha, S.; Benton, B.; Emami, P. On the effectiveness of neural operators at zero-shot weather downscaling. Environmental Data Science 2025, 4, e21.
- Wang, X.; Choi, J.Y.; Kurihaya, T.; Lyngaas, I.; Yoon, H.J.; Fan, M.; Nafi, N.M.; Tsaris, A.; Aji, A.M.; Hossain, M.; et al. ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling. arXiv preprint arXiv:2505.04802, 2025.
- Shi, J.; Shirali, A.; Jin, B.; Zhou, S.; Hu, W.; Rangaraj, R.; Wang, S.; Han, J.; Wang, Z.; Lall, U.; et al. Deep Learning and Foundation Models for Weather Prediction: A Survey. arXiv preprint arXiv:2501.06907, 2025.
- Coordinated Regional Climate Downscaling Experiment (CORDEX). Task Force on Machine Learning. https://cordex.org/strategic-activities/taskforces/task-force-on-machine-learning/, 2024. Accessed 26 May 2025.
- González-Abad, J.; Baño-Medina, J.; Gutiérrez, J.M. Using Explainability to Inform Statistical Downscaling Based on Deep Learning Beyond Standard Validation Approaches. Journal of Advances in Modeling Earth Systems 2023, 15, e2022MS003170.
- Daw, A.; Karpatne, A.; Watkins, W.D.; Read, J.S.; Kumar, V. Physics-guided neural networks (PGNN): An application in lake temperature modeling. In Knowledge Guided Machine Learning; Chapman and Hall/CRC, 2022; pp. 353–372.
- González-Abad, J.; Baño-Medina, J. Deep Ensembles to Improve Uncertainty Quantification of Statistical Downscaling Models under Climate Change Conditions. arXiv preprint arXiv:2305.00975, 2023. Accepted at the ICLR 2023 Tackling Climate Change with Machine Learning Workshop.
- Lanzante, J.R.; Dixon, K.W.; Nath, M.J.; Whitlock, C.E.; Adams-Smith, D. Some Pitfalls in Statistical Downscaling of Future Climate. Bulletin of the American Meteorological Society 2018, 100, 2235–2250.
- Xiang, L.; Hu, P.; Wang, F.; Yu, J.; Zhang, L. A Novel Reference-Based and Gradient-Guided Deep Learning Model for Daily Precipitation Downscaling. Atmosphere 2022, 13, 517.
- Schuster, G.T.; Chen, Y.; Feng, S. Review of physics-informed machine-learning inversion of geophysical data. Geophysics 2024, 89, T337–T356.
- Boateng, D.; Mutz, S.G. pyESDv1.0.1: an open-source Python framework for empirical-statistical downscaling of climate information. Geoscientific Model Development Discussions 2023, 2023, 1–58.
- Wang, Z.; Bugliaro, L.; Gierens, K.; Hegglin, M.I.; Rohs, S.; Petzold, A.; Kaufmann, S.; Voigt, C. Machine learning for improvement of upper tropospheric relative humidity in ERA5 weather model data. EGUsphere 2024, 2024, 1–28.
- Daly, C.; Halbleib, M.; Smith, J.I.; Gibson, W.P.; Doggett, M.K.; Taylor, G.H.; Curtis, J.; Pasteris, P.P. Physiographically sensitive mapping of climatological temperature and precipitation across the conterminous United States. International Journal of Climatology 2008, 28, 2031–2064.
- Herrera, S.; Cardoso, R.M.; Soares, P.M.; Espírito-Santo, F.; Viterbo, P.; Gutiérrez, J.M. Iberia01: a new gridded dataset of daily precipitation and temperatures over Iberia. Earth System Science Data 2019, 11, 1947–1971.
- Cornes, R.C.; van der Schrier, G.; van den Besselaar, E.J.M.; Jones, P.D. An Ensemble Version of the E-OBS Temperature and Precipitation Data Sets. Journal of Geophysical Research: Atmospheres 2018, 123, 9391–9409.
- Technische Universität Dresden. Regionales Klimainformationssystem Sachsen (ReKIS). https://rekis.hydro.tu-dresden.de/, 2023. Accessed 26 May 2025.
- Huffman, G.J.; Bolvin, D.T.; Braithwaite, D.; Hsu, K.; Joyce, R.J.; Kidd, C.; Nelkin, E.J.; Sorooshian, S.; Tan, J.; Xie, P. Integrated Multi-satellitE Retrievals for GPM (IMERG) Algorithm Theoretical Basis Document (ATBD), Version 06.3. Technical report, NASA Goddard Space Flight Center, 2020. Available at https://gpm.nasa.gov/resources/documents/algorithm-information/IMERG-V06-ATBD (Accessed 26 May 2025).
- Entekhabi, D.; Njoku, E.G.; O’Neill, P.E.; Kellogg, K.H.; Crow, W.T.; Edelstein, W.N.; Entin, J.K.; Goodman, S.D.; Jackson, T.J.; Johnson, J.T.; et al. The Soil Moisture Active Passive (SMAP) Mission. Proceedings of the IEEE 2010, 98, 704–716.
- Pastorello, G.; Trotta, C.; Canfora, E.; Chu, H.; Christianson, D.; Cheah, Y.W.; Poindexter, C.; Chen, J.; Elbashandy, A.; Humphrey, M.; et al. The FLUXNET2015 dataset and the ONEFlux processing pipeline for eddy covariance data. Scientific Data 2020, 7, 225.
- Sishah, S.; Abrahem, T.; Azene, G.; Dessalew, A.; Hundera, H. Downscaling and validating SMAP soil moisture using a machine learning algorithm over the Awash River basin, Ethiopia. PLoS ONE 2023, 18, e0279895.
- Quesada-Chacón, D.; Stöger, J.; Güntner, A.; Bernhofer, C. Downscaling CORDEX Through Deep Learning to Daily 1 km Multivariate Ensemble in Complex Terrain. Earth’s Future 2023, 11, e2023EF003554.
- Sha, Y.; Stull, R.; Ghafarian, P.; Ou, T.; Gultepe, I. Deep-Learning-Based Gridded Downscaling of Surface Meteorological Variables in Complex Terrain. Part I: Daily Maximum and Minimum 2-m Temperature. Journal of Applied Meteorology and Climatology 2020, 59, 2057–2075.
- Sarafanov, M.; Kazakov, E.; Nikitin, N.O.; Kalyuzhnaya, A.V. A Machine Learning Approach for Remote Sensing Data Gap-Filling with Open-Source Implementation: An Example Regarding Land Surface Temperature, Surface Albedo and NDVI. Remote Sensing 2020, 12, 3865.
- Roberts, N.M.; Lean, H.W. Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Monthly Weather Review 2008, 136, 78–97.
- Gneiting, T.; Raftery, A.E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 2007, 102, 359–378.
- Annau, N.J.; Cannon, A.J.; Monahan, A.H. Algorithmic Hallucinations of Near-Surface Winds: Statistical Downscaling with GANs to Convection-Permitting Scales. Artificial Intelligence for the Earth Systems 2023, 2.
- Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems 2017, 30.
- Harris, L.; McRae, A.T.T.; Chantry, M.; Dueben, P.D.; Palmer, T.N. A Generative Deep Learning Approach to Stochastic Downscaling of Precipitation Forecasts. Journal of Advances in Modeling Earth Systems 2022, 14, e2022MS003120.
- Marzban, C.; Sandgathe, S. Verification with variograms. Weather and Forecasting 2009, 24, 1102–1120.
- Davis, C.; Brown, B.; Bullock, R. Object-based verification of precipitation forecasts. Part I: Methodology and application to mesoscale rain areas. Monthly Weather Review 2006, 134, 1772–1784.
- Davis, C.; Brown, B.; Bullock, R. Object-based verification of precipitation forecasts. Part II: Application to convective rain systems. Monthly Weather Review 2006, 134, 1785–1795.
- Huth, R.; Kyselý, J.; Pokorná, L. A GCM simulation of heat waves, dry spells, and their relationships to circulation. Climatic Change 2000, 46, 29–60.
- Mendes, D.; Marengo, J.A. Temporal downscaling: a comparison between artificial neural network and autocorrelation techniques over the Amazon Basin in present and future climate change scenarios. Theoretical and Applied Climatology 2010, 100, 413–421.
- Zolina, O.; Simmer, C.; Belyaev, K.; Gulev, S.K.; Koltermann, P. Changes in the duration of European wet and dry spells during the last 60 years. Journal of Climate 2013, 26, 2022–2047.
- Fall, C.M.N.; Lavaysse, C.; Drame, M.S.; Panthou, G.; Gaye, A.T. Wet and dry spells in Senegal: comparison of detection based on satellite products, reanalysis, and in situ estimates. Natural Hazards and Earth System Sciences 2021, 21, 1051–1069.
- Coles, S.G. An Introduction to Statistical Modeling of Extreme Values; Springer Series in Statistics; Springer-Verlag: London, 2001.
- Vissio, G.; Lembo, V.; Lucarini, V.; Ghil, M. Evaluating the performance of climate models based on Wasserstein distance. Geophysical Research Letters 2020, 47, e2020GL089385.
- Perkins, S.; Pitman, A.; Holbrook, N.J.; Mcaneney, J. Evaluation of the AR4 climate models’ simulated daily maximum temperature, minimum temperature, and precipitation over Australia using probability density functions. Journal of Climate 2007, 20, 4356–4376.
- Pall, P.; Allen, M.; Stone, D.A. Testing the Clausius–Clapeyron constraint on changes in extreme precipitation under CO2 warming. Climate Dynamics 2007, 28, 351–363.
- Hobeichi, S.; Nishant, N.; Shao, Y.; Abramowitz, G.; Pitman, A.; Sherwood, S.; Bishop, C.; Green, S. Using machine learning to cut the cost of dynamical downscaling. Earth’s Future 2023, 11, e2022EF003291.
- Doblas-Reyes, F.J.; Sörensson, A.A.; Almazroui, M.; Dosio, A.; Gutowski, W.J.; Haarsma, R.; Hamdi, R.; Hewitson, B.; et al. Linking Global to Regional Climate Change. In Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the IPCC; Cambridge University Press, 2021; pp. 1363–1512.
- Basile, S.; Crimmins, A.R.; Avery, C.W.; Hamlington, B.D.; Kunkel, K.E. Appendix 3. Scenarios and Datasets. Fifth National Climate Assessment (USGCRP), 2023.
- Harilal, N.; Bhatia, U.; Kumar, D.N. Augmented Convolutional LSTMs for Generation of High-Resolution Climate Change Projections. IEEE Access 2020, 8, 173918–173943.
- Huang, X. Evaluating Loss Functions and Learning Data Pre-Processing for Climate Downscaling Deep Learning Models. arXiv preprint arXiv:2306.11144, 2023.
- Maraun, D.; Widmann, M.; Gutierrez, J.M.; Kotlarski, S.; Chandler, R.E.; Hertig, E.; Huth, R.; Wibig, J.; Wilcke, R.A.I.; Themeßl, M.J.; et al. VALUE: A framework to validate downscaling approaches for climate change studies. Earth’s Future 2015, 3, 1–14.
- Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Computing Surveys (CSUR) 2014, 46, 1–37.
- Geirhos, R.; Jacobsen, J.H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut learning in deep neural networks. Nature Machine Intelligence 2020, 2, 665–673.
- Cavaiola, M.; Tuju, P.E.; Mazzino, A. Accurate and efficient AI-assisted paradigm for adding granularity to ERA5 precipitation reanalysis. Scientific Reports 2024.
- Hernanz, A.; Rodriguez-Camino, E.; Navascués, B.; Gutiérrez, J.M. On the limitations of deep learning for statistical downscaling of climate change projections: The transferability and the extrapolation issues. Atmospheric Science Letters 2024, 25, e1195.
- Baño-Medina, J. Understanding deep learning decisions in statistical downscaling models. In Proceedings of the 10th International Conference on Climate Informatics, 2020, pp. 79–85.
- Boulaguiem, Y.; Zscheischler, J.; Vignotto, E.; van der Wiel, K.; Engelke, S. Modeling and simulating spatial extremes by combining extreme value theory with generative adversarial networks. Environmental Data Science 2022, 1, e5.
- Lee, J.; Park, S.Y. WGAN-GP-Based Conditional GAN with Extreme Critic for Precipitation Downscaling in a Key Agricultural Region of the Northeastern U.S. IEEE Access 2025.
- Iotti, M.; Davini, P.; von Hardenberg, J.; Zappa, G. RainScaleGAN: a Conditional Generative Adversarial Network for Rainfall Downscaling. Artificial Intelligence for the Earth Systems 2025. In press.
- Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. The Journal of Machine Learning Research 2012, 13, 723–773.
- Székely, G.J.; Rizzo, M.L. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference 2013, 143, 1249–1272.
- Dutta, S.; Innan, N.; Yahia, S.B.; Shafique, M. AQ-PINNs: Attention-Enhanced Quantum Physics-Informed Neural Networks for Carbon-Efficient Climate Modeling. In Tackling Climate Change with Machine Learning Workshop, NeurIPS 2024; 2024.
- Radke, T.; Fuchs, S.; Wilms, C.; Polkova, I.; Rautenhaus, M. Explaining neural networks for detection of tropical cyclones and atmospheric rivers in gridded atmospheric simulation data. Geoscientific Model Development 2025, 18, 1017–1039.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626.
- Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 (NIPS 2017); 2017; pp. 4765–4774.
- van Zyl, C.; Ye, X.; Naidoo, R. Harnessing eXplainable artificial intelligence for feature selection in time series energy forecasting: A comparative analysis of Grad-CAM and SHAP. Applied Energy 2024, 353, 122079.
- O’Loughlin, R.J.; Li, D.; Neale, R.; O’Brien, T.A. Moving beyond post hoc explainable artificial intelligence: a perspective paper on lessons learned from dynamical climate modeling. Geoscientific Model Development 2025, 18, 787–802.
- Mamalakis, A.; Barnes, E.A.; Ebert-Uphoff, I. Investigating the fidelity of explainable artificial intelligence methods for applications of convolutional neural networks in geoscience. Artificial Intelligence for the Earth Systems 2022, 1, e220012. [Google Scholar] [CrossRef]
- Cannon, A.J. Quantile regression neural networks: Implementation in R and application to precipitation downscaling. Computers & Geosciences 2011, 37, 1277–1284. [Google Scholar] [CrossRef]
- Gupta, H.V.; Kling, H.; Yilmaz, K.K.; Martinez, G.F. Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. Journal of hydrology 2009, 377, 80–91. [Google Scholar] [CrossRef]
- Zscheischler, J.; Westra, S.; Van Den Hurk, B.J.; Seneviratne, S.I.; Ward, P.J.; Pitman, A.; AghaKouchak, A.; Bresch, D.N.; Leonard, M.; Wahl, T.; et al. Future climate risk from compound events. Nature climate change 2018, 8, 469–477. [Google Scholar] [CrossRef]
- Zscheischler, J.; Martius, O.; Westra, S.; Bevacqua, E.; Raymond, C.; Horton, R.M.; van den Hurk, B.; AghaKouchak, A.; Jézéquel, A.; Mahecha, M.D.; et al. A typology of compound weather and climate events. Nature reviews earth & environment 2020, 1, 333–347. [Google Scholar]
- Mazdiyasni, O.; AghaKouchak, A. Substantial increase in concurrent droughts and heatwaves in the United States. Proceedings of the National Academy of Sciences 2015, 112, 11484–11489. [Google Scholar] [CrossRef]
- Choi, H.; Kim, Y.; Kim, D. Enhancing Extreme Rainfall Nowcasting with Weighted Loss Functions in Deep Learning Models. EGU General Assembly 2025, EGU25-19416. Abstract available at https://meetingorganizer.copernicus.org/EGU25/EGU25-19416.html (Accessed on , 2025). 26 May.
- Vandal, T.; Kodra, E.; Gosh, S.; Gunter, L.; Gonzalez, J.; Ganguly, A.R. Statistical downscaling of global climate models with image super-resolution and uncertainty quantification. arXiv preprint arXiv:1811.03605, arXiv:1811.03605 2018, [arXiv:stat.AP/1811.03605]. Undermind ref [16].
- Addison, H.; Kendon, E.; Ravuri, S.; Aitchison, L.; Watson, P.A. Machine learning emulation of a local-scale UK climate model. arXiv preprint arXiv:2211.16116, arXiv:2211.16116 2022.
- Gerges, F.; Boufadel, M.C.; Bou-Zeid, E.; Nassif, H.; Wang, J.T.L. A Novel Bayesian Deep Learning Approach to the Downscaling of Wind Speed with Uncertainty Quantification. In Proceedings of the Advances in Knowledge Discovery and Data Mining. PAKDD 2022. Lecture Notes in Computer Science. Springer, Cham, Vol. 13281; 2022; pp. 55–66. [Google Scholar] [CrossRef]
- Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML); Balcan, M.F.; Weinberger, K.Q., Eds.; PMLR, 2016, Vol. 48, pp. 1050–1059.
- Gerges, F.; Boufadel, M.C.; Bou-Zeid, E.; Nassif, H.; Wang, J.T.L. Bayesian Multi-Head Convolutional Neural Networks with Bahdanau Attention for Forecasting Daily Precipitation in Climate Change Monitoring. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part V; Cerquitelli, T.; Monreale, A.; Mikut, R.; Moccia, S.; Raedt, L.D., Eds. Springer, 2022, Vol. 13717, Lecture Notes in Computer Science, pp. 416–431. [CrossRef]
- Merkel, D. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux Journal 2014, 2014, 2. [Google Scholar]
- Cohen, J.; Cohen, P.; West, S.G.; Aiken, L.S. Applied multiple regression/correlation analysis for the behavioral sciences; Routledge, 2013.
- Düsterhus, A.; Hense, A. Advanced information criterion for environmental data quality assurance. Advances in Science and Research 2012, 8, 99–104. [Google Scholar] [CrossRef]
- Kling, H.; Fuchs, M.; Paulin, M. Runoff conditions in the upper Danube basin under an ensemble of climate change scenarios. Journal of Hydrology 2012, 424-425, 264–277. [Google Scholar] [CrossRef]
- Mahoney, M.J.; Johnson, L.K.; Silge, J.; Frick, H.; Kuhn, M.; Beier, C.M. Assessing the performance of spatial cross-validation approaches for models of spatially structured data. arXiv preprint arXiv:2303.07334, 2023.
- Brogli, R.; Heim, C.; Mensch, J.; Sørland, S.L.; Schär, C. The pseudo-global-warming (PGW) approach: methodology, software package PGW4ERA5 v1.1, validation, and sensitivity analyses. Geoscientific Model Development 2023, 16, 907–926. [Google Scholar] [CrossRef]
- Climate Change AI. Data Gaps (Beta). https://www.climatechange.ai/dev/datagaps, n.d. Accessed: 27 May 2025.
- World Climate Research Programme. WCRP Grand Challenges (ended in 2022). https://www.wcrp-climate.org/component/content/category/26-grand-challenges, 2022. Accessed: 13 August 2025. Official community theme summary page.
- Beucler, T.; Pritchard, M.; Rasp, S.; Ott, J.; Baldi, P.; Gentine, P. Enforcing analytic constraints in neural networks emulating physical systems. Physical review letters 2021, 126, 098302. [Google Scholar] [CrossRef] [PubMed]
- American Bar Association. Climate Change and Responsible AI Affect Cybersecurity and Digital Privacy Conflicts. SciTech Lawyer 2025, Spring. [Google Scholar]
- Savannah Software Solutions. The Role of AI in Climate Modeling: Exploring How Artificial Intelligence is Improving Predictions and Responses to Climate Change. https://savannahsoftwaresolutions.co.ke/the-role-of-ai-in-climate-modeling-exploring-how-artificial-intelligence-is-improving-predictions-and-responses-to-climate-change/, n.d. Accessed: 27 May 2025.
- Sustainability-Directory.com. AI Bias in Equitable Climate Solutions. https://sustainability-directory.com/question/ai-bias-equitable-climate-solutions/, n.d. Accessed: 27 May 2025.
- Amnuaylojaroen, T. Advancements and challenges of artificial intelligence in climate modeling for sustainable urban planning. Frontiers in Artificial Intelligence 2025, 8, 1517986. [Google Scholar] [CrossRef]
- CORDEX Scientific Advisory Team. CORDEX experiment design for dynamical downscaling of CMIP6; WCRP, 2021.
- Sørland, S.L.; Schär, C.; Lüthi, D.; Kjellström, E. Bias patterns and climate change signals in GCM-RCM model chains. Environmental Research Letters 2018, 13, 074017. [Google Scholar] [CrossRef]
- Diez-Sierra, J.; Iturbide, M.; Gutiérrez, J.M.; Fernández, J.; Milovac, J.; Cofiño, A.S.; Cimadevilla, E.; Nikulin, G.; Levavasseur, G.; Kjellström, E.; et al. The worldwide C3S CORDEX grand ensemble: A major contribution to assess regional climate change in the IPCC AR6 Atlas. Bulletin of the American Meteorological Society 2022, 103, E2804–E2826. [Google Scholar] [CrossRef]
- Hawkins, E.; Sutton, R. The potential to narrow uncertainty in regional climate predictions. Bulletin of the American Meteorological Society 2009, 90, 1095–1108. [Google Scholar] [CrossRef]
- Hawkins, E.; Sutton, R. The potential to narrow uncertainty in projections of regional precipitation change. Climate Dynamics 2011, 37, 407–418. [Google Scholar] [CrossRef]
- Bhardwaj, T. Climate justice hangs in the balance: will AI divide or unite the planet? Down To Earth 2025. [Google Scholar]
- Jacob, D.; et al. Co-production of climate services: Challenges and enablers. Frontiers in Climate 2025, 7, 1507759. [Google Scholar] [CrossRef]
- World Meteorological Organization. State of Climate Services 2024. https://wmo.int/publication-series/2024-state-of-climate-services, 2024. Assesses global climate services capacity and gaps.
- González-Abad, J.; Baño-Medina, J. Deep Ensembles to Improve Uncertainty Quantification of Statistical Downscaling Models under Climate Change Conditions. arXiv preprint arXiv:2305.00975, 2023.
- UNDP Climate Promise. What are climate misinformation and disinformation and how can we tackle them? https://climatepromise.undp.org/news-and-stories/what-are-climate-misinformation-and-disinformation-and-how-can-we-tackle-them, n.d. Accessed: 27 May 2025.
- EY. AI and sustainability: Opportunities, challenges and impact. https://www.ey.com/en_nl/insights/climate-change-sustainability-services/ai-and-sustainability-opportunities-challenges-and-impact, n.d. Accessed: 27 May 2025.
- Giorgi, F.; Jones, C.; Asrar, G.R.; et al. Addressing climate information needs at the regional level: the CORDEX framework. World Meteorological Organization (WMO) Bulletin 2009, 58, 175. [Google Scholar]
- Eyring, V.; Bony, S.; Meehl, G.A.; Senior, C.A.; Stevens, B.; Stouffer, R.J.; Taylor, K.E. Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geoscientific Model Development 2016, 9, 1937–1958. [Google Scholar] [CrossRef]
- WeAdapt. Justice and equity in climate change adaptation: overview of an emerging agenda. https://weadapt.org/knowledge-base/gender-and-social-equality/justice-and-equity-in-climate-change-adaptation-overview-of-an-emerging-agenda/, n.d. Accessed: 27 May 2025.




| Category | Metric | Description & Use Case | When to Use | Key Refs |
|---|---|---|---|---|
| Pixel-wise Accuracy | RMSE / MAE | Root Mean Squared Error / Mean Absolute Error. Standard metrics for average error, but can be misleading for skewed distributions (e.g., precipitation) and penalize realistic high-frequency variations. | Standard baseline, but use with caution; supplement with other metrics. | [11] |
| Spatial Structure | Structural Similarity (SSIM) | Measures perceptual similarity between images based on luminance, contrast, and structure. Better than RMSE for assessing preservation of spatial patterns. | To evaluate preservation of spatial patterns and textures. | [55] |
| | Power Spectral Density (PSD) | Compares the variance at different spatial frequencies. Crucial for diagnosing overly smooth outputs (loss of high-frequency power) or GAN-induced artifacts (spurious power). | To diagnose smoothing or unrealistic high-frequency noise. | [98,100] |
| | Variogram Analysis | Geostatistical tool that quantifies spatial correlation as a function of distance. Comparing nugget, sill, and range diagnoses noise, variance suppression, and incorrect spatial correlation length. | To quantitatively assess spatial dependency structure and diagnose over-smoothing. | [101] |
| | Method for Object-based Diagnostic Evaluation (MODE) | Identifies and compares attributes (e.g., area, location, orientation, intensity) of distinct objects (e.g., storms). Provides diagnostic information on specific spatial biases beyond grid-point errors. | For detailed diagnostic evaluation of precipitation fields, avoiding the "double penalty" issue. | [102,103] |
| Temporal Coherence | Temporal Autocorrelation | Measures the correlation of a time series with itself at a given lag (e.g., lag-1 for daily data). Assesses the model’s ability to reproduce temporal persistence or "memory". | To diagnose unrealistic temporal "flickering" or lack of persistence in time series. | [104,105] |
| | Wet/Dry Spell Characteristics | Quantifies the statistics of consecutive days above/below a threshold (e.g., 1 mm/day for precipitation). Key metrics include mean/max spell duration, frequency, and cumulative intensity. | Essential for impact studies related to droughts and floods; evaluates temporal clustering of events. | [106,107] |
| Extreme Events | Fraction Skill Score (FSS) | A neighborhood-based verification metric that assesses the skill of forecasting events exceeding a certain threshold across different spatial scales. Mitigates the "double penalty" issue. | Essential for verifying precipitation fields at specific thresholds. | [96,100] |
| | Quantile-based scores (e.g., 99th percentile error) | Directly evaluates the accuracy of specific quantiles (e.g., p95, p99), focusing on performance in the tails of the distribution. | To specifically quantify performance on rare, high-impact events. | [40] |
| | Return Level / Period Consistency | Compares the magnitude of extreme events for given return periods (e.g., the 1-in-100-year event) between the downscaled output and observations, often using Extreme Value Theory. | For climate impact studies where long-term risk from extremes is key. | [108] |
| Distributional Similarity | Wasserstein Distance (Earth Mover’s Distance) | Measures the "work" required to transform one probability distribution into another. A robust measure of similarity between the full distributions of the downscaled and reference data. | For a rigorous comparison of the entire statistical distribution. | [20,109] |
| | CRPS (Continuous Ranked Probability Score) | For probabilistic forecasts, measures the integrated squared difference between the predicted cumulative distribution function (CDF) and the observed value. A proper scoring rule that generalizes MAE. | Gold standard for evaluating probabilistic/ensemble forecast skill. | [97,100] |
| | Perkins Skill Score (PSS) | Measures the common overlapping area between two probability density functions (PDFs). An intuitive, distribution-agnostic metric of overall distributional similarity. | To provide a robust, integrated score of distributional overlap, common in climate model evaluation. | [110] |
| Uncertainty Quantification (UQ) | Reliability Diagram | Plots observed frequencies against forecast probabilities for binned events to assess calibration. A perfectly calibrated model lies on the diagonal. | To assess if forecast probabilities are statistically reliable. | [100] |
| | PIT Histogram | Probability Integral Transform. For a calibrated ensemble, the PIT values of the observations should be uniformly distributed. Deviations indicate biases or incorrect spread. | To diagnose issues with ensemble spread and bias. | [100] |
| Physical Consistency | Conservation Error | Directly measures the violation of a conservation law (e.g., mass, energy) by comparing the aggregated high-resolution output to the coarse-resolution input value. | When conservation of a quantity is a critical physical constraint. | [25] |
| | Multivariate Correlations | Assesses whether the physical relationships and correlations between different downscaled variables (e.g., temperature and humidity) are preserved realistically. | Essential for multi-variable downscaling to ensure physical coherence. | [9] |
| | Clausius-Clapeyron Scaling | Verifies if the intensity of extreme precipitation scales with temperature at the physically expected rate (~7%/°C). Tests if the model has learned a fundamental thermodynamic relationship. | Critical for assessing the credibility of future projections of extremes under warming. | [111] |
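Several of the tabulated checks reduce to a few lines of array arithmetic. The sketch below is illustrative only (function names, bin counts, and the mean-based aggregation rule are our own choices, not taken from any cited package): it computes the Perkins Skill Score as the overlapping area of two binned empirical PDFs, and a conservation error that compares each coarse cell against the mean of the high-resolution cells it covers. For variables where the coarse value is a sum rather than a mean, the aggregation in `conservation_error` would change accordingly.

```python
import numpy as np

def perkins_skill_score(obs, model, bins=50):
    """Overlapping area of two empirical PDFs: 1 = identical, 0 = disjoint."""
    lo = min(obs.min(), model.min())
    hi = max(obs.max(), model.max())
    edges = np.linspace(lo, hi, bins + 1)
    p_obs, _ = np.histogram(obs, bins=edges)
    p_mod, _ = np.histogram(model, bins=edges)
    p_obs = p_obs / p_obs.sum()   # normalize counts to probabilities
    p_mod = p_mod / p_mod.sum()
    return float(np.minimum(p_obs, p_mod).sum())

def conservation_error(coarse, fine, factor):
    """Mean absolute conservation violation, assuming each coarse cell
    should equal the mean of the factor x factor fine cells it covers."""
    ny, nx = coarse.shape
    blocks = fine.reshape(ny, factor, nx, factor).mean(axis=(1, 3))
    return float(np.abs(blocks - coarse).mean())

# Tiny example with synthetic, precipitation-like (skewed) data
rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 3.0, size=10_000)
mod = rng.gamma(2.0, 3.0, size=10_000)
pss = perkins_skill_score(obs, mod)       # near 1: same underlying distribution

coarse = rng.normal(size=(4, 4))
fine = np.kron(coarse, np.ones((8, 8)))   # perfectly conservative upsampling
err = conservation_error(coarse, fine, 8) # ~0 by construction
```

Because PSS operates on normalized histograms, it is insensitive to sample size but sensitive to bin width; in practice the bin choice should follow the observational product's resolution of the variable.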
| Architecture | Key Mechanisms/ Characteristics | Strengths in Downscaling | Limitations/ Weaknesses | UQ Capabilities / Robustness to Non-Stat. & Extremes | Typical Climate Variables | Typical Input Res. | Typical Output Res. | Key Refs |
|---|---|---|---|---|---|---|---|---|
| SVM (Support Vector Machines) | Kernel-based supervised learning; finds optimal hyperplane in transformed feature space; can use nonlinear kernels for complex relationships. | Performs well with limited data; robust to high-dimensional predictor spaces; strong baseline for PP downscaling. | Choice of kernel and hyperparameters critical; may underperform on highly non-stationary or extreme events; less scalable to massive training datasets. | UQ typically via bootstrapping or ensembles; deterministic by default; robustness to non-stationarity depends on training sample diversity. | Precip, Temp | GCM scale (e.g., 50–250 km) | Station/grid scale | [36,39] |
| Random Forests (RF, AP-RF, Prec-DWARF) | Ensemble of decision trees trained on bootstrap samples; output is mean/majority vote; AP-RF extends with predictive distribution outputs. | Handles nonlinear predictor–predictand relationships; naturally ranks predictor importance; AP-RF produces stochastic samples. | May smooth fine-scale details; bias in extremes without specialized treatment; interpretability less direct than single trees. | Yes for AP-RF (predictive distribution via gamma parameters); deterministic for standard RF; moderate robustness to non-stationarity if trained on diverse climates. | Precip | 0.25°–1° | 0.125° / site-level | [37,38] |
| CNN (SRCNN, U-Net, ResNet) | Convolutional layers, pooling, shared weights. U-Net: encoder–decoder w/ skip connections. ResNet: residual blocks. | Spatial feature extraction, pattern recognition; U-Nets preserve fine details; ResNets enable deeper learning. | Overfitting; extrapolation issues; can be overly smooth under MSE loss; plain CNNs struggle with depth. | UQ via ensembles; robustness to non-stationarity often limited without targeted strategies (e.g. PGW training). Standard CNNs may smooth extremes unless using specialized losses or architectures. | Temp, Precip, Wind, Solar Rad. | 25–250 km | 1–25 km | [10,12,44,46,49,122] |
| GAN (CGAN, MSG-GAN, evtGAN, Sup3rCC) | Generator and Discriminator trained adversarially. Conditional GANs (CGANs) use input conditions. Sup3rCC uses GANs to learn and inject spatio-temporal features from historical high-res data into coarse GCM outputs for renewable energy resource variables. | Perceptually realistic outputs, sharp details, better extreme event statistics, spatial variability. Sup3rCC provides high-resolution (4km hourly) realistic data for wind, solar, temp, humidity, pressure, tailored for energy system analysis and computationally efficient compared to dynamical downscaling. | Training instability (mode collapse), difficult evaluation, potential artifacts, may not capture the full statistical distribution. Sup3rCC does not represent specific historical weather events, but historical/future climate conditions, and does not reduce GCM uncertainty. | UQ via ensembles, but it can be challenging to calibrate. Potential for better extreme event generation. Robustness to Non-Stationarity is an active research area; can learn spurious correlations if not carefully designed/trained. Sup3rCC aims for physically realistic outputs by learning from historical data. | Temp, Precip, Wind, Solar Rad. Sup3rCC specialized for renewable energy variables (wind, solar, temp, humidity, pressure). | GCM scale (e.g., 25–100 km) | 1–12 km. Sup3rCC: 4km hourly. | [16,58,59,60,98,123,124,125] |
| LSTM / ConvLSTM | Recurrent memory cells (LSTM); ConvLSTM embeds convolutions into gates. | Captures long-range temporal dependencies; suitable for sequence modeling; CNN–LSTM hybrids. | High complexity; ConvLSTM outperforms pure LSTM on spatial data; very long-range spatial dependencies can be limited. | UQ via ensembles or Bayesian RNNs; can model temporal non-stationarity if reflected in training data but may struggle with unseen future shifts and rare extremes without augmentation. | Precip, Runoff, other time-evolving vars. | Gridded time series | Gridded time series | [43,68,69,115] |
| Transformer (ViT, PrecipFormer, etc.) | Self-attention for global context; captures long-range spatio-temporal interactions. | Excellent at modeling long-range dependencies; strong transfer potential, especially in hybrid architectures. | Quadratic attention cost (being mitigated by sparse/linearized variants); relatively new in downscaling; large data requirements. | UQ via attention-weighted ensembles; promising for non-stationarity when pre-trained on diverse climates; attention can focus on localized antecedent signatures of extremes, aiding detection though not guaranteeing tail magnitude accuracy. | Temp, Precip, Wind, multiple vars. | Various (e.g., 50 km, 250 km) | Various (e.g., 0.9 km, 7 km, 25 km) | [21,23,24,65,66,71] |
| Diffusion Model (LDM, STVD) | Iterative denoising process; LDMs operate in latent space. | High-quality, diverse samples; stable training; explicit probabilistic outputs; good spatial detail. | Computationally intensive (though LDMs mitigate cost); relatively nascent for downscaling; slow sampling. | Excellent UQ via learned distributions and ensemble generation; promising for capturing tail behavior and fine-grained spatial detail of extremes; robustness to non-stationarity is an active research area, but shows potential when trained on diverse climate data. | Temp, Precip, Wind | 100–250 km | 2–10 km | [18,19,20,31,63] |
| Multi-task Foundation Models (e.g., Prithvi-WxC, FourCastNet, ORBIT-2) | Large pre-trained (often Transformer-based) models fine-tuned for downscaling. | Zero/few-shot potential; multi-variable support; leverage extensive pre-training. | Very high pre-training cost; uncertain generalization to new locales/tasks without adaptation; bias propagation risks. | UQ via large-ensemble sampling; pre-training on diverse climates can enhance robustness to non-stationarity and extremes, but careful domain adaptation is essential. | Multiple vars | Coarse GCM / Reanalysis | Fine (task-dependent) | [22,74] |
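Across the UQ column, most entries share one operational pattern: draw an ensemble (deep ensemble members, MC-dropout passes, or GAN/diffusion samples) and score it with a proper scoring rule. A minimal, library-agnostic sketch (the function name is ours) evaluates such an ensemble with the standard sample-based CRPS estimator, CRPS = E|X − y| − ½·E|X − X′|, which reduces to the absolute error when the ensemble collapses to a single member.

```python
import numpy as np

def ensemble_crps(ens, y):
    """Sample-based CRPS for one scalar observation y and an ensemble
    ens of shape (n_members,). Lower is better; equals |x - y| for a
    one-member ensemble."""
    ens = np.asarray(ens, dtype=float)
    term1 = np.abs(ens - y).mean()                          # E|X - y|
    term2 = 0.5 * np.abs(ens[:, None] - ens[None, :]).mean()  # 0.5 E|X - X'|
    return float(term1 - term2)

rng = np.random.default_rng(1)
y = 0.0
sharp = rng.normal(0.0, 0.5, size=200)  # well-centred, sharp ensemble
wide = rng.normal(0.0, 3.0, size=200)   # over-dispersed ensemble

crps_sharp = ensemble_crps(sharp, y)
crps_wide = ensemble_crps(wide, y)      # larger: CRPS rewards sharpness
```

The same scorer applies unchanged whether the members come from independently trained networks, stochastic forward passes, or generative samples, which is why CRPS is a convenient common currency for comparing the UQ capabilities listed above.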
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
