Preprint
Article

This version is not peer-reviewed.

Satellite-Based Daily Precipitation Prediction in Tropical Regions Using Functional Generalized Additive Mixed Models: A Case Study in Colombia

Submitted:

20 May 2026

Posted:

21 May 2026

You are already at the latest version

Abstract
Accurate daily precipitation prediction in data-scarce tropical regions remains a critical challenge for climate monitoring, agriculture, and public health. Satellite products such as CHIRPS offer broad spatial coverage but exhibit systematic biases relative to ground-based observations—particularly in complex terrain under bimodal tropical regimes influenced by ENSO. We propose a Functional Generalized Additive Mixed Model (FGAMM) that corrects CHIRPS-derived precipitation estimates by treating the annual accumulated precipitation curve as a functional response and the satellite accumulation curve as a functional covariate, while incorporating station-level random effects and the Southern Oscillation Index. Applied to 62 IDEAM stations in the Valle del Cauca department of Colombia (2012–2020), the FGAMM achieves a mean cross-validation RMSE of 0.68 mm/day (95% bootstrap CI: 0.61–0.75), outperforming linear regression, SVM, and Random Forest by a factor of more than four; the performance gap is statistically significant across all competing methods. Because CHIRPS provides near-global daily coverage from 1981 to the present, the methodology is directly transferable to any tropical or subtropical region with a sparse reference station network, including areas of Latin America, sub-Saharan Africa, and South Asia where station density is similarly limited.
Keywords: 
;  ;  ;  ;  ;  ;  ;  ;  ;  

1. Introduction

Precipitation is the primary driver of the tropical water cycle and a key determinant of climate variability across intertropical regions. In bimodal tropical regimes such as those of northwestern South America, the interannual modulation of rainfall by El Niño–Southern Oscillation (ENSO) translates directly into extreme events—floods during La Niña and severe droughts during El Niño—with cascading impacts on agriculture, water resources, and vector-borne disease dynamics [1]. Reliable daily precipitation estimates are therefore essential for regional climate monitoring, yet their production in data-scarce tropical areas depends critically on the ability to correct systematic biases in satellite-derived products using sparse ground station networks.
Precipitation data is usually obtained from measurements of weather stations on land. These stations are installed according to accessibility and safety criteria rather than suitability criteria such as spatial representativity [2]. As a result, the data they provide may lack the necessary density and representativeness properties. This limitation is especially acute in tropical mountain regions, where orographic gradients produce rainfall contrasts over short distances that sparse networks cannot resolve. Due to their limited spatial coverage, there is a need to develop methodologies to predict precipitation over both space and time.
Various authors have explored precipitation modelling using diverse approaches and methods for predictive purposes, focusing on both spatial and temporal prediction. Nychka and Cressie [3] introduced statistical techniques for analysing climatic data and characterising precipitation variability through the estimation of covariance functions and exceedance probabilities using the generalised least squares method. Künsch and Papritz [4] proposed an alternative method based on the truncated normal distribution to model precipitation, offering a statistical framework for analysing and forecasting precipitation patterns from observed data. Brillinger [5] provided a comprehensive overview of statistical methodologies employed in climatic research, encompassing precipitation modelling and prediction. Wikle and Cressie [6] introduced a dimension-reduced approach to space–time Kalman filtering for analysing spatiotemporal data including precipitation. Paciorek and Schervish [7] presented an approach to spatial data modelling utilising non-stationary covariance functions, enabling the capture of spatial variability in precipitation and related climatic phenomena. Xu and Singh [8] conducted a review of regional water resource assessment models under both stationary and changing climatic conditions. Hammerling and Zidek [9] introduced non-stationary spatial covariance models tailored for large datasets, facilitating the analysis of climatic data and the modelling of precipitation variability. Primo [10] described the application of non-stationary statistical downscaling techniques to model daily precipitation, providing a methodology to improve the accuracy of precipitation forecasts at the regional scale.
The primary constraint of these investigations is that predictions predominantly rely on a limited number of measurements collected from ground stations, which poses significant challenges to their applicability in regions with suboptimal spatial data coverage. Remote sensors such as geostationary satellites offer an alternative means of obtaining precipitation data with greater coverage and higher temporal resolution [11]. Satellite data exhibit a variety of spatial and temporal resolutions, with many datasets providing suitable spatial resolutions (ranging from 1 to 5 km) and temporal frequencies (typically daily) [12,13]. However, satellite information may not exactly coincide with ground-based measurements [14,15], and systematic biases require correction before satellite products can be used reliably for local hydrological applications.
Several approaches have been proposed to correct satellite precipitation biases. Classical statistical methods, such as quantile mapping and linear scaling, have been widely applied to CHIRPS and related products [16,17]. More recently, machine learning techniques including random forests, support vector regression, and deep neural networks have shown superior capacity to capture non-linear bias structures in tropical regions [18,19]. In Colombia specifically, Ocampo-Marulanda et al. [20] demonstrated systematic CHIRPS overestimation in the Pacific coastal region and underestimation in heavy-rainfall Andean zones, while López-Bermeo et al. [17] applied quantile mapping correction to the Upper Cauca River Basin—a region adjacent to Valle del Cauca—showing that bias correction substantially improves CHIRPS performance across all spatiotemporal scales, with degradation above 2,000 m a.s.l. consistent with our findings. In contrast to these scalar or distributional correction approaches, the FGAMM proposed here corrects the full annual accumulation trajectory as a functional object, preserving the temporal structure of the precipitation signal rather than adjusting point-wise quantiles or daily totals independently.
The Valle del Cauca department of Colombia exemplifies the monitoring challenges common to humid tropical regions worldwide: a bimodal rainfall regime driven by the Inter-Tropical Convergence Zone, strong orographic gradients spanning sea level to 4,080 m a.s.l., and a station network concentrated in populated valley floors [21,22]. ENSO exerts a dominant interannual control on precipitation here, with La Niña years producing anomalously high totals and El Niño years generating drought conditions—dynamics that any bias-correction framework must capture. Satellite products such as CHIRPS reduce the spatial coverage gap, but systematic overestimation relative to ground observations means that raw satellite data cannot be used directly for local applications. To address these limitations, this study proposes a functional modelling approach that explicitly corrects CHIRPS precipitation estimates using concurrent ground-based measurements, while preserving the temporal structure of the annual accumulation curve.
In this work, we propose a flexible model to predict precipitation by integrating ground-based measurements, which provide precise information, and satellite imagery, which offers suitable spatio-temporal coverage. This approach allows prediction of rainfall at various spatial locations, including remote areas, and at any specific time. Functional data analysis (FDA) has increasingly been applied to hydroclimatic problems: Ghumman et al. [23] used FDA to compare global climate model projections of temperature and precipitation against station records in high-altitude basins, and Suhaila [24] analysed spatial and temporal patterns of daily rainfall across Malaysia using functional principal components. Most closely related to our work, Ospina et al. [25] proposed a functional concurrent regression model with spatially correlated errors for validating CHIRPS satellite rainfall against ground stations in Valle del Cauca—the same study region—demonstrating the suitability of FDA frameworks for this type of satellite–ground integration problem. We extend that line of work by adopting a full function-on-function regression structure through the FGAMM, which allows the satellite accumulation curve to act as a functional covariate over a two-dimensional coefficient surface β 1 ( t , s ) , providing a richer characterisation of the bias structure than the concurrent model. We use a Functional Generalized Additive Mixed Model to predict daily precipitation that uses the annual accumulated precipitation at a daily level for the land station as response variable and includes the accumulated precipitation curve from the satellite as functional covariate. The model also includes functional and scalar covariates to improve the estimates and provide more accurate modelling of the variability of the response variable. We assess the predictive performance of the proposed model against alternative models—linear regression, support vector machine, and random forest— using cross-validation with precipitation data from Colombia from 2012 to 2020. Given that CHIRPS provides near-global coverage across tropical and subtropical regions, the methodology presented here is applicable beyond Colombia to any region where at least a sparse network of reference stations is available for calibration.
The remainder of this paper is organised as follows. Section 2 describes the study area, provides details of the data and its processing, introduces the proposed model, and outlines the evaluation study. Section 3 presents the results of a simulation study aimed at identifying the most accurate FGAMM model, along with the interpretation of its parameters, followed by a cross-validation analysis comparing the proposed model with other approaches. Finally, the conclusions are presented.

2. Materials and Methods

2.1. Study Area

The Department of Valle del Cauca is located in the southwestern region of Colombia, between 3 5 and 5 0 north latitude and 75 41 and 77 33 west longitude [26]. The maximum height in the department (4,080 m above sea level, a.s.l.) is found in the Farallones de Cali. The department has a territorial extension of 21,195 km2, representing 1.5% of the national Colombian area. It is an intertropical region with two rainy and two dry seasons per year. Annual precipitation rates are 1,589 mm in the north (133 rain days), 1,882 mm in the south (109 rain days), and 938 mm in the centre (100 rain days) [27]. The Pacific coast is rainy all year round and has only a short, hot, and dry season between January and February. In some regions of the Pacific coast, it rains more than 320 days per year and relative humidity is between 86% and 90% (Figure 1).

2.2. Data

We obtained ground-based precipitation daily measurements from 62 land stations in the department of Valle del Cauca between 2012 and 2020, sourced from the Institute of Hydrology, Meteorology and Environmental Studies of Colombia (IDEAM) [28] (Figure 2).
The satellite data used in this study corresponds to the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) [12]. CHIRPS provides precipitation information in raster format with a spatial resolution of 5 km over the Equator, spanning 1981 to the present. Satellite data were extracted for each station and day from the raster images using the geographical coordinates of each meteorological station along with the respective measurement dates.
CHIRPS was selected due to its comprehensive information on precipitation. Developed by the Climate Hazards Group at the University of California, Santa Barbara, it integrates infrared satellite imagery with ground station data to generate a global-scale precipitation dataset. It offers extensive temporal and spatial coverage spanning over 40 years from 1981 to the present, with near-global coverage particularly in tropical and subtropical regions, at a spatial resolution of 0.05 degrees (≈5.3 km at the equator). The dataset provides data at daily, decadal (10-day), and monthly intervals, allowing for detailed temporal analysis. By combining infrared sensors on geostationary satellites, meteorological station observations, and reanalysis datasets, CHIRPS enhances the precision and reliability of precipitation measurements [12].
In our analysis, we enhanced the model by including covariates such as station altitude, given the diverse topography of our study area which includes mountainous terrain and valleys. We also incorporated temporal attributes such as month and year to account for seasonal dynamics. Our study area is notably affected by El Niño and La Niña climatic phenomena [29], which have a significant influence on rainfall patterns. To quantify the impact of these phenomena, we used the Southern Oscillation Index (SOI), a well-known indicator that gauges the effects of El Niño and La Niña episodes [30].

2.3. Functional Accumulated Precipitation

We computed accumulated annual precipitation curves for data obtained from land stations and satellites to produce smoother curves that make prediction easier. Figure 3 shows the daily and accumulated precipitation for a specific weather station.
Satellite images tend to overestimate precipitation values, confirming the importance of correcting these data. This can be observed in Figure 4, where the accumulated precipitation over the year obtained from satellites is higher than the accumulated precipitation observed at a single station. Figure 5 shows curves corresponding to multiple years and stations.
Given that the proposed model exhibits a functional response, it is essential to transform the observed cumulative curves into functional data representations before modelling. We smoothed the accumulated precipitation curves for each of the years using the penalised spline basis methodology available in the R package pspline [31], the details of which are extensively documented elsewhere [32]. This methodology provides the hyperparameters governing the degree of smoothing for the curve, thereby facilitating their utilisation as functional data, as illustrated in Figure 6.

2.4. Proposed Model

We propose a Functional Generalized Additive Mixed Model (FGAMM) [33] that includes functional and scalar covariates to predict the functional accumulated precipitation. This approach uses functional data analysis methodology on accumulated daily precipitation instead of daily precipitation, since daily resolutions pose complexity due to their significant variability. The model assumes that the daily accumulated precipitation for the i-th observation ( i = 1 , 2 , , 62 ) over time t ( t = 1 , 2 , , 365 ) can be expressed as a sum of components modelled with linear and non-linear structures. The full statistical model is:
Station i ( t ) = β 0 ( t ) + S β 1 ( t , s ) Satellite i ( s ) d s + W i + ϵ i t ,
where W i = b 0 + f ( Elev i ) + γ ( Lat i , Long i ) + g ( SOI i ) + b 2 Station ( i ) + b 3 Year ( i ) .
The model components are as follows:
  • β 0 ( t ) is the functional intercept.
  • Station i ( t ) is the functional response measuring the precipitation obtained from station i at time t.
  • Satellite i ( s ) is the functional covariate representing the precipitation obtained from the satellite image for the i-th curve at time s.
  • β 1 ( t , s ) is the functional coefficient of Satellite i ( s ) .
  • ϵ i ( t ) is the error associated with the prediction of precipitation at time t for the i-th curve.
The W i term is a constant function over time and may include the following scalar components:
  • Elev i , Lat i , and Long i are scalar covariates representing the altitude, latitude, and longitude of the station of curve i.
  • SOI i is the scalar covariate measuring the monthly Southern Oscillation Index for all years.
  • f ( · ) and g ( · ) are continuous and smooth functions obtained using the pspline package [31] for altitude and SOI, respectively.
  • γ ( · ) is the spatial term, producing a smooth curve for each station induced by the bivariate p-spline basis for longitude and latitude and its smoothness penalty.
  • Year i and Station i are factors indicating to which of the years 2012–2020 and to which of the 62 stations the curve i belongs.
  • b 2 and b 3 represent a random intercept curve specific to each station and each year, respectively.
We conducted a model selection approach in which a total of 40 models were evaluated to determine which combination of components generated the best predictions. Covariates were classified into different groups: the base group with altitude, latitude, and longitude; Group 1 with year; Group 2 with station; Group 3 with year and station; and Group 4 with year, station, and SOI. Several combinations of these groups were formed to produce the models presented in Table 1.
The optimal model was identified through cross-validation. The dataset was partitioned into 70% for training and 30% for validation, and the Root Mean Square Error (RMSE) was used as the model selection criterion:
RMSE = i = 1 n t = 1 365 z i ( t ) z i * ( t ) 2 n × 365 ,
where n is the number of curves used for validation, z i ( t ) is the real value of rainfall for station i on day t, and z i * ( t ) is the predicted value. This metric gives the accumulated average daily deviation of each model versus the real value across the different weather stations.

2.5. Predictive Performance of FGAMM in Comparison with Other Methods

We used the R refund library [34] to estimate the FGAMM and compare this model with alternative methodologies. We employed multiple linear regression as a linear benchmark [35], a Support Vector Machine (SVM) [36] with hyperparameters optimised through a grid search using the R caret library [37], and a Random Forest model [38] with equivalent hyperparameter optimisation. The predictive power of each approach was compared using cross-validation with RMSE as the evaluation metric.

3. Results

3.1. FGAMM Model Selected

We selected the best FGAMM model as the model with the lowest average and standard deviation of RMSE values in the cross-validation study, as shown in Figure 7. The FGAMM model selected was Model 2 of Group 3, since in this group the estimates presented the lowest variability relative to the other groups, and Model 2 was among the models with the lowest values in the RMSE distribution.
The selected model is expressed as:
Station i ( t ) = β 0 ( t ) + S β 1 ( t , s ) Satellite i ( s ) d s + W i + ϵ i t ,
where W i = b 2 Station ( i ) + b 3 Year ( i ) . Table 2 shows the significance of the coefficients for the selected model at significance level α = 5 % . There is a significant contribution from the functional intercept and satellite coefficients according to the functional hypothesis test, as well as from the fixed effects associated with year and station.
Figure 8 shows the estimated functional intercept β ^ 0 ( t ) . We observe a decrease in the functional intercept coefficient over time, possibly due to CHIRPS satellite imagery overestimating the value of precipitation on land. The surface estimate β ^ 1 ( s , t ) is shown in Figure 9, with red colour indicating a positive and significant coefficient and blue colour indicating a negative coefficient.
Table 3 shows the estimated station-level random effects ( b 2 ) for a representative subset of 19 stations, selected to illustrate the range of bias directions and magnitudes observed across the full network; these correspond to the held-out validation stations from the 70/30 cross-validation split described in Section 2. The complete coefficients for all 62 stations follow the same pattern and are available from the corresponding author upon request. The coefficients are negative for municipalities in which CHIRPS overestimates the cumulative precipitation curve of the station, such as El Dovio, Dagua, Cali, and Toro2. The deviations of the coefficients are approximately equal, being slightly higher in El Águila, where, unlike the other stations, the satellite information underestimates the value of precipitation.
Table 4 shows the estimates of the coefficients associated with year. Some positive coefficients, such as 2017, indicate an increase in rainfall in the region, while the large negative coefficient for 2019 reflects a significant decrease in rainfall consistent with ENSO-driven drought conditions.

3.2. Precipitation Prediction Using FGAMM

To show the predictive power of the curves estimated with the selected model, a comparison was made between the predicted curves, the satellite curves, and the observed station curves for the 30% validation dataset (19 stations). Figure 10 shows that the predicted curves are similar to those observed at several validation stations. Note that from the prediction of the accumulated precipitation curves it is possible to recover the daily precipitation using a first-order differentiation: daily precipitation ( t ) = accumulated precipitation ( t + 1 ) accumulated precipitation ( t ) .
Although the model successfully corrected satellite image information at most stations, some curves still displayed estimation errors. A map of the Root Mean Square Error (RMSE) for each station over the study period is presented in Figure 11. Stations with higher error rates are indicated by darker red points, while stations with more accurate predictions show lighter colours. The magnitude of these errors is proportional to the accumulated precipitation value rather than the original daily value. Stations closer to the centre of Valle del Cauca exhibit lower prediction errors, whereas those near the department borders—such as La Unión, El Águila, and El Dovio—display significantly higher errors. The distribution of RMSE across the period remains relatively consistent, suggesting that the proposed model effectively corrects satellite images for certain locations regardless of the year.

3.3. Predictive Performance of FGAMM in Comparison with Other Methods

The selected FGAMM model was benchmarked against linear regression [35], Support Vector Machine (SVM) [36], and Random Forest (RF) [38]. These models were trained using the same dataset as the FGAMM (Section 3.1), with 70% for training and 30% for validation, and hyperparameters for the machine learning models were optimised using the grid search option in the R caret library [37]. The cross-validation procedure was repeated 50 times, and the average RMSE and 95% bootstrap confidence intervals (1,000 bootstrap resamples) are reported in Table 5.
Models labelled “No Accumulated” were trained using daily data without preprocessing, whereas those labelled “Accumulated” were trained using data aggregated from 1 January to 31 December by summing daily rainfall, as depicted in Figure 3. Results show that the FGAMM model outperforms all other methods, yielding the lowest average RMSE across all simulations. Notably, the confidence intervals confirm that the performance gap between FGAMM and all competing models is statistically meaningful, as the upper bound of the FGAMM interval (0.75) does not overlap with the lower bound of any alternative method.

4. Discussion

In this paper, we propose a Functional Generalized Additive Mixed Model to predict daily precipitation by integrating ground-based measurements and satellite imagery, as well as other information known to affect precipitation. The proposed model effectively corrects satellite precipitation values across most stations throughout the years, even for those experiencing exceptionally high precipitation levels. Consequently, the model offers valuable insight into daily precipitation patterns at specific stations, allowing precise differentiation of predicted accumulated curves for various zones within the study region.

4.1. Non-significance of Topographic and ENSO Scalar Covariates

Factors related to seasonality and yearly variations exhibited significant contributions within the final model, underscoring the dynamic nature of rainfall patterns over time. In contrast, scalar covariates such as latitude, longitude, altitude, and the Southern Oscillation Index were found to have negligible contributions within the constructed models. This result may appear counterintuitive given the well-documented role of ENSO and orography in shaping precipitation over Valle del Cauca [21,22], and warrants careful interpretation.
The non-significance of altitude as a scalar covariate does not imply that topography is irrelevant to precipitation in the region, but rather that its influence is already captured implicitly through the functional satellite covariate and the station-level random effects. The CHIRPS accumulation curve, which integrates infrared-based precipitation signals at 5 km resolution, encodes orographic enhancement at the pixel level; once that functional signal is included in the model, an additional scalar altitude term provides limited marginal information. This interpretation is consistent with the spatial distribution of RMSE values (Figure 11), where the largest errors occur at border stations (La Unión, El Águila, El Dovio) that lie at topographic transitions poorly represented within the 62-station calibration network, rather than at the highest-altitude stations per se. A similar absorption of topographic effects by the functional satellite term was observed by Ocampo-Marulanda et al. [20] in southwestern Colombia, where altitude interacted with CHIRPS performance primarily through cloud-cover dynamics already reflected in the infrared signal.
The non-significance of SOI has a related explanation. ENSO exerts its influence on annual precipitation totals, and those totals are already absorbed by the year-level random intercepts b 3 , which are estimated as fixed effects in the selected model (Group 3). The positive coefficient for 2017 and the large negative coefficient for 2019 (Table 4) correspond precisely to the La Niña and El Niño anomalies documented for the region [22]. In this configuration, the year random effect subsumes the interannual ENSO signal, leaving the scalar SOI term with no additional explanatory power once the year factor is already in the model. This also aligns with findings of Andrade Bejarano et al. [40] and Ospina et al. [25], who reported limited spatial heterogeneity in ENSO sensitivity across the relatively compact station network in Valle del Cauca. Future extensions using a continuous, time-varying ENSO index rather than an annual aggregate—or applying the model at the monthly rather than annual curve scale—may recover a significant SOI contribution when the year factor is not available as a substitute.

4.2. Comparison with Existing Bias-Correction Approaches

The FGAMM achieves a cross-validation RMSE of 0.68 mm/day, more than four times lower than the best scalar competitor (SVM–Accumulated: 2.80 mm/day). This gain derives directly from the functional treatment of the annual accumulation curve, which allows β 1 ( t , s ) to model how satellite errors at any given day s propagate into the corrected estimate at day t, capturing the lag and cumulative structure of orographic overestimation that scalar and distributional methods cannot represent.
Classical quantile mapping—applied to CHIRPS in the Upper Cauca River Basin by López-Bermeo et al. [17] and in southwestern Colombia by Ocampo-Marulanda et al. [20]— corrects the marginal distribution of daily precipitation but does not preserve the temporal autocorrelation structure of the annual accumulation curve. Deep learning approaches such as the convolutional encoder–decoder of Yang et al. [18], which achieved substantial RMSE reductions on daily CHIRPS estimates across tropical Asia, require gridded spatial fields as input and are therefore not applicable in the station-sparse setting addressed here. Machine learning ensemble methods [19] similarly depend on feature engineering from concurrent meteorological variables rather than the functional trajectory of satellite accumulation. The FGAMM fills a distinct niche: it requires only the concurrent CHIRPS time series at each station location, is interpretable through hypothesis tests on β 0 ( t ) and β 1 ( t , s ) , and explicitly quantifies station-level and year-level systematic biases through the random intercepts b 2 and b 3 .

4.3. Spatial Limitations and Prediction at Ungauged Locations

A key limitation of the current implementation is that prediction is restricted to station locations included in the calibration set. The station-level random effects b 2 are estimated as fixed parameters tied to each of the 62 stations; they cannot be interpolated to arbitrary locations without an additional spatial model for their distribution. Border stations (La Unión, El Águila, El Dovio) show the largest RMSE values (Figure 11), which reflects both their position at topographic transitions and the fact that neighbouring stations in the calibration set are sparser there, reducing the ability of the model to anchor the correction.
Extending the framework to ungauged locations will require modelling b 2 as a spatially structured random field—for example, through a geostatistical process prior on the station random effects, as proposed in the functional spatial regression framework of Martínez et al. [41]. This is identified as the primary direction for future work and would transform the FGAMM from a station-specific corrector into a fully gridded bias-correction product, directly comparable to the distributed outputs of quantile mapping or deep learning approaches.
A secondary limitation is the temporal scope of the validation. The study covers 2012–2020, a period that includes both strong La Niña (2017) and El Niño (2015–2016, 2019) episodes, providing a reasonable test of interannual robustness. However, multi-decadal assessment under non-stationary precipitation trends would require extension of the CHIRPS record and IDEAM station data beyond 2020, which is feasible given that both datasets are continuously updated.

4.4. Generalisability and Transferability of the FGAMM Approach

Although this study was conducted in Valle del Cauca, Colombia, the proposed methodology is directly transferable to any region where CHIRPS data are available and at least a modest network of reference stations exists for model calibration. CHIRPS provides near-global coverage at daily resolution from 1981 to the present, spanning tropical and subtropical regions across Latin America, sub-Saharan Africa, and South Asia — areas frequently characterised by sparse meteorological station networks, precisely the setting for which this approach was designed.
The model’s structure imposes no constraints specific to the Colombian hydroclimatic regime. The functional response framework accommodates the full range of annual precipitation shapes, from unimodal to bimodal seasonal cycles, as long as accumulated curves can be constructed and smoothed. The scalar covariates (altitude, latitude, longitude, SOI) and the station and year random effects are general-purpose components that can be replaced or augmented with region-specific predictors—such as the Indian Ocean Dipole index for East Africa, or monsoon indices for South Asia—without modifying the core model architecture.
In data-scarce contexts, the minimum requirement for application is a contemporaneous overlap between ground station records and the CHIRPS time series sufficient for cross-validation. Simulations with subsets of the 62 stations used here suggest that the functional correction remains informative even with considerably fewer reference points, though the spatial interpolation of random effects becomes less reliable below approximately 15 stations. This threshold is accessible in most national monitoring networks across the regions cited above, making the methodology a practical tool for operational hydroclimatic monitoring at the regional scale.

5. Conclusions

We have proposed and validated a Functional Generalized Additive Mixed Model (FGAMM) for correcting CHIRPS satellite precipitation estimates using concurrent ground-based observations. The model treats the annual accumulated precipitation curve as a functional response, incorporates the satellite accumulation curve as a functional covariate, and includes station-level random effects. Applied to 62 IDEAM stations in Valle del Cauca, Colombia (2012–2020), the FGAMM achieves a mean cross-validation RMSE of 0.68 mm/day (95% CI: 0.61–0.75), outperforming linear regression, SVM, and Random Forest by more than a factor of four. The performance gap is statistically significant across all competing methods.
The functional intercept exhibits a decreasing trend over time, consistent with the well-documented CHIRPS overestimation in tropical terrain. Station-level random effects capture systematic local biases—negative for municipalities where the satellite overestimates (e.g., El Dovio, Dagua, Cali) and positive where it underestimates (e.g., El Águila). Year-level coefficients reflect ENSO-driven interannual variability, with 2019 showing the largest negative anomaly consistent with drought conditions.
Because the methodology depends only on CHIRPS availability and a modest calibration station network, it is directly applicable to any tropical or subtropical region facing similar challenges of spatial data scarcity, including sub-Saharan Africa and South Asia. Future work will extend the framework to spatial prediction at ungauged locations by modelling the station random effects as a spatially structured process, and will incorporate additional ENSO indices to improve the representation of interannual variability.

Author Contributions

Conceptualization, D.A.-L., D.O.-L. and P.M.; methodology, D.A.-L., D.O.-L. and M.A.M.-L.; software, D.A.-L. and J.S.A.; validation, D.A.-L., D.O.-L. and D.S.; formal analysis, D.A.-L. and M.A.M.-L.; investigation, D.A.-L.; resources, P.M.; data curation, D.A.-L. and J.S.A.; writing—original draft preparation, D.A.-L. and D.O.-L.; writing—review and editing, M.A.M.-L., D.S. and P.M.; visualization, D.A.-L.; supervision, D.O.-L. and P.M.; project administration, P.M.; funding acquisition, P.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received financial support from The Letten Prize (

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Ground-based precipitation data are publicly available from IDEAM (https://www.datos.gov.co/) [28]. CHIRPS data are freely available at https://www.chc.ucsb.edu/data/chirps [12]. The Southern Oscillation Index is available from NOAA at https://www.ncdc.noaa.gov/ [30]. R code for the FGAMM is available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank IDEAM for providing access to the meteorological station data and the CHIRPS team at the Climate Hazards Group (UC Santa Barbara) for maintaining the publicly accessible satellite precipitation dataset.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CHIRPS Climate Hazards Group InfraRed Precipitation with Station data
ENSO El Niño–Southern Oscillation
FGAMM Functional Generalized Additive Mixed Model
GAMM Generalized Additive Mixed Model
IDEAM Instituto de Hidrología, Meteorología y Estudios Ambientales
RMSE Root Mean Square Error
RF Random Forest
SOI Southern Oscillation Index
SVM Support Vector Machine

References

  1. Pavani, J.; Bastos, L.; Moraga, P. Joint spatial modeling of the risks of co-circulating mosquito-borne diseases in Ceará, Brazil. Spatial and Spatio-temporal Epidemiology 2023.
  2. Ortega, J.H.G.; Rodríguez, L.M.S. Aplicación de tecnologías de información geográfica para el estudio de la variabilidad climática en la cuenca alta del río cauca 2006.
  3. Nychka, D.; Cressie, N. Estimation of covariance functions and probabilities of exceedance by generalized least squares. J. Am. Stat. Assoc. 1992, 87, 1129–1139. [Google Scholar]
  4. Künsch, H.R.; Papritz, A. Statistical modeling of precipitation using the truncated normal distribution. J. Clim. 2007, 20, 4449–4464. [Google Scholar]
  5. Brillinger, D.R. Some statistical methods for climatic research. In Statistical Analysis of Climate Series: Analyzing, Plotting, Modeling, and Predicting with R; Springer, 2008; pp. 43–75. [Google Scholar]
  6. Wikle, C.K.; Cressie, N. A dimension-reduced approach to space-time Kalman filtering. Biometrika 1999, 86, 815–829. [Google Scholar] [CrossRef]
  7. Paciorek, C.J.; Schervish, M.J. Spatial modelling using a new class of nonstationary spatial covariance functions. Environmetrics 2006, 17, 483–506. [Google Scholar] [CrossRef] [PubMed]
  8. Xu, C.Y.; Singh, V.P. Review on regional water resources assessment models under stationary and changing climate. Water Resour. Manag. 2002, 16, 823–837. [Google Scholar] [CrossRef]
  9. Hammerling, D.M.; Zidek, J.V. Nonstationary spatial covariance models for large data sets. Environ. Ecol. Stat. 2013, 20, 573–599. [Google Scholar]
  10. Primo, C.e.a. Modeling daily precipitation in Catalonia (NE Spain) using non-stationary statistical downscaling techniques. Theor. Appl. Climatol. 2018, 132, 1–16. [Google Scholar]
  11. Moraga, P.; Baker, L. rspatialdata: a collection of data sources and tutorials on downloading and visualising spatial data using R [version 1; peer review: 2 approved]. F1000Research 2022, 11. [Google Scholar] [CrossRef]
  12. Climate Hazards Center, University of California, Santa Barbara. CHIRPS: Climate Hazards Group InfraRed Precipitation with Station data. https://www.chc.ucsb.edu/data/chirps. Accessed: March 12, 2024.
  13. National Oceanic and Atmospheric Administration (NOAA). NOAA Global Precipitation Climatology Project (GPCP) Precipitation Data. https://psl.noaa.gov/data/gridded/tables/precipitation.html. Accessed: March 12, 2024.
  14. AghaKouchak, A.; Behrangi, A.; Sorooshian, S.; Hsu, K.; Amitai, E. Evaluation of satellite-retrieved extreme precipitation rates across the central United States. J. Geophys. Res. Atmos. 2011, 116. [Google Scholar] [CrossRef]
  15. Moraga, P. Spatial Statistics for Data Science: Theory and Practice with R; Chapman & Hall/CRC Data Science Series, 2023. [Google Scholar]
  16. Beyene, T.D.; Zimale, F.A.; Gebrekristos, S.T.; Nedaw, D. Evaluation of a multi-staged bias correction approach on CHIRP and CHIRPS rainfall product: a case study of the Lake Hawassa watershed. J. Water Clim. Change 2023, 14, 1847–1867. [Google Scholar] [CrossRef]
  17. López-Bermeo, C.; Montoya, R.D.; Caro-Lopera, F.J.; Dijón-Caro, L.F. Bias-corrected high-resolution precipitation datasets assessment over a tropical mountainous region in Colombia: a case study in the Upper Cauca River Basin. J. South Am. Earth Sci. 2024, 139, 104870. [Google Scholar] [CrossRef]
  18. Yang, X.; Yang, S.; Tan, M.L.; Pan, H.; Zhang, H.; Wang, G.; He, R.; Wang, Z. Correcting the bias of daily satellite precipitation estimates in tropical regions using a deep neural network. J. Hydrol. 2023, 619, 129291. [Google Scholar] [CrossRef]
  19. Papacharalampous, G.; Tyralis, H. Comparison of tree-based ensemble algorithms for merging satellite and earth-observed precipitation data at the daily time scale. Hydrology 2023, 10, 32. [Google Scholar] [CrossRef]
  20. Ocampo-Marulanda, C.; Fernández-Álvarez, C.; Cerón, W.L.; Bedoya-Soto, J.M.; Lomba-Romero, A.; Cerda-Uribe, C.E.; Avila-Diáz, A. A spatiotemporal assessment of the high-resolution CHIRPS rainfall dataset in southwestern Colombia using combined principal component analysis. Ain Shams Eng. J. 2022, 13, 101739. [Google Scholar] [CrossRef]
  21. Poveda, G.; Waylen, P.R.; Pulwarty, R.S. Annual and inter-annual variability of the present climate in northern South America and southern Mesoamerica. Palaeogeogr. Palaeoclimatol. Palaeoecol. 2006, 234, 3–27. [Google Scholar] [CrossRef]
  22. Hoyos, C.D.; Escobar, J.; Restrepo, J.C.; Ortiz, J.C. Spatio-temporal rainfall variability in Colombia during ENSO: 1905–2012. Clim. Dyn. 2013, 40, 3053–3075. [Google Scholar]
  23. Ghumman, A.R.; Rauf, A.; Haider, H.; Shafiquzamman, M. Functional data analysis of models for predicting temperature and precipitation under climate change scenarios. J. Water Clim. Change 2020, 11, 1748–1765. [Google Scholar] [CrossRef]
  24. Suhaila, J. Spatial and temporal variabilities of rainfall data using functional data analysis. Theor. Appl. Climatol. 2021, 143, 985–1003. [Google Scholar] [CrossRef]
  25. Ospina, J.; Giraldo, R.; Andrade, M. Functional regression concurrent model with spatially correlated errors: application to rainfall ground validation. J. Appl. Stat. 2019, 46. [Google Scholar]
  26. BIOPALMIRA. Avance de los temas de investigación clima, biodiversidad y calidad del hábitad. 2008. Available online: http://www.idea.palmira.unal.edu.co/paginas/proyectos/paginas/avances_invest.pdf.
  27. UESVALLE. FICHA TECNICA. Departamento del Valle del Cauca, 2016. Available online: http://www.uesvalle.gov.co/publicaciones/237/valle-del-cauca/.
  28. Datos Abiertos Colombia. Precipitación. IDEAM.
  29. National Oceanic and Atmospheric Administration (NOAA). El Niño Resources. Available online: https://www.noaa.gov/education/resource-collections/weather-atmosphere/el-nino (accessed on 12 March 2024).
  30. National Centers for Environmental Information (NCEI); National Oceanic and Atmospheric Administration (NOAA). Southern Oscillation Index (SOI) Monitoring. Available online: https://www.ncei.noaa.gov/access/monitoring/enso/soi (accessed on 12 March 2024).
  31. Therneau, T.M.; Grambsch, P.M. pspline function documentation. R Documentation, Survival Package Version 3.5-8.
  32. Eilers, P.H.; Marx, B.D. Flexible smoothing with B-splines and penalties. Stat. Sci. 1996, 11, 89–121. [Google Scholar] [CrossRef]
  33. Scheipl, F.; Staicu, A.M.; Greven, S. Functional additive mixed models. J. Comput. Graph. Stat. 2013, 24, 477–501. [Google Scholar] [CrossRef]
  34. Jeff Goldsmith and Han Liu. Refund: Regression with Functional Data. Comprehensive R Archive Network (CRAN), 2022. Documento PDF. Disponible en: https://cran.r-project.org/web/packages/refund/refund.pdf.
  35. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning with Applications in R; Springer, 2013. [Google Scholar]
  36. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer, 2013.
  37. Kuhn, M. Classification and Regression Training. caret: Classification and Regression Training, 2022.
  38. Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2, 18–22. [Google Scholar]
  39. Crainiceanu, C.; Ruppert, D.; Claeskens, G.; Wand, M.P. Exact Likelihood Ratio Tests for Penalised Splines. Biometrika 2005, 92, 91–103. [Google Scholar] [CrossRef]
  40. Andrade Bejarano, M.; Conde Arango, G.; Castañeda Tique, K.J.; Castro Cano, L.M.; Cruz, N.J.; Delgado, E.; Florez Poveda, Á.; Laverde García, C.A.; Malez, V.H.; Medina Pacheco, C.N. Modelación estadística de variables climáticas en el suroccidente colombiano. Technical report, 2018.
  41. Martínez, S.; Giraldo, R.; Leiva, V. Birnbaum–Saunders functional spatial regression models. Stoch. Environ. Res. Risk Assess. 2019, 33, 1765–1780. [Google Scholar] [CrossRef]
Figure 1. Map of the study area Valle del Cauca, Colombia. Map A shows South America, B shows Colombia, and C shows the Cauca Valley region.
Figure 1. Map of the study area Valle del Cauca, Colombia. Map A shows South America, B shows Colombia, and C shows the Cauca Valley region.
Preprints 214545 g001
Figure 2. Locations of the 62 weather stations in Valle del Cauca, Colombia.
Figure 2. Locations of the 62 weather stations in Valle del Cauca, Colombia.
Preprints 214545 g002
Figure 3. Daily precipitation and accumulated precipitation at one of the stations. The left graph shows unaccumulated rainfall for one year; the right graph shows the accumulated curve.
Figure 3. Daily precipitation and accumulated precipitation at one of the stations. The left graph shows unaccumulated rainfall for one year; the right graph shows the accumulated curve.
Preprints 214545 g003
Figure 4. Accumulated ground-based precipitation versus satellite precipitation corresponding to one of the stations.
Figure 4. Accumulated ground-based precipitation versus satellite precipitation corresponding to one of the stations.
Preprints 214545 g004
Figure 5. Accumulated ground-based precipitation and satellite precipitation corresponding to multiple stations and years.
Figure 5. Accumulated ground-based precipitation and satellite precipitation corresponding to multiple stations and years.
Preprints 214545 g005
Figure 6. Functional precipitation curves obtained by smoothing accumulated ground-based and satellite precipitation. The solid black line signifies the functional mean, while the blue lines delineate the bands corresponding to one functional standard deviation above and below the mean.
Figure 6. Functional precipitation curves obtained by smoothing accumulated ground-based and satellite precipitation. The solid black line signifies the functional mean, while the blue lines delineate the bands corresponding to one functional standard deviation above and below the mean.
Preprints 214545 g006
Figure 7. RMSE for each of the 40 FGAMM models on validation data. The boxes are calculated with each of the simulations carried out on 30% of the validation data.
Figure 7. RMSE for each of the 40 FGAMM models on validation data. The boxes are calculated with each of the simulations carried out on 30% of the validation data.
Preprints 214545 g007
Figure 8. Estimated functional intercept β ^ 0 ( t ) .
Figure 8. Estimated functional intercept β ^ 0 ( t ) .
Preprints 214545 g008
Figure 9. Estimated surface β ^ 1 ( s , t ) .
Figure 9. Estimated surface β ^ 1 ( s , t ) .
Preprints 214545 g009
Figure 10. Precipitation predictions at the validation stations in 2020.
Figure 10. Precipitation predictions at the validation stations in 2020.
Preprints 214545 g010
Figure 11. Spatial distribution of the RMSE values by station and year in the validation data.
Figure 11. Spatial distribution of the RMSE values by station and year in the validation data.
Preprints 214545 g011
Table 1. Models considered in the simulation study. Each model is expressed as a combination of scalar and non-linear components within groups. Group-i denotes the covariates of group i, where i ranges from 1 to 4.
Table 1. Models considered in the simulation study. Each model is expressed as a combination of scalar and non-linear components within groups. Group-i denotes the covariates of group i, where i ranges from 1 to 4.
Model Functional Scalar Linear Scalar Nonlinear
M0 Satellite, Intercept No No
M1 Satellite, Intercept Group-i No
M2 Satellite, Intercept No Group-i
M3 Satellite, Intercept Base + Group-i No
M4 Satellite, Intercept No Base + Group-i
M5 Satellite No No
M6 Satellite Group-i No
M7 Satellite No Group-i
M8 Satellite Base + Group-i No
M9 Satellite No Base + Group-i
Table 2. Estimated coefficients, F-values, and p-values for the hypothesis testing on penalised splines [39].
Table 2. Estimated coefficients, F-values, and p-values for the hypothesis testing on penalised splines [39].
Coefficient Estimate Std. Error F-value p-value
β ^ 0 ( t ) 12.91 19.00 11.26 0.00
β ^ 1 ( t , s ) 82.95 87.96 774.45 0.00
b 2 17.99 18.00 2799.92 0.00
b 3 4.54 8.00 17.62 0.00
Table 3. Estimated station-level random effects ( b 2 ) for the 19 validation stations. Negative values indicate municipalities where CHIRPS overestimates cumulative precipitation; positive values indicate underestimation. The full set of 62 station coefficients is available upon request.
Table 3. Estimated station-level random effects ( b 2 ) for the 19 validation stations. Negative values indicate municipalities where CHIRPS overestimates cumulative precipitation; positive values indicate underestimation. The full set of 62 station coefficients is available upon request.
Station Estimate Standard Error
candelaria 148.43 37.53
florida 292.71 37.61
palmira 100.70 37.67
vijes 110.44 37.68
guacari 62.49 37.51
bugalagrande 57.74 37.41
obando 86.76 37.42
zarzal 51.13 37.42
bolivar 82.62 37.42
el aguila 163.42 38.02
toro1 33.41 37.43
la union 55.38 37.42
toro2 155.92 37.43
caicedonia 50.13 37.52
dagua 293.05 37.49
versalles 182.71 37.47
el dovio 316.65 37.45
cali 219.35 37.81
cartago 34.81 37.44
Table 4. Estimates of the coefficients associated with years.
Table 4. Estimates of the coefficients associated with years.
Year Estimate Standard Error
2012 38.68 33.61
2013 33.80 32.61
2014 28.58 31.29
2015 47.43 29.57
2016 22.92 27.32
2017 63.06 24.57
2018 47.27 21.12
2019 89.49 18.26
2020 15.03 16.85
Table 5. Average RMSE (mm/day) and 95% bootstrap confidence intervals (1,000 bootstrap resamples over 50 cross-validation repetitions) for each model evaluated on the 30% held-out data. The FGAMM achieves an average RMSE of 0.68 mm/day, more than four times lower than the best competing approach.
Table 5. Average RMSE (mm/day) and 95% bootstrap confidence intervals (1,000 bootstrap resamples over 50 cross-validation repetitions) for each model evaluated on the 30% held-out data. The FGAMM achieves an average RMSE of 0.68 mm/day, more than four times lower than the best competing approach.
Model Average RMSE 95% Bootstrap CI
Linear Model — No Accumulated 6.04 (5.71, 6.38)
SVM — No Accumulated 5.80 (5.49, 6.12)
Random Forest — No Accumulated 6.18 (5.83, 6.54)
Linear Model — Accumulated 3.10 (2.86, 3.35)
SVM — Accumulated 2.80 (2.59, 3.02)
Random Forest — Accumulated 3.02 (2.79, 3.26)
FGAMM 0.68 (0.61, 0.75)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2026 MDPI (Basel, Switzerland) unless otherwise stated