Ensemble Machine Learning Outperforms Empirical Equations for the Ground Heat Flux Estimation with Remote Sensing Data

: Estimating evapotranspiration at the ﬁeld scale is a major component of sustainable water management. Due to the difﬁculty to assess some major unknowns of the water cycle at that scale, including irrigation amounts, evapotranspiration is often computed as the residual of the instantaneous surface energy budget. One of the Surface Energy Balance components with the largest uncertainties in their quantiﬁcation over bare soils and sparse vegetation areas is the ground heat ﬂux (G). Over the last decades, the estimation of G with remote sensing (RS) data has been mainly achieved with empirical equations, on the basis of the G and net radiation (Rn) ratio, G/Rn. The G/Rn empirical equations generally require vegetation data (Type I empirical equations), in combination with surface temperature (Ts) and albedo (Type II empirical equations). In this article, we aim to evaluate the estimation of G with RS data. Here, we compared eight G/Rn empirical equations against two types of machine learning (ML) methods: an ensemble ML type, the Random Forest (RF), and the Neural Networks (NN). The comparison of each method was evaluated using a wide range of climate and land cover datasets, including data from Eddy-Covariance towers that extend along the mid-latitude areas that encompass the European and African continents. Our results have shown evidence that the driver of G in bare soils and sparse vegetation areas (Fraction of Vegetation, Fv ≤ 0.25) is Ts, instead of vegetation greenness indexes. On the other hand, the accuracy in the estimation of G with Rn, Ts or Fv decreases in densely vegetated areas (Fv ≥ 0.50). There are no signiﬁcant differences between the most accurate Type I and II empirical equations. For bare soils and sparse vegetation areas the empirical equation which combines the Leaf Area Index (LAI) and Ts (E7) estimates G best. In densely vegetated areas, an exponential empirical equation based on Fv (E4), shows the best performance. However, ML better estimates G than the empirical equations, independently of the Fv ranges. An RF model with Rn, LAI and Ts as predictor variables shows the best accuracy and performance metrics, outperforming the NN model.


Introduction
Subtropical and mid-latitude regions such as the Mediterranean area have been identified as one of the climate hot spots of the earth [1]. Since the 1980s, in the North of Africa and South of Europe, the expansion of the Hadley cell during the warm half of the year has caused a poleward shift in the tropical high-pressure systems; triggering decreases in precipitation, and an increase in drought frequency during the warm half of the year [2]. Positive anomalies of the anticyclonic weather types in the subtropical belt during the warm half of the year have provoked a northward migration of mid-latitude climate types [3]. In addition, increases in temperature and atmospheric water demand have caused an increase in drought severity [4]. By the mid-end of the 21st century, climate projections for midlatitude areas project an upward trend of temperature and heat waves [5]. In combination with the projected decreases in precipitation [6], this will trigger an increase of aridity and drought episodes [7]. Dwindling water resources have direct consequences in the ecological as well as socioeconomic sphere [8] and the changing climate scenarios pointed out require further efforts for the improvement of the water estimation. In water-limited areas, evapotranspiration (ET), the flux of water returned back to the atmosphere from the soil (evaporation) or crop canopy (transpiration), is the main relevant negative flux (loss) of the hydrological cycle [9]. Generally, the estimation of ET with Remote Sensing (RS) data is performed with single-source energy or dual-source energy balance models. For the latter case, the vegetation and the soil are analyzed individually based on their respective surface energy balance (SEB). With RS data, the estimation of ET is obtained as the residual term of the SEB, or in other words, the difference between the SEB heat fluxes (LE = Rn − H − G)); where LE is the latent heat flux, Rn is the net radiation, H is the sensible heat flux and G is the energy absorbed or released at the soil surface (all in W/m 2 ). The latter, G, is one of the SEB components most difficult to quantify [10]. Hence, further efforts for the bias reduction of the G are required in order to improve the energy fluxes closure, and in the end, provide a better estimation of ET.
The estimation of G at local scales is generally performed with the Harmonic method, based on the Fourier series [11][12][13][14][15][16]. However, the Harmonic method cannot estimate the G for large areas, since it requires soil properties data (i.e., soil conductivity and temperature, at different soil depths), which are only measured at specific sites; usually with other SEB fluxes, in Eddy-Covariance (EC) towers. Remote sensing data can overcome the spatial limitations mentioned, providing an estimation of G with acceptable temporal and spatial resolution. For this reason, this article is focused only on the estimation of G with RS data. Over the last decades, the scientific literature has proposed several formulations for the estimation of G with RS data. The proposed methods could be summarized into two groups: (i) an adaptation of the Harmonic method, with an estimation of surface temperature (Ts) based on radiometric brightness, usually Meteosat data, but with a coarse spatial-resolution (ca. 3 km) limitation [17,18]. On the other hand, for sun-synchronous satellites with a single daytime overpass, the estimation of G is also performed with (ii) the ratio of G and Rn, the so-called G/Rn ratio [19,20]. In fact, several empirical equations have been applied for the estimation of the G/Rn. For instance, the ratio can be calculated following a sinusoidal function depending on the hour of the day and the maximum G/Rn observed [21] or with meteorological parameters, such as wind [22]. However, the most common approach is the estimation of the G/Rn with vegetation greenness data. The G/Rn is usually calculated with vegetation indexes, such as the well-known Normalized Difference Vegetation Index (NDVI) [23], the Leaf Area Index (LAI) [24], the Fraction of Vegetation (Fv) [25], the Modified Soil Adjusted Index (MSAVI) [26], or determined as a constant depending on vegetation height [27]. Finally, the G/Rn can also be estimated with the vegetation indexes named, in combination with Ts [28] and albedo (α) data [29]. The accuracy of the different empirical equations based on the G/Rn ratio for the estimation of G has been analyzed in previous works [30,31], showing large uncertainties, especially in high canopies covers. Therefore, other methodologies should be proposed for the estimation of G.
In the last decade, machine learning (ML) methods have been applied in environmental science, generally obtaining accurate results. Nevertheless, until today, only Canelón et al. [32] modelled the G with RS data and the Artificial Neural Networks (ANN) ML algorithm. Furthermore, only de Andrade et al. [33] provided a comparison of the ANN model against two G/Rn empirical equations. To the best of our knowledge, no study compared the performance of ensemble ML models, such as boosted regression trees (e.g., Random Forest (RF)), against the results obtained with neural networks (NN), and the ones obtained with the calibrated version of the G/Rn empirical equations. In addition, an estimation of G with the above-mentioned methods, evaluated over an extensive dataset of EC data in the tropical and mid-latitude area is still lacking. In this article, we aim to address these knowledge gaps, by proving the first systematic evaluation of the estimation of G with RF, NN and eight G/Rn empirical equations, evaluated by vegetation ranges. The analysis is performed in a climate hot-spot area (e.g., Mediterranean basin), with a dense dataset of G measurements acquired at several EC towers, with a wide range of land covers and climate types.
The manuscript is structured as follows: in Section 2 we present a description of the geographical settings of each EC, the data used and the methods implemented together with the evaluation metrics used. Subsequently, in Section 3 the results are presented and discussed. Finally, the conclusions are summarized in Section 4.

Materials and Methods
In this section, we present a detailed description of the meteorological and vegetation data used in this work together with the methodology followed.

Experimental Datasets Description
The spatial distribution of the EC towers included in this work is found in Figure 1, whereas the main geographical details of each EC tower are summarized in Table 1. The dataset analyzed includes daily records of vegetation greenness data (LAI), recorded with hemispherical photographs, once every 2 to 3 weeks, during specific stages of the phenological crop season. Haouz (Hao), Lamasquère (Lam) and Avignon (Avi) LAI records were validated with a planimeter using destructive methods. Days without LAI data were gap-filled with the broadly used local polynomial regression (LOESS) [31]. Meteorological, radiation budget and surface heat fluxes records were measured at meteorological EC towers. In this case, we used 30 min records of meteorological data (Ts, Rn, G and α) at 10:30 am and 1:30 pm, corresponding to typical sun-synchronous satellite overpass times. Meteorological records without data were excluded from the analysis. The data has been quality checked and previously used at [34]. Net radiation ratio records were acquired by a 4-component net radiometer (CNR1, 4.5-42 µm wavelength domain manufactured by Kipp & Zonen). Ground heat fluxes were acquired by self-calibrated heat plates sensors (HFP01SC, manufactured by HukseFlux) placed at different depths ranging from 5 cm to 100 cm. Depending on the site, different Ts measurement devices were installed. Further details of each site can be found below.
Barbeau (Bar) is located 60 km southeast of Paris (N of France) at 90 m of elevation. Bar is an oak-forest site (Fv > 0.95) under the Cfb climate influence, with a mean annual air temperature (MAAT) of 11 • C and a mean annual precipitation (MAP) of 680 mm. Ts was recorded by an infrared temperature sensor (IR120, 8-14 µm wavelength domain manufactured by Campbell Scientific Inc.) (Logan, UT, USA) and a Type-T (manufactured by Thermocouple) sensor. An accurate description of the Bar site is provided at [35].
Moving southwards, at 25 km southwest of Toulouse (southwest France), two EC sites at a distance of 12 km are found: Auradé (Aur) and Lam. Aur (165 m of elevation) and Lam (185 m) sites are under the influence of a Cfb climate type, with a MAAT of 13 • C and MAP of 656 mm. Both of the EC sites include crops of wheat and sunflower (depending on the year), reaching mid to high LAI values (ca. 0.5; Figure 2a). Aur Ts was recorded by IRTS-P infrared sensors (6 to 14 µm wavelength domain, manufactured by Campbell Scientific Inc.) and a CS10X (manufactured by Campbell) sensor, placed at different depths from 4 to 100 cm under the surface. Lam Ts was calculated with upwelling longwave radiation data and an infrared temperature sensor CS-109 (manufactured by Campbell), with measurements at different depths ranging from 1 to 100 cm under the ground. Further technical details of Aur and Lam measurements can be found at [34,36].
Avignon (Avi) EC site is located northwest of Montpellier (southeast France), at 32 m of elevation. Avi is ruled by a Mediterranean climate type (Csa), with a MAAT of 14 • C and a MAP of 677 mm. Avi includes a wide range of crops types (depending on the season) such as peas, wheat and sorghum crops, and bare soil covers between crop seasons. Ts was calculated with upwelling longwave radiation data. A detailed technical description of the Avi can be found at [37]. Remote Sens. 2022, 14, x FOR PEER REVIEW  Figure 2a). Aur T recorded by IRTS-P infrared sensors (6 to 14 m wavelength domain, manufactur Campbell Scientific Inc.) and a CS10X (manufactured by Campbell) sensor, placed ferent depths from 4 to 100 cm under the surface. Lam Ts was calculated with upw longwave radiation data and an infrared temperature sensor CS-109 (manufactur Campbell), with measurements at different depths ranging from 1 to 100 cm und ground. Further technical details of Aur and Lam measurements can be found at [3 Avignon (Avi) EC site is located northwest of Montpellier (southeast France) m of elevation. Avi is ruled by a Mediterranean climate type (Csa), with a MAAT of and a MAP of 677 mm. Avi includes a wide range of crops types (depending on th son) such as peas, wheat and sorghum crops, and bare soil covers between crop se Ts was calculated with upwelling longwave radiation data. A detailed technical de tion of the Avi can be found at [37].
Kairouan (Kai) is located in the Kairouan plain, in the South of Tunis (Tunisia m of elevation. Kai is a crop site under the influence of a Csa climate type, with a M of 20 °C and a MAP of 287 mm. Ts was measured with an infrared sensor (IR120; 8   [34], and references therein. Haouz (Hao), is located in the Tunift basin, 45 km east of Marrakech (Morocco) at 450 m of elevation. Hao is under the influence of the BSk climate type, with a MAAT of 20 • C and a MAP of 150 mm, in an irrigated land of wheat crops. Ts was measured with a nadir-looking infrared radiometer (IRTS-P; 8 to 14 µm wavelength domain, manufactured by Campbell Scientific Inc.) at 2 m height. A detailed description of Hao is found at [38,39].
Furthermore, three EC towers were installed near Niamey (Niger) in a desert area under a BWh climate influence, with a MAAT of ca. 30 • C and a MAP of ca. 300 mm. Agofou (Ago) site is located in a millet zone, whereas in Wankama (Wan) one EC is located on the savannah, and another one is placed in a zone of wheat crops. Ts was recorded with an incidence thermal infrared thermometer (KT15; 8 to 14 m wavelength domain, manufactured by Heitronics) at 2.9 m height. Further details of the Niger EC sites can be found at [40,41].
The dataset has been gathered through the joint effort done between the international laboratories composed by (i) the Service National d'Observation (

Estimation of Vegetation Indexes
Reflectance and biophysical variables (NDVI and Fv) were not available for the whole dataset. In this case, satellite-borne RS data could be biased, since the spatial resolution values of the data are the averages values for a given area, which does not correspond with the field-scale measurements recorded at each EC site. Hence, LAI values recorded with hemispherical photographs were converted to NDVI values following the method introduced by Clevers [42], and successfully applied and validated for all crop types [43]: where k is the calibrated extinction, set to 1.13 [43]. NDVI ∞ is the NDVI value of the fully developed canopy, whereas NDVI soil is the NDVI of bare soils. According to previous works, the NDVI ∞ was set to 0.97 [43]. We tested different NDVI soil values, ranging from 0 to 0.2, but no significant G/Rn differences were found. Therefore the NDVI soil was set to 0.1 [30].
The Fv was calculated with the semi-empirical method introduced by Roujean [44], where Fv is performed with the following exponential function of LAI: where b, F(o) and I are constants, corresponding to −0.945, 0.5 and 1, respectively.

Empirical Equations and Machine Learning Algorithms
The empirical equations included in this work are divided into Type I and Type II. Type I gathers the empirical equations that estimate G with only vegetation indexes. This includes E1, E2 and E3 (based on NDVI), E4 and E5 (Fv), and finally E6 (LAI). Type II includes E7, which estimates G with Ts and LAI, and E8 which estimates G with NDVI, Ts and α. A detailed description of the empirical equations can be found in Table 2. Regarding the ML algorithms, NN I and RF I (ML I) are trained with LAI data, providing a fair comparison with Type I equations. Neural network II and RF II (ML II) are trained with LAI and Ts. Last, NN III and RF III (ML III) are trained with LAI, Ts and Rn. Table 2. Empirical equations and ML algorithms used for the estimation of the G. The NN algorithm creates a distributed system of neurons with non-linear functions and several layers interconnected for the prediction of one variable, in this case, G. In this work we tuned the two principal hyper-values of the NN, named size and decay ( Table 2). The parametrization of the NN size ranged from 10 to 30, with increments of 5; while the parametrization of the decay ranged between 0.001 and 0.2, with increments of 0.05. An accurate description of the mathematical formulation of the NN can be found at [46,48].

Random Forest (RF)
The RF method is a non-parametric method based on decision trees and bootstrap aggregation [47]. A group of trees is developed using multiple decision trees trained from random subsets of the dataset. In this work, we applied The RF regression is calculated as follows: The tunning of the regression RF has been analyzed at [49]. In their study, they conclude that the number of decision trees randomly selected in the model, the subset of training samples, together with the variables split at each tree node, prevents the algorithm from overfitting. Two hyper-values have been tuned of the RF. The number of predictors evaluated in each node (mtry), and the number of trees (ntree) of a random sample. We performed a sensitivity analysis of the RF hyper-values ( Table 2), but no significant differences were found in the accuracy metrics (Table S1). In this work, the mtry was set as 1 and the ntree was equal to 800.
We used a grid search and 10-fold cross-validation to validate the hyper-values of both of the ML algorithms. We applied the RF with the randomForest package [50] and the NN with the Caret package [51], both of them of R Studio [52].

Evaluation
In order to avoid the evaluation of the methods over the same training data, the initial dataset was randomly split into two groups. The training dataset (conforming the 70% of the dataset) and the testing dataset (which constitutes the remaining 30%). In the training dataset, the empirical equations have been calibrated and the ML algorithms tuned. Table 2 shows the different ML hyper-values tested and the values set, as well as the calibrated constant values of the empirical equations. Subsequently, the trained ML algorithms and the calibrated empirical equations have been applied in the independent and testing dataset. Hence, the results obtained with the calibrated empirical equations and the trained ML algorithms are comparable, given that the evaluation and calibration of all the methods have been performed under the same conditions. We also evaluated the accuracy and performance of each method by Fv ranges, from 0 to 1, in increments of 0.25. The accuracy and performance of the models have been evaluated using four types of statistical metrics: the residual (Equation (5)), the mean absolute error (MAE; Equation (6)), the Root Mean Square Error (RMSE; Equation (7)), the coefficient of determination (R 2 ; Equation (8)) and Willmott's D (D; Equation (9)). Residual = yi −ŷ where N is the number of the samples yi is the observed value,ŷ is the estimated and y represents the average value of the estimated values. The residual is the difference between yi andŷ. The MAE and RMSE summarise the mean differences between the predicted values and the observed values. Low RMSE and MAE values are inversely related to the high accuracy of the method proposed. The RMSE is sensitive to high values or outliers [53] and can be used as an indicator of the magnitude of extreme errors. On the contrary, high values of D and R 2 are related to the high performance of the methods.

Results and Discussion
In the first section we determine which are the meteorological and vegetation drivers of G. Subsequently, we analyze the observed and estimated values of the heat flux. Then, we provide a detailed analysis of the accuracy and performance metrics of all the methods, together with an evaluation of the methods by Fv ranges. The results are discussed and compared with previous works. Finally, we elaborate on the reason for the good accuracy and performance metrics obtained with ML.

Characterization and Drivers of G
The G values expected by each equation with the same sample data are shown in the theoretical plot presented in Figure 2b. Both of the empirical equation's types, based only on vegetation indexes (Type I); as well as on vegetation indexes, ancillary remote sensing and meteorological data (Type II) estimate an overall decrease of the G depending on Fv. In bare soils and sparse vegetation areas (Fv ≤ 0.25), the equations suggest that G supposes ≥ 20 W/m 2 of the SEB, providing evidence that G is not a negligible SEB component, in these land cover conditions [31]. For mid to high vegetation ranges (Fv > 0.25), the E1, E2, E4, E5, E6 and E7 show a parallel shape for the G estimation, and hence similar values can be expected. The highest difference between equations is found between E3 and E8. Whilst the equations have a curve shape for Fv > 0.4, for bare soils and sparse vegetation areas E8 assumes a constant G value, and E3 estimates G as an exponential function of Fv (Figure 2b).   Figure 3 shows evidence that the different drivers of G depend on the Fv rang bare soils and sparse vegetation areas, the best parameter for estimating G is Ts (R 2 = followed by Rn (R 2 = 0.30). Over densely vegetated areas and high canopies (Fv ≥ the minimum G values are observed (ca. 5 W/m 2 ;). On many days, the G can be neg or be assumed to be practically equal to 0 [19]. This is because high levels of vege cover thermally insulate the surface, significantly reducing the solar radiation that r the ground [54], and buffering temperature gradients in areas with significant rates [31]. Given that the G variability under densely vegetated areas and high canopies mainly ruled by Rn and Ts, the incorporation of these variables as a predictor vari redundant.
The observed and estimated G values by the different empirical equations an algorithms are shown in Figures 3 and 4. The observed mean value of G ranged fr to 18 W/m 2 . In bare soils and sparse vegetation areas, where the majority of the found (Figure 2a), the highest G values are measured in accordance with previous  Figure 3 shows evidence that the different drivers of G depend on the Fv range. For bare soils and sparse vegetation areas, the best parameter for estimating G is Ts (R 2 = 0.48), followed by Rn (R 2 = 0.30). Over densely vegetated areas and high canopies (Fv ≥ 0.50), the minimum G values are observed (ca. 5 W/m 2 ). On many days, the G can be neglected or be assumed to be practically equal to 0 [19]. This is because high levels of vegetation cover thermally insulate the surface, significantly reducing the solar radiation that reaches the ground [54], and buffering temperature gradients in areas with significant rates of ET [31]. Given that the G variability under densely vegetated areas and high canopies is not mainly ruled by Rn and Ts, the incorporation of these variables as a predictor variable is redundant. Remote Sens. 2022, 14, x FOR PEER REVIEW 1  The observed and estimated G values by the different empirical equations and ML algorithms are shown in Figures 3 and 4. The observed mean value of G ranged from 59 to 18 W/m 2 . In bare soils and sparse vegetation areas, where the majority of the data is found (Figure 2a), the highest G values are measured in accordance with previous works [21]. In these land types, characteristics of arid and semiarid climates zones, the highest values of G are generally observed during the warmest months of the year [31]. The G rules a significant part of the SEB hourly variability, with instantaneous G values ranging from 5-10% up to 50% of the Rn [30,55,56].

Analysis of the Accuracy and Performance of the Different Methods Used for the Estimation of G
The calibration of the constant values slightly improved the Type I equation's accuracy, but negligible differences are found for Type II equations ( Figure 5). Regarding the ML models, no significant differences in the accuracy metrics have been observed between the training and the testing dataset, providing evidence of the lack of overfitting. In addition, negligible differences have been found between the different hyper-values tested during the ML tunning process (Table S1).
The majority of the empirical equations have shown a systematic underestimation of G, specially E3 for Fv ≥ 0.75. The highest bias is observed with the empirical equations that conform to Type I. The residual term (the difference between the estimated and the observed values) of Type I increases with Fv, except for E4 and E6. On the other hand, Type II overestimates G for bare soils and sparse vegetation areas (5 W/m 2 ) and underestimates the heat flux for the other Fv ranges. The residual term for Type I is larger in densely vegetated areas and high canopies covers (>5 W/m 2 ) than in bare soils and sparsely vegetated areas, and therefore the data suggest that Type I equations are not able to reproduce the decrease of G depending on Fv. This is because Type I equations determine a constant value for all the Fv ranges. In contrast, other empirical equations such as E7, use a threshold as a function of LAI, determining two empirical equations for the estimation of G (Table 2) and in the end providing better results.

Analysis of the Accuracy and Performance of the Different Methods Used for the Estimati of G
The calibration of the constant values slightly improved the Type I equation's ac racy, but negligible differences are found for Type II equations ( Figure 5). Regarding ML models, no significant differences in the accuracy metrics have been observed betw the training and the testing dataset, providing evidence of the lack of overfitting. In ad tion, negligible differences have been found between the different hyper-values tes during the ML tunning process (Table S1). The majority of the empirical equations have shown a systematic underestimation of G, specially E3 for Fv ≥ 0.75. The highest bias is observed with the empirical equations that conform to Type I. The residual term (the difference between the estimated and the observed values) of Type I increases with Fv, except for E4 and E6. On the other hand, The estimation of G with ML has shown lower residual differences than those obtained with the empirical equations ( Figure 6). In fact, the residual term with ML is practically the same (avg. ca. 5 W/m 2 ) for the majority of Fv ranges. In addition, no systematic biases are estimated with ML II and ML III. Furthermore, ML II and III are able to reproduce the highest (percentile 90th) G values (Figure 3). The validation metrics by Fv ranges are shown in Table 3 and Figures 7 and 8. The results provide evidence that the estimation of G based on Type I empirical equations leads to high uncertainties in comparison with the ML methods. The error metrics for the majority of the Type I empirical equations are practically the same under bare soils and scarce vegetated areas (MAE = 36 and RMSE = 51 W/m 2 ). Nevertheless, the highest differences between Type I equations are observed in densely vegetated areas. The accuracies of E1, E2 and E3 are worse than the group constituted by E4, E5 and E6 (avg. MAE 24 vs. 18 W/m 2 ), confirming the E1, E2 and E3 systematic underestimation of the G. Type II improve the estimation of G to some degree. The accuracy of E8 (MAE = 28, RMSE = 39 W/m 2 ) is slightly better than the E7 (MAE = 28, RMSE = 37 W/m 2 ), but two Type I equations (E4 and E6) outperformed all the empirical equations for the estimation of G at dense and high canopies covers. Therefore, these results suggest that the estimation of G is more accurate with Fv data (E4 and E5) and LAI (E6) than with NDVI (E1, E2 and E3). The results presented in this article are in accordance with previous studies that estimated the G based on vegetation indexes [10,19,31]. The lack of difference between E7 and E8 suggests that the inclusion of α in the empirical formulations that already include Ts does not substantially improve the estimation of G. This is probably explained because of the collinearity existent between Ts and α. between E7 and E8 suggests that the inclusion of α in the empirical formulations that already include Ts does not substantially improve the estimation of G. This is probably explained because of the collinearity existent between Ts and α.     ML I, which only includes LAI as a predictor variable, shows similar accuracy metrics as Type I and hence a low performance for the estimation of G. Machine learning II, which includes LAI and Ts as predictor variables, outperformed all the empirical equations. Moreover, ML III, including LAI, Ts and Rn, significantly improved all the estimations of G and showed the best accuracy and performance (Figures 6-8). In particular, RF II shows slightly better results than NN II. Yet, the difference between algorithms increases when Rn is included in the regression models (Table 3). Except for E4, in bare soils and sparse vegetation areas, Type I error is almost double (MAE = 39.8 W/m 2 ) than that measured with the RF III (MAE = 23.6 W/m 2 ). Furthermore, the performances of RF III by Fv ranges are more constant than the other methods. The minimum errors are observed in bare soils and sparse vegetation areas (D = 0.7 and R 2 = 0.5), which could be explained because the majority of the data is found in this Fv range. In densely vegetated areas, the RF III performed also better than all the other methods. For instance, at Fv > 0.5, the R 2 values of RF III are 0.3, whereas the other methods rarely reach an R 2 of 0.1 ( Figure 8). However, NN III and RF III do not significantly improve the large uncertainty in the highest vegetation range (Fv ≥ 0.75). For NN III, the D and R 2 values are only 0.3 and 0.1, respectively. Random forest III shows a slightly better performance, with a D and an R 2 of 0.3 and 0.2, respectively.
Uncertainties in the SEB fluxes quantification can be related to measurement bias, which corresponds with ca. 10% for G, and ca. 5% for Rn [21]. However, all the methods have been evaluated under the same conditions, and hence the superiority of RF II and III is related to the algorithm architecture. The regression-based ML methods have shown promising results over the last years for the prediction of continuous variables, and have been successfully applied in a wide range of scientific disciplines, including climate and hydrological modelling [57]. For example, ML provides better results than Multiple Linear Regression for the spatial interpolation of air temperature [58]. Machine learning also has shown optimal results for the estimation of environmental variables, such as H and LE retrievals [59], other SEB components such as Rn and H [60,61] or ET [62][63][64]. For the estimation of G, Canelón et al. [32] modelled the heat flux with ANN and RS data, and de Andrade et al. [33] compared the ANN and two empirical equations based on NDVI, Ts and α. In accordance with our findings, in both of the works, ML outperformed the other methods for the estimation of G. The underestimation of the empirical equations and the large errors found at high Fv ranges are partly because the majority of Type I and II equations rely on a single linear relationship with the predictor variables. The main strengths of the NN and RF are that they are very flexible, adaptable to the shifting G values depending on the vegetation and meteorological conditions, and do not follow a stationary and linear assumption. The bootstrap aggregating (bagging) algorithms, such as RF, creates several models by selecting a random subset of the variable for each decision tree. Moreover, the inclusion of Rn in RF III improved the G estimation, showing evidence that the recursive model partitioning of RF is resistant to overlaps in the covariates, and can handle correlated variables such as Rn and Ts. In addition, RF is a non-parametric model, to overcome noise and outliers [49]. On the other hand, the limitation of the ML algorithms is that generally reproduce the data of the training dataset, and could not extrapolate the values outside of the training. The estimation of G with ML also requires more computational time than the empirical equations. The training time required to train the RF algorithms was equal to 5, 8 and 11h for RF I, II and III, respectively. On the contrary, the training time of the NN was 26, 28 and 31 for NN I, II and III, respectively. Nonetheless, once the ML algorithms are trained, they are computationally efficient. In this case, the prediction time required was lower than one minute. Finally, further research is needed in order to improve the energy balance closure and the estimation of G in high canopies covers. If the accuracy between ML methods shows small differences, the introduction of other predictor variables should be tested. Further works should focus to combine ML with physical modelling [62].

Conclusions
Anthropogenic climate change is leading to changes in the water and energy fluxes of some areas of the planet, such as the tropical and mid-latitude regions. Therefore, accurate hydrological estimation is crucial in order to better quantify water balances, and support socioecological decision-making. This work compared different methods for the estimation of a SEB component with large uncertainties, the G. For the first time, we evaluated an ensemble ML method (the RF), against NN and several empirical equations that have been extendedly used for decades for the estimation of G with RS data. The evaluation was performed using data from EC sites, with a wide range of vegetation (i.e., forests, crops of wheat or millet) and climate (from the marine west coast to desert) types, along the mid-latitude area found between continental Europe up to the middle of the African continent.
The data have shown that the driver of G depends on the Fv range. In bare soils and sparse vegetation areas, Rn followed by Ts rules the estimation of G, decreasing with Fv. On the contrary, meteorological and vegetation data have not shown a relevant and statistically significant link with the estimation of G in densely vegetated areas. Neural network and RF models, with LAI and Ts as predictor variables, outperform the empirical equations for the estimation of the G, independent of the Fv range considered. The inclusion of Rn in the model significantly improved the estimation of G. Despite the computational time required (8 h), RF III have shown the highest R 2 , D as well as the lowest MAE and RMSE. The accuracy values of the empirical equations are almost double when compared to the measured values with RF III. Further works should analyze the estimation of G over large areas with the predictor variables used here, in combination with other predictor variables, for instance, downscaled soil moisture data. Also, further research could test the combination of ML with physical-based models for the estimation of G.