1. Introduction
Eutrophication, an important manifestation of water pollution in freshwater reservoirs, results from the excessive nutrient loads that cause algal blooms [
1,
2]. The excessive algae and aquatic plant growth result in oxygen depletion, ultimately causing severe impacts on aquatic ecosystems and considerably increasing the costs related to water treatment [
3,
4,
5,
6].
Chlorophyll-a (Chla), a photosynthetic pigment present in major algae groups, is widely used as a key indicator of phytoplankton presence [
7,
8]. Because the abundance of algae can reflect the state of eutrophication, Chla is one of the most important parameters for evaluating the trophic condition of water bodies [
9,
10]. While Chla concentration has long served as a key parameter in monitoring harmful blooms, accurately predicting Chla in reservoirs has proven to be a persistent challenge [
11,
12]. This difficulty is mainly due to the non-linear and non-stationary characteristics of Chla concentration, which are influenced by anthropogenic and hydrometeorological factors [
2].
In regions characterized by semi-arid climates, such as the Northeastern Brazil, establishing an extensive network of multi-purpose artificial reservoirs has emerged as a reliable solution to confront the water scarcity challenges imposed by environmental constraints [
13,
14]. These reservoirs are notably susceptible to eutrophication due to hydroclimatic characteristics that combined favor photosynthesis and biodegradation [
15,
16], such as interannual variability of precipitation and stored volume [
17], high temperatures and evaporation rates [
18], and prolonged hydraulic retention time [
19]. Moreover, this susceptibility is further aggravated by continued anthropogenic pressure on water bodies due to internal enrichment from aquaculture practice [
3,
20], inadequate coverage of sanitation systems [
21], and a dense reservoir network [
22,
23].
Therefore, regularly monitoring Chla concentrations is crucial for implementing effective water quality management strategies to prevent further deterioration [
24]. However, traditional sampling methods are expensive, time-consuming, and impractical for many reservoirs [
25,
26]. As an attractive option, satellite remote sensing and machine learning (ML) techniques offer a cost-effective approach for monitoring Chla concentrations and their spatiotemporal variations, providing data of complex environmental systems on larger scales [
27]. The Sentinel-2 constellation has been proved to be a valuable asset in monitoring inland and coastal waters, once it has improved spatial resolution compared to other freely available sensing systems data, such as Landsat 8 [
28,
29].
The ML approach excels in retrieving complex non-linear relationships within satellite data, capturing the underlying structure bonding the satellite data and the desired target variable [
30,
31]. Combining ML architectures with remote sensing data have been able to provide top-notch results in a plethora of scientific fields, such as solar irradiance forecasting [
32], mapping of mineral extraction sites [
33], forest fire mapping [
34], and crop water stress evaluation [
35]. ML and satellite data performance has also been explored for Chla monitoring. A random forest (RF) based model was developed in [
36]. In their study, the authors used Sentinel-2 imagery to retrieve Chla concentrations for lake Chagan, in China. Their proposed model provided good performance in determining the Chla concentrations, also complying to the biological mechanism in lakes, offering robust results to seasonal changes.
In Cao et al. [
9], the authors used Landsat-8 remote sensing data together with extreme gradient boosting tree model (XGBoost) ML to determine Chla in lakes located in China. Their approach was implemented to analyze the spatiotemporal data from 2013 to 2018, and it demonstrated satisfactory performance in identifying the Chla behavior in the study location. In Hu et al. [
37], the authors developed methodologies to mitigate spectral noise in satellite data to improve the performance of ML models to estimate Chla in global oceans using remote sensing from several satellites. Their results proved that the support vector regression (SVR) was the best-performing ML approach, surpassing the traditional band-ratio models and providing reduced image noise.
The Group Method of Data Handling (GMDH) has also been applied to hydrological scenarios, including Chla estimation [
38], water quality prediction [
39], and image classification for plant diseases [
40]. However, there exists a gap in the knowledge regarding the usage of this approach on modeling Chla concentration using satellite data, which the present study aims to fulfill. Additionally, the present study seeks to provide a deep insight into the performance of ML paradigms when applied to a vast area containing heterogeneous reservoirs.
Given this scenario, the potential to increase algae blooms and further degradation of these aquatic systems has increased the need to study often poorly monitored reservoirs in semi-arid regions. Therefore, this study aims to estimate Chla concentration through combined techniques of remote sensing and machine learning. For this, the following specific objectives were pursued in this research:
A comprehensive investigation on several input parameters for Chla modeling, including all the 13 bands of the MSI onboard of Sentinel-2 constellation and 16 different spectral indexes.
A comprehensive analysis and characterization of all the 149 tropical reservoirs extensively spread across the state of Ceará, located in the Brazilian semi-arid region.
The usage of stepwise approach on parameters selection.
The investigation of different machine learning paradigms for modeling Chla values.
To fulfill the gap in the knowledge regarding the usage of the GMDH ML model for Chla modeling using remote sensed data and spectral indexes.
To assess the performance of several machine learning paradigms applied to a vast region containing heterogeneous reservoirs.
4. Discussion of the results
Mainly in tropical regions, the mechanisms that regulate phytoplankton growth require an advanced analysis due to the complexity and nonlinearity of the relationship between chlorophyll and physicochemical/environmental factors [
13,
136]. The approach presented in this study achieves this analysis using a highly heterogeneous collection of in-situ observations, investigating the performance of different ML models to determine Chla levels in 149 dams in the Brazilian state of Ceará. Although most dams were predominantly eutrophic throughout the years, different human activities and pollution sources have contributed to eutrophication processes lead to Chla spatiotemporal fluctuations and algal blooms [
17,
20,
137,
138].
We applied the forward stepwise approach to select the parameters, including bands and spectral indexes. To the best of our knowledge, this method has not yet been applied in previous remote sensing studies. The forward stepwise approach showed significant improvement for the Chla determination for the ML models. Existing literature regarding the influence of spectral bands for Chla determination can be found. In work [
139], the authors applied the SHAP analysis to explain the influence of the Sentinel-2 spectral bands for estimating Chla values. Their results elucidated that band 3, band 2, and band 8 were the top-three most influential parameters for Chla determination. Bands 3 and 2 exhibited a positive correlation with the Chla, while band 8 showed a relevant negative correlation with the same parameter. A similar approach was conducted by Kim et al. [
140]. In their work, the SHAP analysis showed relevant participation on Chla prediction of red bands, i.e., bands 4, 5 and 6, as well as blue and green bands (
Table 1), a similar conclusion found by Ha et al. [
141].
The influence of different spectral indexes for Chla modeling was investigated in previous works. The work of Castro et al. [
142] showed that indexes merging red and NIR bands yielded the best outcomes for the determination of Chla concentrations in small reservoirs. Similar conclusions were drawn by Buma and Lee [
143] and by Aubriot et al. [
144], which attested the importance of bands within the red spectral range for Chla characterization in a lake in Chad, and for the Rio de La Plata, respectively. On the other hand, Viso-Vazquez et al. [
145], showed that the green band, i.e., band 3, promoted the highest correlation between the remote sensing data and Chla levels.
Our results show the GMDH methodology could efficiently identify the latent non-linear ties ruling input and output attributes. Moreover, the GMDH approach proved to be a more robust model for analyzing satellite imagery, being more resilient to noises and satisfactory results when handling different bands information, which also can be attributed to its improved performance.
To better understand where the GMDH results lie within the literature, we compared our results with those found in previously published works. Such evaluation, however, may not be completely representative given the different methodologies used in different studies, the different models, input attributes, and the study area. Yet, a comparison between their evaluation metrics is still a viable approach to assess the performance of different models [
146].
Table 4 compiles the results for the GMDH model, while
Table 5 gathers the results found in the literature.
From
Table 5, we observe that the models based on the deep learning methodology all performed remarkably well in determining Chla levels. Compared to the results found in our study, it is possible to notice that the R-squared values for references [
147] and [
148] are in the same range, over 90%. Nevertheless, regarding the work by Guo et al. [
147], the combined utilization of Sentinel-2 and Landsat-8 data significantly enhances the machine learning model’s performance. This improvement is primarily due to the reduction in the revisit time, which subsequently minimizes the variance in the dataset. Consequently, this leads to a more robust and accurate machine-learning model. It is important to note that, although their results are slightly superior, the study location is limited to only one place.
The addition of more water bodies is expected to add variance to the dataset. The model in reference [
149] was built using modeled data, including 11 times more water bodies than the previous studies. Their results proved to be slightly inferior to the other two previous works, and more relatable to the ones found in this present assessment, with RMSE in the same magnitude order.
Regarding the R
2, our GMDH model showed a value nearly 10% superior. However, it must be stated that since the data used for training the ML model was simulated, it may not consider several natural occurring situations. This would lead to a more homogeneous dataset, with less variance, improving the ML model performance compared to the model proposed in our study. Another significant difference is the time window used for testing the developed model in reference [
149], which is significantly smaller than the one used in our model [
152], which reduces the dataset variance improving the ML performance.
The DL approaches presented in
Table 5 still require a huge amount of data to yield reliable outputs. This characteristic may pose a major drawback depending on the data availability for the study location. On the other hand, the GMDH paradigm can be promptly implemented with less information available, and not requiring extensive training dataset [
38], deeming them a more flexible usage for different situations.
References [
150] and [
151] applied SVR to determine Chla. It is noticeable that although the models are the same, their methodology was disparate. Regarding reference [
150], the RMSE values were within the same order of magnitude. Besides that, the R
2 values were relatively close to each other. Contrarywise, reference [
151] performed a much broader study regarding several lakes spread across the Chinese territory. As previously mentioned, the inclusion of different lakes allows the forecasting model to better generalize its results, and thus provide more robust outcomes. Another critical difference between these two works is that the former implemented as input information just the spectral bands of the MSI onboard of Sentinel-2, while the latter used both bands and indexes. In this aspect, the present study achieved superior performance, attesting the GMDH model benefited from the inclusion of spectral indexes, which improved their Chla outcomes.
The work conducted by Aranha et al. [
13] shares the same location as the one used in this study. However, they used only a subset of 5 reservoirs of 149. In their approach, the authors fitted a regression line to their dataset using the three-band spectral index (3BSI), showing a good agreement between the index and the Chla values. A similar methodology was implemented in reference [
145], where the Toming’s index was used to fit a regression line for Chla values. These two studies implemented spectral bands to estimate Chla concentrations. By evaluating the R
2 metric values for [
13] and [
145], the proposed GMDH was superior over both studies with significant improvements of 12% and 5%, respectively. Furthermore, a major difference between these two studies and the methodology presented in this work, is the data handling. The other studies used methodologies which consisted of fitting a regression line using the proposed indexes over the entire dataset, making no distinction between training and testing datasets. This is analogous to assessing our ML model’s performance considering the training dataset only. Therefore, their methodologies lack generalization, being bound to a particular time and geographic location. In addition, while these approaches are considerably less complex than the proposed ML model, they provide valuable insights into Chla’s behavior when analyzed using the evaluated indexes.
It is important to disclose that our study was conducted in a tropical semi-arid region, considering 149 dams, which is considerably larger than any of the presented literature. In addition, a broader investigation of the spectral indexes for the Sentinel-2 constellation was assessed compared to the references in
Table 5. This allows us to gather further insight into the impact of the spatiotemporal influence of both bands and indexes on the Chla prediction. From the results for parameters selection in
Figure 4 and
Figure 5, we observe that bands 8 and 11 do not compose the selected bands set. However, they are still present in the form of indexes. Moreover, careful assessment of the GMDH results (available at the Supplementary Spreadsheet 1) showcase relevant contribution of bands 3, 4, 5, 7, 8 and 11 to the Chla prediction. This indicates that the predictive models benefit from a higher spatial resolution, as well as from green, red and infra-red bands composing the indexes, accordingly with the literature [
13,
29,
59].
5. Conclusions
In this study, several input parameters were evaluated for Chla modeling in a vast spatial coverage of 149 freshwater reservoirs spanning over a semi-arid tropical region across the Brazilian state of Ceará. This evaluation was conducted based on satellite remotely sensed data and a set of ground-truth Chla concentrations measurements that reflected the temporal and spatial distribution, notably impacted by interannual rainfall variability. To this end, we investigated the performance of several ML approaches, namely the k-nearest neighbours, random forests, extreme gradient boosting, the least absolute shrinkage and selection operator, support vector machine, and GMDH.
Forward stepwise parameter selection was implemented to determine the best input attribute selection among the 13 bands from the Sentinel-2 constellation and 16 spectral indexes derived from such bands. The usage of this methodology applied to the remote sensing field is, to the best knowledge of the authors, new and proved to improve the outcomes of the investigated Chla forecasting models.
Figure 7 elucidates the performance of GMDH model, showing that the model under investigation yielded excelent values for the Chla concentrations, achieving R
2 value greater than 90%.
It is important to emphasize that our methodology ensures a clear separation between training and testing data, unlike some studies found in the literature. The approach is substantial for applying ML models, as a way to promote more robust results and capacity to generalize unseen data. This approach is especially crucial for a large dataset such as the one used in the present work, comprising 149 reservoirs in a semi-arid region similar in size to England, where the spatiotemporal variation of the data is substantial. Regarding the spectral bands and indexes, we showed that the Chla modeling benefited mostly from red, NIR, and green bands.
Extensive comparison with the literature found studies showed that the used models in this work offer competitive results. It is important to note that such comparison may be misleading due to the disparate methodologies and study locations, which directly influence the ML outcomes.
In future works, the addition of more indexes, as well as the merge of Sentinel-2 data with Landsat-8 could be investigated. Including more spectral indexes and the Landsat-8 MODIS data would provide more spatiotemporal information and reduce data variance due to the finer temporal resolution. Furthermore, the implementation of atmosphere correction preprocessing could also be beneficial to the predictive paradigms, as it would reduce data noise, diminish the error variance, and improve the forecasting of Chla concentration.
Author Contributions
Conceptualization, I.E.L.N. and P.A.C.R.; methodology, I.E.L.N., P.A.C.R., F.A.S.F., J.V.G.T., and B.G.; software, P.A.C.R.; validation, I.E.L.N., P.A.C.R., F.A.S.F., J.V.G.T., and B.G.; formal analysis, I.E.L.N. and P.A.C.R.; investigation, I.E.L.N., P.A.C.R., F.A.S.F., J.V.G.T., and B.G.; resources, I.E.L.N. and P.A.C.R.; data curation, I.E.L.N. and P.A.C.R.; writing—original draft preparation, B.M.D.M.G., V.O.S., I.E.L.N. and P.A.C.R.; writing—review and editing, B.M.D.M.G., V.O.S., I.E.L.N., F.A.S.F., J.V.G.T., and B.G.; visualization, B.M.D.M.G., V.O.S., I.E.L.N. and P.A.C.R.; supervision, I.E.L.N., P.A.C.R., J.V.G.T. and B.G.; project administration, I.E.L.N., P.A.C.R., J.V.G.T. and B.G.; funding acquisition, B.G. and J.V.G.T. All authors have read and agreed to the published version of the manuscript.