Retrieval of Non-Optically Active Parameters for Small Scale Urban Waterbodies by a Machine Learning-Based Strategy

Water quality retrieval for small urban waterbodies by remote sensing get used to be difficult due to coarse spatial resolution of the remote sensing imagery. The recently launched Sentinel-2 produces imagery with a spatial resolution of 10 m. It provides an opportunity to solve the problem of retrieving water quality for small waterbodies. Additionally, many water management issues also require fine resolution of imagery, e.g. illegal discharge to an urban waterbody. Since illegal discharges are an important issue for urban water management, chemical oxygen demand (COD), total phosphorous (TP), and total nitrogen (TN) were chosen as the target parameters for water quality retrieval in this study. COD, TP and TN, however, are non-optically active parameters. There were limited studies in the past to retrieve these parameters in comparison with optically active parameters, e.g. Chlorophyll-A etc. This study compared three machine learning models, namely Random Forest (RF), Support Vector Regression (SVR), and Neural Networks (NN), to investigate the opportunity to retrieve the above non-optically active parameters. Results showed that R2 of TP, TN, and COD by NN, RF and SVR were 0.94, 0.88, and 0.86, respectively. The performances of water quality retrieval for these non-optically active parameters were significantly improved by the optimized machine learning models. These models hence solved the problem to use remote sensing data to retrieve these non-optically active water quality parameters and provided a new monitoring strategy for small waterbodies. Water quality mapping obtained by Sentinel-2 imagery provided a full spatial coverage of the water quality characterization for the entire water surface. Compared with water samples collecting and testing, it greatly reduced labor cost, reagents cost, and waste treatment cost. It also may help identify illegal discharges to urban waterbodies. The method developed in this research provides a new practical and efficient water quality monitoring strategy in managing water with consideration of environmental sustainability.


Introduction
In urban areas, waterbodies, such as lakes and reservoirs, may be polluted by illegal discharges of domestic sewage and industrial effluent [1]. Deterioration of water quality may increase human exposure to diseases and harmful chemicals; reduce ecosystem productivity and biodiversity; and damage aquaculture, agriculture and other water-related industries [2,3]. Traditional water quality monitoring methods are primarily based on water samples collecting and testing or automatic in-situ measurements. Both methods are either labor intensive or very costly. In addition, most water samples testing would need reagents for testing and the treating of waste generated by testing is also costly. Although these methods may have high accuracy, individual samples only reflect the water quality at specific sampling points and are limited in characterizing the water quality for the entire waterbody [4][5][6][7]. In many cases, the decision makers would need a full picture of water characteristics over the entire waterbody surface for water quality management. Remote sensing has been used to monitor water quality since the 1970s [8,9]. Compared with traditional methods, remote sensing can provide the full coverage required for dynamic water quality monitoring [10].
When plenty of domestic and agricultural wastewater containing N and P enter waterbodies, excessive nutrient elements lead to aquatic plants and algae blooms [22,31]. Chl-a is one of the most important pigments in phytoplankton, and exists in all eukaryotic algae [32]. Relevant studies show that there is a significant correlation between Chl-a and blue, green, red and infrared bands [33,34]. The increase of TP and TN will inevitably change of optical characteristics of waterbodies. When industrial wastewater containing plenty of organic matter is discharged into waterbodies, insoluble substances directly lead to the increase of turbidity. On the other hand, aerobic microorganisms consume oxygen in water to degrade organic matter. At certain depth the dissolved oxygen might gradually decrease to 0, and result in anaerobic condition. In anaerobic environment, the cyclic state of Fe 3+ from Fe2O3 and Fe(OH)3 in water could be destroyed, and accumulate certain amount of Fe 2+ . Meanwhile, S from sulfate and organic sulfur is reduced to H2S, and the anaerobic environment prevents microorganisms from assimilating H2S to organic sulfur compounds. Unassimilated H2S might react with Fe 2+ to form FeS. The FeS turns the water to black by absorbing on the suspended matter [35]. The increase of COD thus changes the optical characteristics of water. Based on these, it is possible to retrieve TP, TN and COD directly from spectral characteristics.
The most widely used remote sensing data for the existing research are TM/ETM+ and OLI, SeaWiFS, MODIS and MERIS [36][37][38][39]. However, the temporal resolution of TM/ETM+ and OLI data is 16 days, and the spatial resolution of SeaWiFS data, MODIS data and MERIS data is greater than 250 m, resulting in challenges in characterizing the water quality for small urban waterbodies. Hyperspectral imagery contain a large number of continuous spectral data and provide more spectral characteristics for water quality retrieval [34,40,41]. However, space borne hyperspectral data (such as Hyperion and CHRIS) are only experimental data rather than operational at present, and airborne hyperspectral data (such as AVIRIS, CASI and CIS) have very limited spatial coverage with high cost [36].
In summary, non-optically active parameters were not well studied in previous studies, and the most remote sensing data has too coarse spatial or temporal resolution. Therefore, the objective of this research is to retrieve non-optically active parameters for small urban waterbodies using Sentinel-2 data. TP, TN, and COD were selected as the target parameters, since they may help in identifying the sources of domestic sewage and industrial effluent or illegal discharges from industrial effluent or domestic wastewater. The imagery of recently launched Sentinel-2 was selected because it is free of charge with a high spatial resolution of 10 m and a temporal resolution of 5 days. Three machine learning algorithms were introduced in the empirical methods [41][42][43] to investigate the opportunity to retrieve the non-optically active parameters ( Figure 1). This research may help urban water management by providing a more practical and efficient water quality monitoring strategy of non-optically active parameters.

Study area
This study selected a small urban lake with an area of 0.6 km 2 (600 m×1 000 m). The lake is located in an industrial park in the City of Tianjin, China, and provides flood control and ecological balance for its surrounding areas ( Figure 2). The total developed area of the industrial park is 2 km 2 with diversified land uses, including industrial, residential, and commercial areas. The ecosystem health of the lake is directly related to the management of the surrounding area.

Water quality parameters measurement
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 8 April 2020 doi:10.20944/preprints202004.0111.v1 Twenty sampling points were evenly distributed on the water area by the grid method. The density of sampling points was 0.03 sites/km 2 . To provide simultaneous sampling with Sentinel-2, water samples were collected by Yunzhou SE40 unmanned surface vessel (USV) at 10:00 ~ 12:00 on May 20, 2019 and June 9, 2019. The two sampling dates were selected in the early summer when aquatic plants did not grow rapidly and massively, since the aquatic plants covering the lake would impact on the retrieval accuracy. A total of 40 water samples were collected, with sampling depths of 30-50 cm. There was no cloud cover above the lake on the sampling dates. The time between sampling and satellite overpass was less than 4 hours. After being collected, water samples were quickly put into amber glass bottles to avoid sunshine, and sent to the laboratory for testing. The testing method for each parameter studied in this research is listed in Table 1. The measured water quality parameters of 40 sampling points constitute the ground truth data set. The measured water quality distribution for two sampling dates was generated by ArcGIS 10.

Satellite data processing
Since December 3, 2015, ESA has officially provided free download services for Sentinel-2 data to global users. Real-time updated imagery can be downloaded freely from https://scihub.copernicus.eu. All images are acquired by multi-spectral imager (MSI). The images contain thirteen spectral bands ranging from the Visible (VNIR) and Near Infra-Red (NIR) to the Short Wave Infra-Red (SWIR). The spatial resolution of Band 2 (B2, 492. The two remote sensing images obtained in this paper were L1C products, namely atmospheric apparent reflectance products after ortho-rectification and sub-pixel geometric correction. Radiometric calibration and atmospheric correction were completed by sen2cor. Then, the L2A products were obtained, namely the bottom of atmosphere (BOA) reflectance products. Four bands with a spatial resolution of 20 m were resampled to 10 m using the Sentinel Application Platform (SNAP) v6.0, creating eight bands with a spatial resolution of 10 m ( Figure 4). Pixel values corresponding to 40 ground truth points were extracted. The entire data set for machine learning models was composed of 'ground truth data set' and 'pixel value data set'. The water area was extracted by Normalized Difference Water Index (NDWI) in ENVI 5.3 [44].
where XGreen and XNIR are the pixel values of the green band and NIR band, respectively. For Sentinel-2 images, the green band and NIR band are B3 and B8, respectively.

Building of machine learning models
Three machine learning algorithms, namely Random Forest (RF), Support Vector Regression (SVR), and Back-Propagation Neural Networks (NN), were selected to construct regression models between water quality parameters and pixel values, respectively. The entire data set was randomly split into 70% training set and 30% test set. The coefficient of determination R-squared (R 2 ), mean absolute percentage error (MAPE) and root mean square percentage error (RMSPE) were used to evaluate the performance of the models [28].
where i y , i y and ˆi y stand for the measured, mean and estimated water quality parameters, respectively; and n stands for the number of sampling points. Learning curves were used to optimize model parameters. Cross validation was used to calculate the evaluation metrics of model performance [45]. For each cross validation, the whole data set was randomly split into 70% training set and 30% test set. A model was fitted on the training set and evaluated for the test set. The final R 2 , MAPE, RMSPE were the average of 10 cross validations. The principles of the three machine learning algorithms are described in the following subsections.

Random Forest
Random Forest (RF) is an ensemble learning method [46,47]. It uses data to construct multiple models, and integrates the modeling results of all models by voting (classification problem) or taking mean (regression problem), so that the results of the entire model have high accuracy and generalization performance. In the implementation of RF, an equal number of training samples are randomly extracted from all samples, then m sub-feature sets are selected from all input features, and finally, an optimal attribute is selected from the sub-feature set for node splitting.
When solving regression problems, each decision tree is a regression tree, which is divided by the minimum mean square deviation. In other words, for splitting feature A, the corresponding arbitrary splitting point s splits the data set into two data sets (D1 and D2), so that the mean square deviation of the two data sets after splitting is minimum. The formula is as follows: Then, at the two nodes after splitting, the splitting continues according to this principle. The final prediction result is the mean of all decision trees. In this research, RF was implemented by scikitlearn 0.21.3 of Python 3.7, and the main parameters were set as follows by learning curves: SVR maps x in low-dimensional space to a high-dimensional feature space φ(x) with a non-linear function φ, and seeks a linear regression hyperplane in high-dimensional space to solve the nonlinear problems in low-dimensional space [48,49]. Linear functions in high-dimensional feature spaces can be constructed as follows: Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 8 April 2020 doi:10.20944/preprints202004.0111.v1 y = ⟨ ( )⟩ + b (6) where x and y are the input and output. ⟨ ( )⟩ is the inner product of the feature space, and the weight vector and bias constant b can be obtained by minimizing the risk function (7): In function (7)： where is the loss function. C is a pre-set penalty coefficient, which is used to punish errors greater than . is the deviation between the estimated and the measured values. N is the number of sampling points. By introducing slack variables and * , the solution of (7) can be transformed to (9): The constraint condition is as follows: where ( , ) is the kernel function: The most commonly used kernel functions include: linear kernel, polynomial kernel, sigmoid kernel, radial basis function (RBF) kernel, etc. [50]. Using learning curve, RBF was selected as the kernel function in this research: ( , ) = −‖ − ‖ 2 2 ⁄ (13) In this research, SVR is implemented by scikit-learn 0.21.3 of Python 3.7, and the main parameters were set as follows by learning curves:  The number of nodes in the input layer (n), hidden layer (l) and output layer (m) are determined by the sequence of input and output(X,Y). The training process of the model employs the following steps: 1. Initialization Initialize and .
is the connection weight between the jth neuron in the input layer and the ith neuron in the hidden layer. is the connection weight between the kth neuron in hidden layer and the jth neuron in the output layer. The thresholds a and b of hidden layer and output layer are preset, giving the activation function of neurons and learning efficiency .

Hidden layer output calculation
The output H of the hidden layer is calculated according to the input variable X, the connection weight between the input layer and the hidden layer and the threshold a of the hidden layer: 3. Output layer calculation The predicted output O is calculated according to the output H of the hidden layer, the connection weight and the threshold b:

Error calculation
The prediction error e is calculated according to the predicted output O and the expected output Y. If the error meets the requirement, the training will be completed, otherwise step 5 will be repeated.

Updating weights and thresholds
Turn back to step 2 after updating the connection weight , and thresholds a, b according to error e.
The common activation functions in Back-Propagation NN are ReLU function, sigmoid function and tanh function. Among them, ReLU function is the most commonly used [51]. It is a piecewise linear function, and makes up for the gradient disappearance problem of sigmoid function and tanh function. The function is as follows: In this research, NN is implemented in Keras 2.2.4 of Python 3.7. The number of neurons in the input layer was set to six, for there are six input bands of the remote sensing imagery. By comparing the error of training results and analyzing the calculation efficiency, the number of hidden layer was set to four. For TP and TN, the number of neurons in each hidden layer was 300. For COD, the number of neurons in each hidden layer was 200.

Bands selection
The average spectral shape across the concentrations of the three water quality parameters is shown in Figure 5. The average BOA reflectance of the two sampling dates showed significant differences, because of the difference between the average concentrations on the two dates. The average BOA reflectance increased with the increase of the concentration. On each sampling date, average BOA reflectance fluctuated significantly with the concentration. Results indicate that it is feasible to retrieve TP, TN and COD from spectral characteristics. the range of concentrations. It was suggested that the above band composition could be used to retrieve water quality parameters. In order to evaluate whether the above-mentioned band composition was most appropriate, this study used all possible band compositions (a total of 255 band compositions) to retrieve each water quality parameter by multiple linear regression. R 2 was selected to evaluate the model performance ( Figure 7). According to Figure 7, with the increase of band number, the average R 2 increased, and reached the maximum when band number was 6. The average R 2 of 7 band composition was greater than that of 8 band composition. The most important bands were B3, B4, and B5, namely, the green and red band (B5 is the vegetation red edge with a wavelength of 705 nm). The results were consistent with the work of Tan et al. [30]. Spectra from phytoplankton dominated water experienced low reflectance in red (650~750 nm) wavelengths due to the absorption by Chl-a and other pigments. For the sediment dominated spectra, the reflectance values are relatively high in the green (543~578 nm) and red (650~750 nm) wavelengths. Based on the average R 2 , the most appropriate band compositions of TP, TN and COD retrieval were 'B3+B4+B5+B6+B7+B8', 'B3+B4+B5+B6+B7+B8' and 'B2+B3+B5+B6+B7+B8'.

Performance of machine learning models
The entire data set consists of ground truth data set and pixel value data set were split randomly into 70% training set and 30% test set to develop RF, SVR and NN models. The pixel value data set of each water quality parameter used the most appropriate band compositions mentioned in Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 8 April 2020 doi:10.20944/preprints202004.0111.v1 Subsection 3.1. By comparing the estimated water quality parameters from models with the ground truth, the scatter plots in Figure 8 were obtained. The average R 2 , MAPE, and RMSPE of 10 cross validations was used to evaluate the performance of the models. The optimal models of TP, TN and COD were quite different. For TP, the performances of NN was good. R 2 reached 0.94, and the MAPE and RMSPE were 12.43% and 16.80%, respectively. For TN, the performances of RF was good. R 2 reached 0.88, and the MAPE and RMSPE were 18.39% and 29.64%, respectively. For COD, the performances of SVR was good. R 2 reached 0.86, and the MAPE and RMSPE were 12.55% and 18.75%, respectively.
Taking the results of multiple linear regression as a comparison, machine learning significantly improved the retrieval accuracy of different water quality parameters (Table 4).  According to Figure 8, the scatter points of each water quality parameter are obviously concentrated in two regions. Especially for TP and TN, there is almost no overlap between the two regions. This is because the water quality parameters on the two sampling dates are quite different. The magnitudes of TP and TN on June 9, 2019 were significantly higher than those on May 20, 2019, while COD on June 9, 2019 was significantly lower than that on May 20, 2019. The decrease of COD may be due to the active metabolism of microorganisms in water and the accelerated degradation of organic matter with the gradual increase of temperature.
Comparing the measured and estimated values of each water quality parameter, Figure 9 was obtained. The dynamic distribution of measured and estimated values was almost consistent. There was no significant difference between the evaluation errors of training set and of test set. Results indicated that the model performed well on both training set and test set.

Water quality mapping
The pixel values of the entire lake were imported into the optimal model for each water quality parameter, and the water quality parameters of each pixel were estimated. The water quality mapping of the entire lake was obtained by the spatial interpolation tool in ArcGIS 10.4 ( Figure 10). The spatial distributions of TP and TN tended to be consistent, which might be due to the fact that TP and TN were from the same pollution sources. For example, domestic sewage from residential areas was discharged into the lake. TP and TN in the south and west of the lake were higher than those in the north and east of the lake. On May 20, 2019, the high values of TP and TN were distributed in the southwest of the lake, while on June 9, 2019, the areas expanded from the southwest of the lake to the north of the lake. Accordingly, the averages of TP and TN increased from 0. 29  According to the mapping of TP and TN, domestic sewage containing N and P might be continuously discharged into the lake. It was observed from the remote sensing images that there was a piece of farm land with an area about 1.25 km 2 as well as a subdivision of a residential area adjacent to the lake on the southwest. Therefore, the increases of TP and TN in the south and west of the lake were likely related to the application of chemical fertilizer and the discharges of domestic wastewater. The high values of COD were widely distributed in the center of the lake. From May 20, 2019 to June 9, 2019, there was an obvious process of contraction to the east of the lake. According to this change, industrial wastewater or domestic wastewater might be discharged into the east of the lake, which may originate from the nearby residential areas and a pharmaceutical factory in the east of the lake.

Preprints
It could be observed that water quality parameters were not evenly distributed, although the waterbody was fairly small (surface area was 0.6 km 2 and less than 1 km 2 ). The study of water quality retrieval based on high resolution satellite image is therefore crucial for many water management issues, e.g. identifying illegal discharges to urban waterbody, spills on the shore etc.

Assessment of water quality conditions
Comparing the estimated water quality parameters to the environmental quality standards for surface water (National Standard of the People's Republic of China, GB3838-2002), the Figure 11 was obtained. The environmental quality standards for surface water classified each water quality parameter into five levels, i.e., Class I~V from good water quality to poor. According to Figure 11, TP on both May 20, 2019 and June 9, 2019 was worse than Class V. Serious phosphorus pollution existed in the entire lake surface. TN of the lake surface on both May 20, 2019 and June 9, 2019 covered from Class II to V. The average TN on both May 20, 2019 and June 9, 2019 was subject to Class IV. On May 20, 2019, 1.05%, 40.37%, 21.33% and 37.25% of the lake surface was subject to Class II, Class III, Class IV and Class V. On June 9, 2019, the water quality deteriorated. Class III, Class IV and Class V area expanded to 28.63%, 25.65% and 74.35%. Class II area almost disappeared. COD of the lake surface on both May 20, 2019 and June 9, 2019 covered from Class II to V. In part of the lake surface, COD was worse than Class V. The average COD on both May 20, 2019 and June 9, 2019 was subject to Class V. On May 20, 2019, 41.21% of the lake surface was worse than Class V. 4.29%, 9.10%, 19.09% and 26.30% of the lake surface was subject to Class I~II, Class III, Class IV and Class V, respectively. On June 9, 2019, area with COD worse than Class V decreased to 34.49%. Area of Class V and Class II decreased to 22.43% and 3.8%, respectively. Area of Class III and Class IV increased to 11.44% and 27.84%, respectively.

Comparison of retrieval accuracy
Taking R 2 as the comparison standard of retrieval accuracy, this research surveyed most existing publications regarding the retrieval of TP, TN and COD from 2000 to present and compared the performance of the proposed method in this research with previous research works, Figure 12 shows the results of comparison. Since the average R 2 of TP in the existing publications [21,22,[24][25][26][27]29] was 0.653, the performance of TP retrieval was improved by about 44% over the average of the past research works. For TN, the average R 2 in the existing publications [21, [23][24][25][26][27] was 0.613, and the performance of TN retrieval in this research was improved by 43% over the average of the past research works. For COD, the average R 2 in the past research works [17][18][19][20] was 0.722, and the performance of retrieval COD in this research was slightly higher than the upper quartile of the past research works. Based on the above analysis, the performance of water quality retrieval in this research has improved by using a combination of machine learning strategies. As the key nutrient for eutrophication, TP has the highest performance. The machine learning strategy proposed in this study may help in water quality management for urban waterbodies.

Conclusions
This research developed a machine learning based monitoring method for non-optically active water quality parameters of small urban waterbodies based on Sentinel-2 imagery. The main conclusions are as follows: 1. This is the first attempt to retrieve water quality parameters for small waterbodies with surface area less than 1 km 2 .
2. Compared with TM/ETM+, MODIS and other data, Sentinel-2 with high spatiotemporal resolution makes it possible to retrieve the water quality characteristics for small urban waterbodies.
4. The performances of water quality retrieval for these non-optically active parameters were significantly improved by the optimized machine learning models, especially TP and TN.
5. According to the water quality mapping by remote sensing and the interview of the residents in the neighborhood, the pollutants, especially the illegal discharges of industrial effluent, were traced back to the source.