Prediction of the Concentration of Dissolved Oxygen in Running Water by Employing a Random Forest Machine Learning Technique

Abstract: Dissolved oxygen (DO) is a key indicator of the ecological health of rivers. Modeling DO is a major challenge because of the complex interactions among its underlying processes. Given the vital importance of DO in water bodies, its accurate prediction is critical for ecosystem management. Given the intricacy of current process-based water quality models, a data-driven model could be an effective alternative. In this study, a random forest machine learning technique is employed to predict the DO level by identifying its major drivers. Time-series of half-hourly water quality data, spanning from 2007 to 2019, for the South Branch Potomac River near Springfield, WV, are obtained from the United States Geological Survey database. Key drivers are identified, and models are formulated for different scenarios of input variables. The model is calibrated for each input scenario using 80% of the data. Water temperature and pH are found to be the most influential predictors of DO. However, satisfactory model performance is achieved by considering water temperature, pH, and specific conductance as input variables. The model is validated by predicting DO concentrations for the remaining 20% of the data. A comparison with the traditional multiple linear regression method shows that the random forest model performs significantly better. The study insights are, therefore, expected to be useful for estimating stream/river DO levels at various sites with a minimum number of predictors and to help build a sturdy framework for ecosystem health management across an environmental gradient.


Introduction
Surface water quality is deteriorating globally due to high pollutant loads [1]. Overall, 40 percent of rivers have become polluted to varying extents [2]. Pollution has become a major concern because of its increasing trend. It also poses serious threats to ecological integrity, including degraded health of water bodies (i.e., habitat instability for aquatic life), aesthetic nuisance (i.e., algal blooms), hypoxia (i.e., dissolved oxygen < 2.0 mg/L), and so on [3]. Therefore, maintaining and protecting healthy streams is a priority for a sustainable ecosystem. Dissolved oxygen (DO) is considered one of the most important indicators used to evaluate the biological health of rivers [4,5]. It is highly desirable to maintain a minimum DO level (e.g., ~5.0 mg/L) for the survival of diverse aquatic life [6,7]. Understanding DO dynamics and predicting DO effectively are, therefore, of critical importance for the design and operation of a sustainable ecosystem.
The main sources of in-stream DO include atmospheric air-water interaction (i.e., reaeration) and aquatic plant photosynthesis [4]. Denitrification and external inputs also act as DO sources. DO sinks include decomposition of carbonaceous organic matter, nitrification, and aquatic respiration [8]. The identification and quantification of these processes are challenging for researchers and water resource managers. Further, the coupling of natural factors and human-induced influences (e.g., agricultural activities, urban sprawl) adds another level of complexity to these interactions [9].
[...] applied in numerous water resources studies. For example, SVM was employed to predict DO from five water quality variables in the Terengganu River, Malaysia [26]. Further, SVM was applied to predict DO concentration in a hypoxic river in southeastern China [27]. In a very recent study, the least squares SVM was applied to predict DO from four water quality variables at three different USGS stations [28].
Neural networks were the most frequently mentioned data-driven technique, alongside other machine learning algorithms, in DO prediction. Despite the availability of different data-driven methods, random forests remain underused and often underestimated in water resources research, especially for water quality prediction, for reasons that are unclear. However, random forests have been successfully implemented in several studies for hydrological prediction. For instance, Wang et al. [29] used a random forest to assess flood hazard risk in the Dongjiang River Basin, China. Spring discharge was forecasted using a random forest in the Umbria region, Italy [30]. Random forests were implemented to estimate evapotranspiration in separate studies by Granata [31] and Granata et al. [32]. In a different study, a groundwater potential map was produced using a random forest by Naghibi et al. [33]. The random forest technique has also been routinely applied in a variety of industrial and scientific research areas. For example, protein-protein interactions were predicted using a random decision forest framework [34]. A random forest regression was employed to predict soil surface texture in a semiarid region [35]. Further, in separate studies, random forests were adopted in agricultural production systems, i.e., prediction of crop yield [36], and in building energy prediction [37]. These examples underscore the wide application of the random forest technique across disciplines. In addition, a random forest model can extract the most influential drivers of a response variable from myriad interacting variables, which can eventually lead to a parsimonious predictive model. This advantage further motivated the application of this method in the present study.
This study aims to identify the key drivers of DO in a river, to assess the performance of the random forest machine learning algorithm in predicting DO levels for various combinations of input variables, and to select the best model. One part of the data was randomly selected for model development (i.e., model calibration) and the remaining part for model testing (i.e., model validation). To the best of our knowledge, the application of this machine learning algorithm to predicting half-hourly DO concentrations is quite new. A brief discussion of this technique is given in the method section (Section 2.4).

Study area and data collection
Historical time-series of half-hourly DO data and associated water quality variables (see Table 1) were obtained from the United States Geological Survey (USGS) database for the South Branch Potomac River near Springfield, WV. The natural drainage area of this study site is approximately 3,784 km² and the site is located at an elevation of 171.2 m above mean sea level. The contributing catchment area of this station is vegetation dominated (i.e., ~82%, mainly deciduous forest). The area is still free from urban sprawl (i.e., developed area = ~3.5%), indicating that the selected river is naturally influenced. However, a small amount of agricultural activity persists (i.e., ~14%). Other land-use activities and/or the presence of open water bodies are comparatively negligible. The different land-use types were estimated using recently published NLCD data in ArcGIS 10.6. The NLCD is a national land use/land cover data set of the conterminous USA and can easily be retrieved from the USGS website [39]. The site was selected to examine the impact of water quality drivers on overall ecological health in terms of dissolved oxygen, given that the river is less affected by urbanization/agricultural activities and that long-term data are available.
The data matrix for the main analysis was formed based on data quality and scientific relevance. Only data approved for publication were considered, while provisional data subject to revision were discarded. Any record with a missing value for an individual predictor was removed along with the other co-measured variables in order to obtain a complete data matrix. This noise removal process limited the final data set to ~200 thousand data points.
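The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' actual processing code: the record layout, the field names, the "A"/"P" approval flags, and all sample values are assumptions introduced for the example.

```python
# Hypothetical record layout: one dict per half-hourly observation.
# Field names, approval flags, and values are illustrative assumptions.
VARIABLES = ["DO", "Tw", "Q", "SC", "pH", "TUR", "GH"]


def build_data_matrix(rows, approved_flag="A"):
    """Keep approved rows in which every co-measured variable is present."""
    complete = []
    for row in rows:
        if row.get("status") != approved_flag:
            continue  # discard provisional data subject to revision
        if any(row.get(name) is None for name in VARIABLES):
            continue  # one missing value removes the whole record
        complete.append([row[name] for name in VARIABLES])
    return complete


# Synthetic sample rows (not measured data).
rows = [
    {"status": "A", "DO": 9.1, "Tw": 12.0, "Q": 850.0, "SC": 210.0,
     "pH": 8.1, "TUR": 2.3, "GH": 3.4},
    {"status": "P", "DO": 8.7, "Tw": 13.5, "Q": 790.0, "SC": 215.0,
     "pH": 8.0, "TUR": 1.9, "GH": 3.3},
    {"status": "A", "DO": 8.9, "Tw": None, "Q": 800.0, "SC": 212.0,
     "pH": 8.2, "TUR": 2.0, "GH": 3.3},
]
matrix = build_data_matrix(rows)
```

Only the first synthetic row survives here: the second is provisional and the third lacks a water temperature value.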
The Box-Cox transformation is given by
$$X = \frac{Y^{\alpha} - 1}{\alpha}, \quad \alpha \neq 0$$
where Y is the original data; α (alpha) is the transformation parameter that maximizes the log-likelihood function; and X is the transformed value.
Special condition when α = 0:
$$X = \ln(Y)$$
Since the Box-Cox method works only on positive data, the water temperature was converted to the Kelvin scale to avoid negative values. Further, zero turbidity values were replaced by a small positive quantity close to zero. The alpha values for Tw, Q, SC, pH, turbidity, and gage height were computed as 0.55, -0.012, 0.38, -1.33, 0.012, and 0.46, respectively.
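As an illustration, the transformation can be sketched in a few lines, assuming the standard Box-Cox form X = (Y^α - 1)/α for α ≠ 0 and X = ln(Y) for α = 0. The function names and the sample temperatures are hypothetical. The alpha value is taken from the text rather than re-estimated by maximum likelihood.

```python
import math


def box_cox(values, alpha):
    """Box-Cox transform of strictly positive values for a given alpha."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if alpha == 0:
        return [math.log(v) for v in values]  # special case alpha = 0
    return [(v ** alpha - 1.0) / alpha for v in values]


def celsius_to_kelvin(temps_c):
    """Shift temperatures to the Kelvin scale so that all values are positive."""
    return [t + 273.15 for t in temps_c]


# Synthetic water temperatures in deg C (not measured data).
tw_kelvin = celsius_to_kelvin([-0.5, 0.0, 14.2, 27.8])
tw_transformed = box_cox(tw_kelvin, 0.55)  # alpha reported for Tw in the text
```

The transform is monotonic for positive data, so the ordering of the temperatures is preserved.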
Data standardization was also performed using the Z-score method:
$$Z_i = \frac{x_i - x_m}{S_d}$$
where Zi is the standardized value; xi is the observed value; and xm and Sd are, respectively, the mean and standard deviation of a predictor variable. The approximately normal distribution of each predictor variable was checked visually using histograms (Figure 2).
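A minimal sketch of the Z-score step is given below. The use of the sample standard deviation is an assumption, since the text does not state whether the sample or population form was used. The pH values are synthetic.

```python
import statistics


def z_scores(values):
    """Standardize values as (x_i - mean) / standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]


# Synthetic pH readings (not measured data).
z = z_scores([7.6, 7.9, 8.1, 8.4, 8.0])
```

After standardization, the values have zero mean and unit standard deviation, which puts predictors with different units on a common scale.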

Correlation coefficients
The correlation matrix was formed by estimating the Pearson correlation coefficients. The matrix provides background information on the relationships among the variables. The cell values of the correlation matrix represent linear correspondences by measuring the strength and direction of the relationship between two variables. Information on possible multicollinearity (i.e., mutual correlations between predictor variables) can also be obtained by examining the cross-correlations. The correlation coefficients were tested for statistical significance at the 95% confidence level (p value < 0.05). The correlation matrix was formed in the transformed domain by using Box-Cox transformed and standardized data.
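For illustration, the Pearson coefficient and a small correlation matrix can be computed as sketched below. The column names follow the abbreviations used in the text, but the values are synthetic. The significance test (p value) is omitted from this sketch.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return {name: {name: r}} for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}


# Synthetic example: DO falls as water temperature rises.
columns = {
    "Tw": [5.0, 10.0, 15.0, 20.0, 25.0],
    "DO": [12.8, 11.3, 10.1, 9.1, 8.3],
}
matrix = correlation_matrix(columns)
```

In this synthetic case the Tw-DO entry is strongly negative, mirroring the direction of the relationship reported for the study site.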

Important variable selection
Predictor variables that are mutually correlated may introduce bias into model performance [45]. Further, multiple parameter sets might lead to equifinality in a model [46], meaning that different combinations of predictors can provide a comparably good fit between modeled and observed data. It is therefore critical to select an optimal number of mechanistically meaningful predictor variables for data-driven model building [47]. This is a delicate process, and the random removal of predictors might result in biased estimates. Therefore, before identifying the limited number of key predictors, the random forest model was fitted with all available variables and a large number of trees. Then a variable importance metric (VIM) was computed as
$$VIS_k = \frac{\sum_{i=1}^{n} \sum_{j=1}^{t} DG_{kij}}{\sum_{k=1}^{m} \sum_{i=1}^{n} \sum_{j=1}^{t} DG_{kij}}$$
where m, n, and t are, respectively, the total number of attributes, decision trees, and nodes. DGkij is the Gini decrease value (see Section 2.4) of the j th node in the i th tree that belongs to the k th attribute.
VISk is the variable importance score of the k th attribute. This metric therefore unravels the relative significance (i.e., predictive capability) of each predictor in modeling the behavior of the response variable.
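The importance score can be sketched as follows, assuming it is the summed Gini decrease of an attribute over all nodes and trees, normalized by the total over all attributes. This normalized form is an assumption consistent with the symbol definitions in the text. The nested-list layout and the decrease values are hypothetical.

```python
def variable_importance_scores(gini_decreases):
    """Normalize summed Gini decreases so that the scores add up to 1.

    gini_decreases maps attribute k -> list over trees i -> list of the
    decreases DG_kij at the nodes j of that tree that split on k.
    """
    totals = {k: sum(sum(tree) for tree in trees)
              for k, trees in gini_decreases.items()}
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("no impurity decrease recorded")
    return {k: total / grand_total for k, total in totals.items()}


# Hypothetical decreases for three attributes in a two-tree forest.
decreases = {
    "Tw": [[0.30, 0.10], [0.25, 0.15]],
    "pH": [[0.05], [0.05, 0.02]],
    "SC": [[0.03], [0.05]],
}
scores = variable_importance_scores(decreases)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting the scores in descending order gives the kind of predictor ranking that is later used to define the input scenarios.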

Random forests regression model
The random forest is a widely used supervised machine learning technique for both regression and classification problems. As the target variable (i.e., model output) is continuous, the regression algorithm was applied. It aggregates the predictions of numerous decision trees (the number of trees is usually user-defined). The trained model is determined by minimizing prediction errors. Once the prediction error is optimally minimized, the model is considered the best-trained model. The total error is minimized using Eq. (4) for each node of each decision tree:
$$GINI(P) = \sum_{i} p_i (1 - p_i) = 1 - \sum_{i} p_i^2 \tag{4}$$
where GINI(P) is the impurity index of a particular node expressed as a probability; pi is the probability of an attribute in each node that belongs to a particular sub-sample set. At each data split, the probability is estimated, and the minimum GINI represents the best estimate as it contains the least impurity. The overall flow chart of the random forest regression is given to provide a basic understanding of how the algorithm functions (Figure 3).
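The split-selection idea can be sketched as below, assuming the common form of the Gini index, 1 - Σ pi². The class counts and split names are hypothetical. Many regression implementations use a variance-based criterion instead. The sketch follows the Gini description given in the text.

```python
def gini_impurity(probabilities):
    """Gini impurity 1 - sum(p_i^2) of a node, given class probabilities."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return 1.0 - sum(p * p for p in probabilities)


def weighted_split_impurity(left_counts, right_counts):
    """Size-weighted Gini impurity of a two-way split, given class counts."""
    def node(counts):
        total = sum(counts)
        return total, gini_impurity([c / total for c in counts])

    n_left, g_left = node(left_counts)
    n_right, g_right = node(right_counts)
    return (n_left * g_left + n_right * g_right) / (n_left + n_right)


# Two hypothetical candidate splits, each as (left counts, right counts).
candidates = {
    "split_a": ([8, 2], [1, 9]),
    "split_b": ([5, 5], [4, 6]),
}
best = min(candidates,
           key=lambda name: weighted_split_impurity(*candidates[name]))
```

The candidate with the lowest weighted impurity is retained, which corresponds to choosing the split with the least impurity at each node.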

Multiple linear regression (MLR) model
In order to evaluate the performance of the random forest model, the traditional MLR model was also developed for comparison.

Data partitioning
It is important to note that model development using the training data set and model testing using a completely separate data set followed the 80/20 data partition as a rule of thumb [51]. In total, 80% of the filtered data (~160 thousand data points) were randomly selected for model calibration (i.e., model training), and the remaining 20% (~40 thousand data points) were used for model validation (i.e., model testing) to evaluate the model performance.
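A random 80/20 partition can be sketched as follows. The function name, the fixed seed, and the 200-record stand-in data set are assumptions for illustration. The text does not describe how the random selection was implemented.

```python
import random


def split_80_20(records, train_fraction=0.8, seed=42):
    """Randomly partition records into calibration and validation subsets."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = int(round(train_fraction * len(records)))
    train = [records[i] for i in indices[:cut]]
    test = [records[i] for i in indices[cut:]]
    return train, test


# Stand-in data set of 200 record identifiers.
records = list(range(200))
train, test = split_80_20(records)
```

Every record lands in exactly one subset, so the validation data are never seen during calibration.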

Model performance assessment
In the present study, the model performance was evaluated using two performance indices: i) the coefficient of determination (R²) and ii) the ratio of the root-mean-square error (RMSE) to the standard deviation (SD) of observations (RSR). They are estimated as follows, Eq. (5a-5c):
$$R^2 = \frac{\left[\sum_{i=1}^{N} (Y_{i,o} - Y_{o,mean})(Y_{i,m} - Y_{m,mean})\right]^2}{\sum_{i=1}^{N} (Y_{i,o} - Y_{o,mean})^2 \sum_{i=1}^{N} (Y_{i,m} - Y_{m,mean})^2} \tag{5a}$$
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_{i,o} - Y_{i,m})^2} \tag{5b}$$
$$RSR = \frac{RMSE}{SD} = \frac{\sqrt{\sum_{i=1}^{N} (Y_{i,o} - Y_{i,m})^2}}{\sqrt{\sum_{i=1}^{N} (Y_{i,o} - Y_{o,mean})^2}} \tag{5c}$$
where N is the number of data points, Yi,o is the observed response, Yi,m is the modeled/predicted response, and Yo,mean and Ym,mean are the mean values of, respectively, Yi,o and Yi,m.
The statistical measure R² is a widely used score metric that is often useful in model performance evaluation. It indicates the degree of correlation between predicted and observed values. In addition, it denotes the predictive power of a model. The value of R² ranges from 0 to 1, with a higher value representing a better fit. Published guidelines provide a basis for the assessment of model calibration and validation and suggest ranges of RSR values for accepting or rejecting a model [52]. A 'perfect to very good model' has an RSR value from 0 to 0.50, a 'good model' has an RSR value from 0.50 to 0.60, an RSR value between 0.60 and 0.70 refers to a 'satisfactory model', and a model with RSR > 0.70 represents an 'unacceptable model'.
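The performance indices can be sketched as below. R² is taken as the squared Pearson correlation between observed and modeled values, and RSR as RMSE divided by the population standard deviation of the observations. The assignment of boundary values (0.50, 0.60, 0.70) to the lower class is an assumption, since the stated ranges share their end points. The observed and modeled values are synthetic.

```python
import math


def r_squared(obs, mod):
    """Squared Pearson correlation between observed and modeled values."""
    n = len(obs)
    mo, mm = sum(obs) / n, sum(mod) / n
    cov = sum((o - mo) * (m - mm) for o, m in zip(obs, mod))
    so = sum((o - mo) ** 2 for o in obs)
    sm = sum((m - mm) ** 2 for m in mod)
    return cov * cov / (so * sm)


def rmse(obs, mod):
    """Root-mean-square error between observed and modeled values."""
    return math.sqrt(sum((o - m) ** 2 for o, m in zip(obs, mod)) / len(obs))


def rsr(obs, mod):
    """RMSE divided by the standard deviation of the observations."""
    mo = sum(obs) / len(obs)
    num = math.sqrt(sum((o - m) ** 2 for o, m in zip(obs, mod)))
    den = math.sqrt(sum((o - mo) ** 2 for o in obs))
    return num / den


def rsr_rating(value):
    """Map an RSR value to the qualitative classes quoted in the text."""
    if value <= 0.50:
        return "perfect to very good"
    if value <= 0.60:
        return "good"
    if value <= 0.70:
        return "satisfactory"
    return "unacceptable"


# Synthetic observed and modeled DO values in mg/L (not measured data).
obs = [8.0, 9.0, 10.0, 11.0, 12.0]
mod = [8.2, 8.9, 10.1, 10.8, 12.1]
rating = rsr_rating(rsr(obs, mod))
```

Because RSR normalizes the error by the spread of the observations, it can be compared across variables and sites with different units or ranges.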

Correlations of DO with predictors
The linear correspondences of the response variable DO with the different predictors were computed using Pearson correlation coefficients r (obtained from the transformed and standardized data) (Table 2). The correlation matrix showed that DO was strongly correlated with Tw (r = -0.90) and relatively weakly correlated with SC (r = -0.38). These correlations were statistically significant at the 95% confidence level (p value < 0.05). However, the other predictors (Q, pH, TUR, and GH) showed very weak correlations with DO (r = -0.06 to 0.16) that were not statistically significant (p value > 0.05). Further, the mutual correlations among the predictor variables indicated the presence of moderate to strong multicollinearity in the data matrix. For example, Q

Important predictors based on VIMs
The variable importance score of each predictor was estimated and presented in a vertical bar chart (Figure 4). Water temperature was identified as the top-ranked (i.e., the most influential) predictor of DO, while turbidity was found at the bottom of the ranking (i.e., the least influential). In comparison with turbidity, water temperature and pH had approximately 62 and 5 times stronger influence on DO dynamics, respectively. SC and Q each had approximately 2 times stronger control over DO than turbidity. However, SC demonstrated a slightly stronger influence than Q (i.e., SC was 1.03 times stronger than Q). Further, turbidity and gage height appeared to have a similar influence on model output. This led to the selection of 6 different combinations of predictors that were separately used in random forest model building, as follows: i) Tw only; ii) Tw and pH only; iii) Tw, pH, and SC only; iv) Tw, pH, SC, and Q only; v) Tw, pH, SC, Q, and GH only; and vi) all predictors. Each scenario was selected according to the rank of each variable, in descending order.
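The construction of the six nested scenarios from the ranking can be sketched as follows. The list order follows the ranking stated in the text. The function name is hypothetical.

```python
# Predictors in descending order of importance, as reported in the text.
ranked_predictors = ["Tw", "pH", "SC", "Q", "GH", "TUR"]


def nested_scenarios(ranked):
    """Build scenarios by adding one predictor at a time in rank order."""
    return [ranked[:k] for k in range(1, len(ranked) + 1)]


scenarios = nested_scenarios(ranked_predictors)
for number, predictors in enumerate(scenarios, start=1):
    print(f"Scenario {number}: {', '.join(predictors)}")
```

Adding predictors strictly in rank order makes it possible to see how much each additional variable contributes to the explained variance.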

Random forest model calibration and validation
Models for different combinations of predictors were developed using 80% of the data. Each model was then evaluated with the remaining 20% of the data. The calibration and validation results (Table 3 and Figure 5) for the various combinations are described sequentially. When the model was calibrated considering Tw only, the model statistics R² and RMSE were estimated as 0.851 and 0.892, respectively (Table 3). The corresponding validated R² and RMSE of the model output were estimated as 0.852 and 0.892, respectively (Table 3, Figure 5a). The calibrated model R² and RMSE were estimated as 0.923 and 0.643, respectively, when the model was constructed considering Tw and pH. The corresponding validated R² and RMSE of the model output were estimated as 0.912 and 0.688, respectively (Figure 5b). The calibrated R² and RMSE of the response were estimated as 0.998 and 0.030, respectively, when Tw, pH, SC, and Q were selected as predictor variables in model formulation. The corresponding validated R² and RMSE of the predicted response were estimated as 0.976 and 0.357, respectively (Figure 5d).
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 19 April 2020 doi:10.20944/preprints202004.0342.v1
The RSR for different combinations (both in calibration and validation) varied within a range between 0.18 and 0.29, suggesting very good to perfect models.
It is worth noting that the inclusion of four predictors (Tw, pH, SC, and Q) in calibrating the model explained a maximum of 99.8% of the data variance. Therefore, the other variables were not taken into account in model development. However, the three predictors Tw, pH, and SC showed the optimal performance in model development by explaining 98.7% of the data variance. The comparison between the DO used for validation and the DO calculated from the models also demonstrated that Tw, pH, and SC performed optimally in prediction (Figure 6).

Performance of the linear regression models
The performance of the random forest models was further compared with that of the MLR models. The MLR models were developed using the training data, and the regression weights were estimated as given in Eq. (6a-6d), where DOcal = calibrated DO value (mg/L); Tw = water temperature (°C); pH = water pH; SC = specific conductance (μS/cm); and Q = river discharge (ft³/s). The addition of other predictor variables did not significantly improve the model performance. Satisfactory performance was achieved by considering only three predictors (Tw, pH, and SC).
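For illustration, regression weights of an MLR model can be estimated as sketched below. Ordinary least squares via the normal equations is assumed here, since the fitting procedure is not stated in the text. The data are synthetic and generated from known weights, and the coefficients are not those of Eq. (6a-6d).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_mlr(features, target):
    """Return [intercept, w1, w2, ...] minimizing the squared error."""
    design = [[1.0] + list(row) for row in features]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict_mlr(weights, row):
    """Evaluate the fitted linear model for one row of predictors."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Synthetic (Tw, pH) rows with DO generated from known weights.
features = [[5.0, 7.8], [10.0, 8.2], [15.0, 7.9], [20.0, 8.4], [25.0, 8.0]]
target = [14.0 - 0.2 * tw + 0.1 * ph for tw, ph in features]
weights = fit_mlr(features, target)
```

The singular-system check reflects the sensitivity of linear models to strongly cross-correlated predictors: when predictors are nearly collinear, the normal equations become ill-conditioned and the weights unstable.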

Discussion
The present study effectively identified the major drivers of DO by estimating the relative influence of each attribute employing the Gini decrease index [48]. This information would be potentially useful for water quality managers in future research and crucial decision making, especially in priority setting in pollution control of streams and rivers.
The comprehensive analyses across all input scenarios suggest that the random forest method is a more powerful data-driven tool for estimating DO concentrations than the conventional MLR method, as shown by the model statistics. All random forest input combinations showed improved performance (both in model formulation and testing) compared to the MLR method. Although the MLR models also showed good calibration and validation accuracy and explained up to ~86% of the total data variance, possible cross-correlations (Table 2) among the predictor variables made the MLR technique unattractive because of unstable estimates [53] and high condition numbers (i.e., ill-conditioning of linear models).
The study results are consistent with other studies. For example, Heddam [54] demonstrated in a separate study that water temperature and pH can predict DO levels with a sufficient level of confidence. The results and insights from this study can also offer practical benefits. The established models could be replicated in similar regions to predict real-time DO levels with less input data. Although four predictors (Tw, pH, SC, and Q) explained the maximum data variance, it is recommended to use three predictors (Tw, pH, and SC), which are relatively easy to measure in the field, to estimate DO levels. This further indicates the parsimony of input variables. The inclusion of Q in model building increased the explained data variance by only 1.1%. Moreover, it is often difficult to obtain continuous measurements of discharge because of the high uncertainty (i.e., sudden floods) associated with it, especially for larger rivers; therefore, Q can be omitted. Further, Tw and pH alone can be used to efficiently predict DO levels (the models explained ~91-92% of the data variance), particularly in small streams where DO is not properly monitored. Even with Tw only, the model (which explained ~85% of the data variance) can be applied to monitor DO levels with an acceptable level of confidence. At unmonitored sites, the water temperature can be estimated from the available air temperature [55]. This would be particularly useful when a river/stream is inaccessible due to complex geography or when the labor cost of taking measurements is high.
A well-designed river water quality monitoring program is a prerequisite for keeping track of DO levels in streams in order to protect valuable aquatic life and its essential habitat [56]. While the purposes of water monitoring are generally well defined, the evaluation of water quality is often problematic because of the many input parameters involved [57,58]. The considerable cost reduction resulting from the parsimony of required field data would help improve water resources management by enabling prompt decisions.
Despite the potential implications, the developed models require further careful testing in various locations with newer sets of data. The models are empirical and are based on data between 2007 and 2019 obtained from the USGS data repository system. It is likely that, under certain circumstances, the measured values were not reported exactly because of measurement errors and/or personnel oversight. The data also contain missing values at times, especially in 2015-2016, and were filtered following an outlier removal technique that might have introduced errors into the results. Moreover, the study area was dominated by vegetative land cover (i.e., mainly deciduous forest). The predominance of other land use activities (i.e., agriculture-dominated or developed areas) could lead to different findings. Subject to the availability of more predictor variables, across diverse environments and land use gradients, with no missing data, the present model should be further calibrated and tested to assess its overall local or global robustness. Nevertheless, this simple data-driven model would be of great help in developing a strategic framework for managing ecosystems across the U.S. that require immediate attention, as mandated by the U.S. Clean Water Act [59], and in different parts of the world.

Conclusions
A random forest machine learning technique has been successfully applied to predict DO concentrations of the South Branch Potomac River near Springfield, WV. The variable importance metric resulted in 6 distinct combinations of predictors that were separately employed in the development of random forests, as follows: i) Tw only; ii) Tw and pH only; iii) Tw, pH, and SC only; iv) Tw, pH, SC, and Q only; v) Tw, pH, SC, Q, and GH only; and vi) all predictors. The model was trained and validated using, respectively, 80% and 20% of the data. The model statistics (R² and RSR) were estimated in each case, and the first four input combinations were adequate for model building.
Results showed that Tw and pH can efficiently predict DO concentration, explaining 91-92% of the data variance across the calibration and validation phases. The best model performance in the calibration stage (i.e., a maximum of 99.8% of the data variance explained) can be achieved by considering four predictors, Tw, pH, SC, and Q, as input, in decreasing order of rank. However, the recommended input variables for model formulation are Tw, pH, and SC, given their satisfactory performance (i.e., 98.7% of the data variance explained). The estimated RSR further suggested very good to perfect (i.e., high efficiency) models. In contrast, traditional MLR models performed with less accuracy, both in model building and testing, explaining only 76-86% of the data variance across different input scenarios. Further, potential cross-correlations among predictor variables and high condition numbers indicated potential bias in the traditional MLR model estimates. Therefore, the random forest model emerged as a powerful data-driven tool for estimating DO concentrations that requires less input data but demonstrates improved performance with a higher level of accuracy.
Although the models developed using a random forest performed better in predicting DO levels than the MLR method, there is still room for further assessment. The models need to be further calibrated and validated with a new data matrix with no missing data, encompassing more predictor variables across diverse climates and management gradients (i.e., catchments dominated by various land covers). However, the developed models can still guide future research. DO concentrations could be predicted with a minimal number of input variables. This would be particularly beneficial for streams/rivers where water quality monitoring is not properly maintained due to geographical access constraints or high labor cost. The potential of parsimonious model development, with a significant cost reduction in data collection and processing, is therefore expected to guide water resource managers in taking strategic and prompt measures towards achieving a healthy ecosystem across the continental U.S., as mandated by the Clean Water Act, and beyond.