1. Introduction
With ever-increasing demand, surface water has become the most crucial resource for communities [1,2,3,4]. From industrial use and electricity generation to agriculture and drinking water, surface water needs to be available in adequate quantity and quality [5,6,7]. Among all surface water bodies, Stream-Water (SW) is considered the most important source, providing countless benefits to human beings [8,9,10]. Streams not only convey drinking water to communities and irrigation water for agriculture, but also wash away wastes, provide habitat for wildlife, and supply hydroelectricity. They are also often used for recreational purposes, e.g., fishing, swimming, and boating [11,12]. The SW quantity variables selected in this study are discharge and water level, which strongly influence overbank flooding in the surrounding area, water supply demand, and fluvial ecology. Aquatic life is significantly affected by the seasonal dynamics of stream discharge and water level [13,14,15,16,17]. Surface water quality standards are established by examining surface water quality measures such as dissolved oxygen, pH, turbidity, toxic substances, and aquatic macroinvertebrate life [18,19]. According to the New Jersey Department of Environmental Protection (NJDEP) Integrated Water Quality Report (2012), at least one designated use is classified as Not Supporting (NS) in every sub-watershed in Trenton [20]. The SW quality parameters selected in this study are temperature, Dissolved Oxygen (DO), pH, turbidity, and Specific Conductance (SC). Temperature, being the most important ecological factor, is directly correlated with the physical, chemical, and biological properties of water [21,22,23,24]. DO is essential for aquatic life to survive, with oxygen concentration tolerances differing among species and life stages [25]. The pH and SC have a substantial impact on other metrics of overall water quality, both constructively and adversely. According to previous studies, their positive correlation with nitrate ions, ammonia, phosphorus, calcium, and magnesium, or even the detrimental influence of high pH on exotic species invasions, could induce disruptions in natural ecosystems [26,27,28]. Turbidity is a measure of the relative clarity of water, which is reduced by suspended or dissolved particles; high values can significantly reduce the aesthetic quality of streams and influence the natural migrations of species [29]. Assessment of turbidity improves the evaluation and indication of fecal contamination in water bodies, e.g., by Escherichia coli, the most common water infection [30,31].
Traditional physics-based numerical models (e.g., HEC-RAS, MIKE) discretize the entire computational domain in space and time to compute SW variables, which requires high computational effort [32,33]. Various numerical schemes (e.g., Finite Volume, Finite Element, Finite Difference) are used to solve the governing partial differential equations, i.e., the Navier-Stokes equations coupled with the conservation of mass equation. The cost of spatial and temporal discretization increases exponentially with the required resolution and accuracy [34,35,36]. Input data for physics-based river models consist of a significant amount of morphological, operational, and measured data. Data preprocessing for physics-based models can be daunting, depending on the spatial and temporal tags of the target variables. These models also require measurable and empirical parameters to estimate the target variables. Retrieving these parameters and constants requires a substantial amount of laboratory-based experimentation and calibration, which makes the models costly to apply in practice at varying scales [37].
Data-informed predictive models provide an efficient alternative for forecasting and monitoring both SW flow and quality parameters, because predictions rely only on observed data rather than on the many environmental factors required by physics-based models. They reduce computational effort while simplifying complicated systems, and predict outcomes from observations alone without any physics-based equations [38,39,40,41]. Recently, Deep Learning (DL), an advanced sub-field of artificial intelligence, has become a popular choice for predictive modelling in water resource management [42,43]. However, traditional deep neural network algorithms, e.g., the Multilayer Perceptron (MLP), cannot learn from sequential data because they do not store previous information, which constrains their prediction capability for long-term time series, e.g., the temporal distribution of the water table depth [44,45]. MLP algorithms need complex procedures in the data pre-processing stage to perform well in predicting the target variables [46,47,48]. While comprehensive data pre-processing can bolster the ability of an MLP model to learn from the observed data, subjective user intervention is still necessary, e.g., in selecting the number of reconstructed components [49]. In addition, the pre-processing requires a substantial amount of time, as many reconstructed components need to be calculated [50].
The Long Short-Term Memory (LSTM) network is a special type of neural network that stores extended sequential information in a hidden memory cell for further processing [51,52]. LSTM performs well in processing long-term sequential data, using a network structure specifically designed to carry the temporal linkage of time series data. Water quality and quantity data have not been widely investigated in previous work employing LSTM. Compared with the aforementioned MLP model, the proposed LSTM model requires only a simple data pre-processing method [53]. An LSTM network is recurrent in nature: the connections between units form a directed cycle, so information from earlier time steps is fed back into the network. The model is therefore capable of preserving past information and using it for future prediction. LSTM models have already been used as very advanced models in DL applications, e.g., speech recognition, natural language processing, automatic image captioning, and machine translation [44,54,55]. However, only a few studies have applied Recurrent Neural Networks (RNNs) or LSTMs to forecast multivariate time series data in the field of water resources [56,57,58]. The objective of this research is to untangle the pattern of the temporal distribution and linkage among the aforementioned SW variables and to perform predictive analysis on them using previously observed data. To accomplish this goal, a comprehensive Exploratory Data Analysis (EDA) is conducted to investigate the temporal dynamics of the SW variables, and LSTM prediction is performed to predict future values based on past records. The following sections of the paper present the study location, data source and collection, EDA, LSTM prediction, performance evaluation, and possible future directions.
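The memory mechanism described in this section is commonly written as a set of gate equations. The block below gives the standard textbook LSTM formulation as an illustrative sketch only; the symbols (input x_t, hidden state h_t, cell state c_t, weights W and U, biases b) are assumptions of this example and are not taken from the notation or configuration used in this study.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

Here the forget gate f_t controls how much of the previous cell state c_{t-1} is retained, the input gate i_t controls how much of the candidate \tilde{c}_t is written, and the output gate o_t controls what is exposed as h_t. The additive update of c_t is what allows information to persist over long sequences.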
4. Conclusions
Multivariate prediction of the SW variables in both the water quantity and quality categories at a point location, using observed data, can be highly beneficial to water managers and decision makers in anticipating future flooding, irrigation works, fluvial ecology, and aquatic life. Unlike physics-based numerical models, for which additional terrain data, meteorological data, and human intervention are prerequisites, the proposed approach relies only on previously recorded data of the variables. The LSTM framework for predicting the SW variables can be highly beneficial for nearby communities, where short-term prediction of the daily, weekly, or monthly dynamics of the SW variables plays a critical role. Prediction of water quantity, i.e., discharge and water level, can substantially aid preparation for flood inundation, irrigation work, and water supply demand. Prior knowledge of SW quality can be highly beneficial for managing aquatic life. Because the proposed model uses only previously observed data of the variables, no additional burden of input data or human intervention is required for prediction. Several physics-based numerical modelling approaches have proven inefficient in terms of real-time forecasting and computational efficiency. By contrast, data-informed predictive models are highly efficacious in predicting various SW variables without taking complicated differential equations and assumptions into consideration. The LSTM algorithm is capable of preserving both the short- and long-term patterns of the time series for forecasting. Traditional physics-based numerical modelling tools require assumptions, other correlated variables, and expensive calibration of parameters.
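The reliance on previously recorded values alone can be illustrated by how a multivariate record is typically framed for a sequence model: each training sample pairs a window of past observations with a target some steps ahead. The sketch below is a minimal, assumed framing in plain Python; the function name `make_windows`, the parameters `lookback` and `lead`, and the toy numbers are hypothetical and do not reproduce the study's actual pre-processing.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[Sequence[float]],
    lookback: int,
    lead: int,
    target_col: int = 0,
) -> Tuple[List[List[List[float]]], List[float]]:
    """Frame a multivariate series as (past window, future target) pairs.

    series     : rows are time steps, columns are SW variables
    lookback   : number of past time steps in each input window (hypothetical name)
    lead       : how many steps ahead the target lies (hypothetical name)
    target_col : column index of the variable to predict
    """
    if lookback < 1 or lead < 1:
        raise ValueError("lookback and lead must be positive")
    inputs: List[List[List[float]]] = []
    targets: List[float] = []
    # The last usable window must leave room for a target `lead` steps ahead.
    for start in range(len(series) - lookback - lead + 1):
        end = start + lookback
        inputs.append([list(row) for row in series[start:end]])
        targets.append(series[end + lead - 1][target_col])
    return inputs, targets


# Toy data, made up for illustration: columns are [discharge, water level].
toy = [[100.0, 8.0], [110.0, 8.1], [120.0, 8.2], [130.0, 8.3], [140.0, 8.4]]
X, y = make_windows(toy, lookback=2, lead=1)
```

Increasing `lead` moves the target further into the future and reduces the number of usable samples, which is one simple way to set up experiments at different lead times.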
This study contributes a reproducible template for investigating the uniqueness of the temporal dynamics of SW variables through extensive EDA. Hidden patterns in the distribution of SW variables over seven years of data are discovered using various up-to-date data exploration tools, which is a mandatory requirement for the satisfactory training of the LSTM algorithm. After a successful training step, the LSTM is tuned and optimized through an explicit iterative performance record, and can then be transferred to forecast SW variables in an identical geographical location. The performance of the LSTM algorithm in predicting river discharge illustrates that the algorithm is highly suitable for discharge time series. Several error metrics show promising performance with minimal error. The proposed LSTM configuration is shown to offer satisfactory performance for the SW variables with lead times of up to one week. However, increasing the lead time increases the prediction error, limiting the performance of the LSTM model. Physics-based models are also ill-suited to real-time prediction, whereas the proposed LSTM can easily be coupled with sensors and the cloud to predict the SW variables in real time. Computational time may increase exponentially with the size of the dataset. The principal parameters obtained after the training process with minimum error are the number of neurons, batch size, and number of epochs. The parameters optimized to obtain the best LSTM configuration can be transferred to similar climatic and geographic regions. For instance, if the distributions of the values of the SW variables are identical, e.g., the differences among the PCS values are negligible, the parameters of the trained LSTM model can be transferred and used for predictive analysis in a different location. However, the LSTM model should not be used in an area where the distribution of the feature values through time is dissimilar.
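For reference, the two error metrics named in the figure captions, RMSE and R^2, can be computed from paired observed and predicted values as sketched below. This is a minimal plain-Python illustration using the usual textbook definitions; the function names and the toy values are assumptions of this example, not the study's evaluation code or results.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Toy values, made up for illustration only.
obs = [2.0, 4.0, 6.0, 8.0]
pred = [2.5, 3.5, 6.5, 7.5]
toy_rmse = rmse(obs, pred)
toy_r2 = r_squared(obs, pred)
```

Evaluating both metrics separately for each lead time is one straightforward way to show how prediction error grows as the forecast horizon lengthens.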
Future research should incorporate high-performance computing and cloud-based operations to obtain a smart predictive tool that exploits the revolution in data storage capability and computational efficiency. LSTM models with different configurations should also be applied in different geographical and climatic locations to investigate the transferability of the model.
Figure 1.
Aerial photo of the study location with flow measuring station at the Central Delaware (HUC8 02040105).
Figure 2.
Pipeline of the EDA and LSTM prediction tasks, illustrating how the activities are linked from the data preprocessing steps to the model deployment stage. The steps are further categorized into distinct groups, namely transformer, estimator, and evaluator.
Figure 3.
Bivariate correlation coefficients among the SW variables represented by the correlation heatmap.
Figure 4.
Logarithmic transformation is applied to increase the normality of discharge and water level values.
Figure 5.
Schematic representation of an LSTM architecture.
Figure 6.
Distribution of observed values from the gage records (dashed blue lines) and predicted values from the LSTM model for the SW variables: discharge (a), water level (b), temperature (c), DO (d), pH (e), turbidity (f), and SC (g), with train/test split (orange).
Figure 7.
Improvement of the model prediction capability with the increase in the number of epochs for the train and test sets. The RMSE value is the indicator of model performance.
Figure 8.
Error metrics at various lead times for the LSTM neural network model predicting discharge.
Figure 9.
Model performance is presented using a scatterplot of the standardized observed and predicted discharge values from the LSTM model and a histogram of the distribution of the differences between the observed and predicted values of the SW parameters.
Figure 10.
Change in the error metric, R^2 values, with the increase in the number of epochs, batch size, and neurons.
Table 1.
List of the SW variables used for EDA and predictive analysis with the LSTM model.
| SW Parameters | Unit | Descriptions |
|---|---|---|
| Discharge | ft³/s | Quantity of stream flow |
| Water Level | ft | Stream water height/level at the gage location |
| Temperature | ℃ | Sensor-recorded temperature in ℃ at the gage |
| Dissolved Oxygen (DO) | mg/L | The amount of oxygen dissolved in the SW |
| Turbidity | FNU | Measure of turbidity in Formazin Nephelometric Units (FNU) |
| pH | - | The acidity or alkalinity of a solution on a logarithmic scale |
| Specific Conductance (SC) | μS/cm | Measure of the collective concentration of dissolved ions in solution |
Table 2.
Descriptive Statistics of the SW variables.
| Variable | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| Discharge (ft³/s) | 255066 | 13265.43 | 10657.91 | 2150 | 6240 | 10800 | 16100 | 150000 |
| Water Level (ft) | 255066 | 9.98 | 1.47 | 7.8 | 8.89 | 9.73 | 10.73 | 20.76 |
| Temperature (℃) | 255066 | 13.35 | 4.43 | 0 | 12.02 | 13.58 | 15.01 | 31.30 |
| pH | 255066 | 7.90 | 0.208 | 6.6 | 7.00 | 8.23 | 9.16 | 9.71 |
| SC (μS/cm) | 255066 | 208.19 | 22.23 | 49 | 201.11 | 208.64 | 221.09 | 453 |
| Turbidity (FNU) | 255066 | 6.44 | 6.54 | 0.2 | 5.61 | 6.44 | 7.29 | 469 |
| DO (mg/L) | 255066 | 11.02 | 1.11 | 6 | 11.02 | 11.07 | 12.67 | 16.90 |
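The columns of Table 2 (count, mean, standard deviation, minimum, quartiles, maximum) follow a common descriptive-summary layout. The sketch below shows how such a row could be computed for one variable using only the Python standard library. It assumes the sample standard deviation and linearly interpolated quartiles; the study does not state which conventions were used, and the function name `describe` and the toy values are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def describe(values: Sequence[float]) -> Dict[str, float]:
    """Summarize one variable with the statistics listed in Table 2."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    # method="inclusive" interpolates linearly between data points and
    # treats the sample minimum and maximum as the 0th and 100th percentiles.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


# Toy values, made up for illustration; not taken from the gage record.
toy_values = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = describe(toy_values)
```

Applying the same function to each SW variable in turn would yield one row per variable, matching the layout of the table.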