4. Results and Discussion
4.1. Experimental Setup
All experiments were conducted using Google Colab Pro, an online cloud-based platform providing access to high-performance computational resources, including GPU acceleration, which is essential for efficiently training deep learning models on multivariate time-series data. The entire implementation was carried out in Python, ensuring reproducibility and flexibility. The deep learning components, including the Temporal Convolutional Network (TCN) and gated hybrid architecture, were implemented using PyTorch, while Scikit-learn was used for data preprocessing, evaluation metrics, and auxiliary utilities. In addition, a tree-based gradient boosting model (Histogram-based Gradient Boosting) was employed as the baseline forecasting model, enabling a robust hybrid learning framework that combines traditional machine learning with deep neural networks.
The dataset was chronologically divided into training, validation, and testing sets using a 70%–15%–15% split, ensuring that future observations were never used during training, thus preventing data leakage. This temporal split reflects real-world deployment conditions in wastewater monitoring systems, where predictions must be made strictly on unseen future data. The validation set was used exclusively for hyperparameter tuning and early stopping, while the test set was reserved for final performance evaluation.
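The chronological 70%–15%–15% split described above can be illustrated with a minimal, dependency-free sketch. The function name `chronological_split` and the toy day index are illustrative assumptions, not the authors' code; the only property taken from the text is that the data are cut in time order without shuffling.

```python
def chronological_split(records, train_frac=0.70, val_frac=0.15):
    """Split a time-ordered sequence into train/validation/test without shuffling."""
    n = len(records)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    # Slicing preserves time order, so every validation and test record
    # comes strictly after every training record.
    return records[:train_end], records[train_end:val_end], records[val_end:]


# Toy example: 100 consecutive daily observations indexed 0..99 (hypothetical data).
days = list(range(100))
train, val, test = chronological_split(days)
```

Because the cut points depend only on position in the series, no future observation can enter the training portion.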
A consistent and robust preprocessing pipeline was applied across all data splits. Continuous input variables were standardised to have zero mean and unit variance, facilitating stable and efficient neural network training. Time-series inputs were transformed into fixed-length sliding windows, allowing the model to capture temporal dependencies and delayed effects commonly observed in wastewater dynamics. In addition to sequential inputs, a separate set of gating features was constructed to provide contextual information relevant to regime changes and abnormal pollution events.
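As a rough illustration of this preprocessing, the sketch below standardises a single variable and builds fixed-length sliding windows. Several details are assumptions rather than facts from the study: the scaling statistics are estimated on the training split only (a common way to avoid leakage), the window length of 2 is arbitrary, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Estimate mean and standard deviation (assumed: from the training split only)."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # fall back to 1.0 for a constant feature
    return mu, sigma


def transform(values, mu, sigma):
    """Standardise values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


def sliding_windows(series, targets, window):
    """Pair each length-`window` slice of inputs with the target that follows it."""
    X, y = [], []
    for end in range(window, len(series)):
        X.append(series[end - window:end])
        y.append(targets[end])
    return X, y


# Toy COD-like values (hypothetical data).
train_cod = [2.0, 4.0, 6.0, 8.0]
mu, sigma = fit_scaler(train_cod)
scaled = transform(train_cod, mu, sigma)
X, y = sliding_windows(scaled, train_cod, window=2)
```

The same `mu` and `sigma` would then be reused to transform the validation and test splits.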
To explicitly model rare but critical events, a binary shock indicator was introduced to label abnormal pollution regimes, such as sudden discharge events or system disturbances. Since such shock events occur infrequently but have high environmental impact, a weighted loss strategy was adopted during training. Residual prediction errors corresponding to shock periods were penalised more heavily than those from normal conditions, encouraging the model to focus on high-risk scenarios without degrading overall stability.
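One simple way to realise such a weighting is a weighted mean squared error on the residual targets, sketched below. The exact loss form and the weight value are not given in the text; `shock_weight=3.0` and the function name are purely illustrative assumptions.

```python
def shock_weighted_mse(residual_true, residual_pred, shock_flags, shock_weight=3.0):
    """Weighted mean squared error in which shock samples count `shock_weight` times."""
    weights = [shock_weight if flag else 1.0 for flag in shock_flags]
    squared = [(t - p) ** 2 for t, p in zip(residual_true, residual_pred)]
    # Normalising by the weight sum keeps the loss on the scale of a squared error.
    return sum(w * e for w, e in zip(weights, squared)) / sum(weights)


# Two samples: one normal (flag 0) and one shock (flag 1), hypothetical values.
weighted = shock_weighted_mse([0.0, 0.0], [1.0, 2.0], [0, 1], shock_weight=3.0)
unweighted = shock_weighted_mse([0.0, 0.0], [1.0, 2.0], [0, 1], shock_weight=1.0)
```

In this toy case the larger error falls on the shock sample, so the weighted loss exceeds the unweighted one, which is the intended emphasis on high-risk periods.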
Model training was performed using the AdamW optimiser with gradient clipping to ensure numerical stability. Early stopping was applied based on validation residual RMSE, with a minimum number of training epochs enforced to allow sufficient learning of both the residual correction network and the shock-gating mechanism. This training strategy reduced overfitting while ensuring robust convergence.
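The stopping rule can be sketched independently of the optimiser. The example below applies patience-based early stopping to a given validation RMSE curve while enforcing a minimum number of epochs; the patience and minimum-epoch values are hypothetical, since the text does not report them, and AdamW and gradient clipping are deliberately left out.

```python
def early_stopping_epoch(val_rmse_per_epoch, patience=3, min_epochs=5):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a validation RMSE curve."""
    best_rmse = float("inf")
    best_epoch = 0
    for epoch, rmse in enumerate(val_rmse_per_epoch, start=1):
        if rmse < best_rmse:
            best_rmse, best_epoch = rmse, epoch
        # Stop only once the minimum epoch count is reached and the validation
        # RMSE has not improved for `patience` consecutive epochs.
        if epoch >= min_epochs and epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_rmse_per_epoch), best_epoch


# Hypothetical validation curve whose best value occurs at epoch 2.
curve = [10.0, 9.0, 9.5, 9.6, 9.7, 9.8, 9.9]
stop_epoch, best_epoch = early_stopping_epoch(curve, patience=3, min_epochs=6)
stop_no_min, _ = early_stopping_epoch(curve, patience=3, min_epochs=1)
```

With the minimum enforced, training continues one epoch longer than patience alone would allow, while the checkpoint from the best epoch would still be the one restored.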
Overall, this experimental setup provides a realistic, challenging, and deployment-oriented evaluation framework for wastewater pollution forecasting. By combining chronological data splitting, regime-aware learning, and hybrid modelling, the proposed approach is well-suited for sustainable environmental monitoring and decision-support systems.
4.2. Comparative Performance Evaluation of COD Prediction Models
This subsection presents a comparative analysis of machine learning models for estimating chemical oxygen demand (COD) in an industrial wastewater treatment setting. The analysis tests the predictive performance, stability, and extrapolation ability of common baseline models against the proposed hybrid SAGE-GBTCN framework. To ensure an objective and comprehensive comparison, all models were trained and tested under the same data handling, feature engineering, and validation settings. Model performance is measured using widely accepted regression metrics: root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the coefficient of determination (R²). This assessment offers insight into the strengths and weaknesses of different modelling paradigms when applied to dynamic, non-linear wastewater pollution data.
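For reference, the four metrics can be computed as in the following sketch, which uses their standard textbook definitions. The function name and the toy values are illustrative and are not results from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, MAPE (in percent), and R2 from paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the observed value, so it requires non-zero targets.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}


# Hypothetical observed and predicted COD values.
m = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

Lower values are better for RMSE, MAE, and MAPE, whereas a higher R2 indicates that more of the variance in the observations is explained.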
Table 6 compares the predictive performance of the baseline machine learning models and the proposed SAGE-GBTCN model for COD prediction. The comparison uses several error-based and goodness-of-fit measures (RMSE, MAE, MAPE, and R²), which together provide a balanced assessment of prediction error, explanatory power, and robustness under dynamic wastewater conditions.
HistGradientBoosting (HistGB) is the best baseline model, with an RMSE of 33.97 and an R² of 0.927. This result suggests that tree-based ensemble models are well suited to capturing the non-linear relationships between wastewater process variables and COD concentration. CatBoost ranks close behind, with slightly different error values but similar generalisation performance, implying robustness to the feature interactions and noise that are common in industrial wastewater data. Random Forest, in comparison, exhibits much greater prediction errors and reduced explanatory power, indicating that it adapts less well to temporal dynamics and changes in pollutant levels. The Support Vector Regression (SVR) and Multilayer Perceptron (MLP) models perform worst: their RMSE and MAE are significantly greater and their R² is lower. These findings suggest that neither the kernel-based nor the neural network method generalised well under limited data and non-stationary conditions, in which pollutant variability can increase quickly and operating regimes are likely to change. This reinforces the point that a more complex model structure does not necessarily yield better predictive performance for real wastewater systems.
Figure 16.
Distribution of prediction residuals for the proposed hybrid model on the test dataset. The near-zero centring indicates minimal systematic bias, while heavier tails correspond to rare extreme pollution events.
Notably, the proposed SAGE-GBTCN model outperforms all baseline models on every evaluation measure: it attains the lowest RMSE and MAE (30.30 and 22.54, respectively), the highest R² (0.942), and a MAPE of 2.60%. The performance gain can be explained by the hybrid nature of SAGE-GBTCN, which combines the strong forecasting capability of gradient boosting with temporal residual correction to address pollutant shocks and short-term dynamics more effectively. These findings indicate that integrating process-aware temporal intelligence improves the predictive resilience of the framework without compromising computational efficiency, which makes the proposed framework particularly applicable in industrial and textile wastewater treatment settings.
4.3. Residual Distribution and Statistical Normality
The residual distribution and its normality are evaluated to assess the statistical credibility of the proposed hybrid model. The residual histogram indicates that the model has no systematic bias, since the prediction errors are centred around zero. The near-symmetric distribution suggests that the hybrid residual correction mechanism effectively removes the structural errors propagated by the base predictor.
To explore this further, a Q–Q plot of the residuals is examined. Most residual values lie close to the theoretical normal reference line, especially in the central quantile range. This implies that residuals under normal operating conditions follow approximately Gaussian behaviour. The deviations from the line in the tail regions are mostly related to exceptional pollution events and outliers, which introduce heavier-tailed features into the error distribution.
This tail behaviour is typical of real-world wastewater systems and does not invalidate the model. Rather, it reflects the inherent uncertainty of abnormal regimes, which underscores the importance of shock-aware modelling.
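The construction behind a normal Q–Q plot can be sketched with the Python standard library alone. The plotting positions (i - 0.5)/n and the function name below are assumptions chosen for illustration; the study's plotting routine is not specified in the text.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(residuals)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n avoid the undefined 0 and 1 quantiles.
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical, symmetric residuals.
points = qq_points([-2.0, -1.0, 0.0, 1.0, 2.0])
```

Points lying near the diagonal indicate approximate normality, whereas points that bend away from it at either end correspond to the heavier tails discussed above.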
Figure 17.
Q–Q plot of residuals on the test set. Close alignment with the theoretical normal line in the central region indicates approximate normality, while tail deviations reflect the influence of rare shock events.
Figure 18.
Temporal evolution of prediction residuals on the test set. Residuals oscillate around zero without persistent drift, indicating stable generalisation over time.
4.4. Temporal and Regime-Wise Error Characteristics
Residual behaviour is examined over time and across operating regimes to determine when, and under what conditions, prediction errors arise. The temporal residual plot shows no accumulation or drift of errors over the test period. This indicates that the proposed hybrid model generalises stably across seasonal and operational changes.
Short-lived residual spikes appear where pollution levels change abruptly in association with a shock. These deviations subside quickly, implying that the gated residual mechanism recovers fast without creating long-run instability.
This behaviour is also borne out by a regime-wise comparison of absolute errors. On normal operating days, the model has a small median absolute error with a narrow interquartile range. Shock days, on the other hand, have larger median errors and more dispersion, reflecting the inherent difficulty of predicting extreme events. Critically, the magnitude of the errors during shock periods remains bounded, which shows that even under unfavourable circumstances the model does not fail catastrophically.
Collectively, these findings show that the proposed framework balances stability over time with resilience to abnormal regimes.
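A regime-wise summary of this kind can be computed as sketched below, using the median and interquartile range of absolute errors per regime. The data are invented for illustration, and the quartile convention is the default of Python's `statistics.quantiles`, which may differ from the one used for the study's figures.

```python
from statistics import median, quantiles


def regime_error_summary(y_true, y_pred, shock_flags):
    """Median and interquartile range of absolute errors for each regime."""
    groups = {"normal": [], "shock": []}
    for t, p, flag in zip(y_true, y_pred, shock_flags):
        groups["shock" if flag else "normal"].append(abs(t - p))
    summary = {}
    for name, errs in groups.items():
        q1, _, q3 = quantiles(errs, n=4)  # needs at least two errors per regime
        summary[name] = {"median": median(errs), "iqr": q3 - q1}
    return summary


# Hypothetical test days: four normal days followed by four shock days.
y_true = [100.0] * 8
y_pred = [99.0, 101.0, 98.0, 102.0, 90.0, 110.0, 80.0, 120.0]
shock_flags = [0, 0, 0, 0, 1, 1, 1, 1]
summary = regime_error_summary(y_true, y_pred, shock_flags)
```

In this toy example the shock regime shows both a larger median error and a wider spread, mirroring the qualitative pattern described above.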
Figure 19.
Comparison of absolute prediction errors across normal and shock regimes on the test set. Increased dispersion during shock days reflects higher uncertainty under abnormal conditions, while bounded error ranges indicate robust behaviour.
4.5. Predicted-Actual Agreement and Bias Analysis of COD Forecasting Models
This subsection provides a visual evaluation of the agreement between predicted and observed chemical oxygen demand (COD) values on the test set for the baseline models and the proposed SAGE-GBTCN framework. Predicted–actual scatter plots are used to analyze model behavior over the range of COD values, with attention to systematic bias, dispersion, and deviations from the ideal reference line. Such visual diagnostics complement the numerical performance metrics by revealing tendencies to over- or under-estimate that aggregate error measures may not fully capture. The analysis focuses on the test set because performance on previously unseen data determines generalization ability, which is essential in real-world wastewater treatment applications. This comparison gives a better understanding of how the modeling approaches cope with the non-linearity, variability, and dynamics of pollutants in industrial wastewater systems.
Table 7 shows the agreement between predicted and observed chemical oxygen demand (COD) values on the test set for the five baseline models, namely Random Forest (RF), CatBoost, Multilayer Perceptron (MLP), Support Vector Regression with the RBF kernel (SVR-RBF), and HistGradientBoosting (HistGB), and for the proposed SAGE-GBTCN framework. The predicted–actual scatter plots provide an intuitive visual assessment of model agreement, dispersion, and systematic bias over the full COD range, with the dashed line representing ideal prediction behavior.
Among the baseline models, RF shows evident dispersion around the reference line, especially at higher COD values, which suggests greater variance and a tendency to underestimate under elevated pollutant conditions. CatBoost aligns better with the line over most of the COD range, although moderate deviations appear at high loads, which could indicate sensitivity to abrupt pollutant fluctuations. The MLP model exhibits relatively greater scatter and irregular spread, indicating reduced stability and poor generalization under non-linear and non-stationary wastewater dynamics. SVR-RBF clusters more tightly than RF and MLP in the middle of the COD range; however, bias is apparent at higher concentrations, where the predictions deviate systematically from the observed values. This behavior points to the limitations of kernel-based approaches in capturing sharp regime changes. HistGB produces consistent agreement with reduced dispersion over low and moderate COD ranges, highlighting the effectiveness of ensemble-based tree models in handling non-linear interactions between wastewater process variables.
Notably, the proposed SAGE-GBTCN framework shows the least deviation from the reference line over the whole COD spectrum. The predicted values exhibit lower scatter and minimal systematic bias, especially at higher COD conditions where baseline models are prone to divergence. This improved agreement can be traced to the hybrid design of SAGE-GBTCN, which combines a strong ensemble-based baseline prediction with temporal residual correction to better handle short-term dynamics and pollutant shock events. Overall, consistent with the quantitative performance metrics, these plots show that the proposed framework achieves better generalization and robustness in COD forecasting in dynamic industrial wastewater treatment scenarios.
The proposed SAGE-GBTCN framework is formulated as a deterministic, regime-aware regression model rather than a probabilistic exceedance predictor or binary early-warning classifier. Consequently, the focus of evaluation is placed on forecast robustness and accuracy under shock-induced regime changes, rather than precision–recall analysis or calibration of exceedance probabilities.
The proposed framework exhibits a modular prediction behavior that can be interpreted at three complementary levels. The gradient boosting base predictor primarily captures global COD trends and slowly varying process dynamics under stable operating conditions. The temporal convolutional residual learner focuses on structured short-term deviations that arise from localized process disturbances and transient operational fluctuations. The shock-aware gating mechanism dynamically regulates the influence of residual correction by attenuating its contribution during stable regimes and amplifying adaptive correction during periods of elevated volatility, as indicated by the Point Shock Indicator. This hierarchical interaction enables robust forecast stabilization without overreaction to noise.
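The interaction of the three levels can be illustrated with a deliberately simplified sketch in which the final forecast is the base prediction plus a gate-scaled residual correction. The additive form, the sigmoid gate, and every number below are assumptions made for illustration only; the exact gating function of SAGE-GBTCN is not reproduced here.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_forecast(base_pred, residual_pred, gate_features, gate_weights, gate_bias):
    """Combine a base forecast and a residual correction through a scalar gate."""
    logit = sum(w * z for w, z in zip(gate_weights, gate_features)) + gate_bias
    gate = sigmoid(logit)
    # A gate near 0 keeps the base forecast; a gate near 1 applies the full correction.
    return base_pred + gate * residual_pred, gate


# Hypothetical values: the single gating feature is a 0/1 shock indicator.
stable, g_stable = gated_forecast(250.0, 12.0, [0.0], [6.0], -3.0)
shock, g_shock = gated_forecast(250.0, 12.0, [1.0], [6.0], -3.0)
```

With these illustrative weights the correction is almost suppressed in the stable case and almost fully applied in the shock case, which is the qualitative behaviour attributed to the gating mechanism.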
4.6. Comparison Between Baseline Models and the Proposed Hybrid Framework
To study the performance of the proposed hybrid model comprehensively, it is assessed against several commonly used baseline machine learning algorithms: HistGB, CatBoost, Random Forest, Support Vector Regression (SVR), and Multilayer Perceptron (MLP). The comparison is carried out on the test data using a variety of complementary performance measures to ensure a fair and robust assessment.
As shown in the bar-chart comparison in Figure 20, the proposed hybrid framework delivers a clear quantitative improvement. It has the lowest values of the error-based measures (RMSE, MAE, and MAPE) among all compared methods, which indicates higher predictive accuracy. At the same time, it has the largest coefficient of determination (R²), demonstrating greater explanatory capacity and closer correspondence to the observed COD dynamics. Conversely, the classical machine learning models show progressively larger errors, especially in the presence of non-linear and highly complex pollution patterns.
A radar-based visualisation is also used to provide a holistic and scale-free comparison, with each metric normalised through a ratio-based scoring system (the best model scores one, and higher scores indicate better performance). This representation allows several metrics to be assessed simultaneously without biasing the comparison towards any single criterion.
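One plausible reading of this ratio-based scoring is sketched below: for error metrics the score is the best (lowest) value divided by the model's value, and for R² it is the model's value divided by the best (highest) value. This formula is an assumption, as the text does not state it explicitly; the RMSE and R² figures are those reported above for HistGB and SAGE-GBTCN.

```python
def ratio_scores(values, higher_is_better):
    """Scale one metric so the best model scores 1.0 (assumes positive values)."""
    if higher_is_better:
        best = max(values.values())
        return {model: v / best for model, v in values.items()}
    best = min(values.values())
    return {model: best / v for model, v in values.items()}


rmse = {"HistGB": 33.97, "SAGE-GBTCN": 30.30}
r2 = {"HistGB": 0.927, "SAGE-GBTCN": 0.942}
rmse_scores = ratio_scores(rmse, higher_is_better=False)
r2_scores = ratio_scores(r2, higher_is_better=True)
```

Under this convention every score lies in (0, 1], so metrics with different units can share the same radar axes.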
The radar plot in Figure 21 clearly indicates that the proposed hybrid model leads in all computed dimensions, forming the outermost curve of the radar chart. Baseline models show uneven performance: they do well on specific metrics and poorly on others. This contrast highlights the well-balanced and robust nature of the proposed framework, which combines the benefits of gradient boosting and temporal residual correction in a single framework.
Overall, both the quantitative and visual results demonstrate that the proposed hybrid model clearly outperforms the standard baseline models in accuracy and robustness. The consistent gains across the different measures demonstrate the suitability of the model for real-world wastewater pollution prediction, where stable operation under both regular and unusual circumstances is essential.
4.7. Ablation Study and Component-Wise Contribution Analysis
The contribution of each component of the proposed hybrid scheme is quantified through an ablation study on the test dataset. Four model variants were tested: (i) the HistGB baseline, (ii) the augmented residual network (HistGB + TCN residual), (iii) the gated residual correction model built on HistGB, and (iv) the full proposed SAGE-GBTCN. This step-by-step evaluation shows how each architectural element enhances predictive performance.
The quantitative analysis shows that performance improves progressively as components are added. Adding a TCN-based residual correction yields lower RMSE and MAE than the baseline model, which demonstrates that temporally learned residuals capture systematic errors that HistGB alone cannot predict. Adding the gating mechanism further increases robustness through selective activation of residual corrections, resulting in better performance under complex operating conditions. The complete proposed SAGE-GBTCN model has the smallest error values and the largest coefficient of determination, which confirms the complementary benefit of residual learning and shock-aware gating.
To compare all the main performance measures in an unbiased and holistic manner and to remove scale effects, a radar-based visualisation is used with ratio-normalised scores, in which the best-performing model on each measure receives a score of one. This representation highlights the relative strengths and weaknesses of each model variant without favouring any single metric.
Figure 22.
Ablation study comparison on the test set using RMSE, MAE, MAPE, and R² metrics. Progressive performance improvements are observed as residual learning and gating mechanisms are incorporated, with the proposed SAGE-GBTCN model achieving the best overall results.
The radar chart shows that the proposed SAGE-GBTCN scheme leads in all the considered dimensions, forming the outermost envelope of the comparison. Conversely, the partial variants show uneven performance gains, scoring low on certain measures and high on others. This finding supports the importance of jointly incorporating temporal residual learning and gated correction in order to attain balanced and robust performance in wastewater pollution prediction.
Figure 23.
Radar-based ablation study comparison on the test set using ratio-normalised performance scores. The proposed SAGE-GBTCN model consistently achieves the highest scores across all metrics, indicating balanced and superior performance.
4.8. Model Interpretability Using SHAP Analysis
To make the proposed SAGE-GBTCN framework more transparent and interpretable, SHapley Additive exPlanations (SHAP) are used to measure the contribution of individual input features to the model prediction. SHAP is based on cooperative game theory and offers consistent, additive feature attributions, which makes it especially suitable for explaining complex hybrid architectures. The purpose of this analysis is to identify the dominant drivers of the COD forecasts and to check the learned relationships against established knowledge of wastewater processes.
Global feature importance is analysed first using mean absolute SHAP values, which give the average magnitude of each feature's influence on the model output. As shown in Figure 24, short-term COD dynamics dominate the prediction. In particular, the first-order COD difference and the one-day lagged COD are the most significant predictors. This observation indicates a strong autoregressive effect, reflecting memory in the evolution of COD in wastewater systems. Variables such as Total Nitrogen (TN) and Biochemical Oxygen Demand (BOD) also provide useful information, as they characterise the nutrient and organic load as well as treatment performance.
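The global importance measure itself is simple to compute once per-sample SHAP values are available. The sketch below averages absolute attributions per feature and ranks the result; the attribution values are made up for illustration and are not results from the study, and obtaining real SHAP values would require an explainer that is not shown here.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across samples."""
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Made-up attributions for three samples (rows) and three features (columns).
features = ["COD_diff_1", "COD_lag_1", "TN"]
shap_rows = [[4.0, -2.0, 0.5], [-6.0, 3.0, -0.5], [2.0, -1.0, 0.5]]
ranking = mean_abs_shap(shap_rows, features)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so the ranking reflects the magnitude of influence rather than its direction.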
A consistent importance pattern is observed when an alternative global SHAP aggregation view is considered. As illustrated in Figure 25, the dominance of short-horizon COD-related features persists across different importance representations, confirming the robustness of the identified feature rankings. In contrast, meteorological variables, flow-related indicators, and longer-term rolling statistics exhibit comparatively lower importance, suggesting that short-term temporal dependencies are more critical than slowly varying exogenous factors for COD prediction in this setting.
Beyond global importance, a SHAP beeswarm plot is used to analyse the direction and distribution of feature effects. As illustrated in Figure 26, higher values of the COD-derived predictors, namely COD_diff_1 and COD_lag_1, are mostly associated with higher SHAP values and therefore push the model output upwards. Weaker patterns are observed for TN and BOD. The concentration of SHAP values around zero for the less influential features confirms that the model is not overly dependent on irrelevant inputs.
Overall, the SHAP-based interpretability analysis shows that the proposed SAGE-GBTCN framework learns physically intuitive and temporally consistent relationships. This agreement between data-driven feature importance and domain knowledge increases trust in the predictive validity of the model and supports its use as a tool for real-world wastewater monitoring and decision-making.
4.9. Comparison with Representative AI-Based Wastewater Prediction Models
To contextualise the proposed SAGE-GBTCN framework within recent advances in AI-based wastewater quality prediction, a comparative assessment is conducted against representative state-of-the-art approaches, as summarised in Table 8. The selected studies span attention-based recurrent architectures [21], interpretable LSTM models with post-hoc explanation [22], and hybrid temporal convolutional frameworks [45], which collectively reflect prevailing methodological trends in the literature.
As shown in Table 8, existing approaches primarily emphasise improving average predictive accuracy through increasingly complex deep learning architectures. While these models demonstrate strong performance under normal operating conditions, most lack explicit mechanisms for residual correction or regime-aware adaptation. In particular, neither attention-based recurrent models [21] nor SHAP-enhanced LSTM frameworks [22] incorporate selective correction strategies, resulting in limited robustness when confronted with abrupt disturbances or shock loads. Similarly, hybrid TCN–LSTM approaches [45], although effective at capturing temporal dependencies, typically rely on site-specific calibration and do not explicitly address extreme-event behaviour.
In contrast, the proposed SAGE-GBTCN framework introduces an explicit residual learning strategy on top of a strong gradient boosting baseline and employs a learned gating mechanism to selectively activate corrections during abnormal regimes. This design enables the model to maintain stability under normal operating conditions while improving responsiveness during extreme pollution events, a capability that is largely absent in existing studies. Moreover, the integration of SHAP-based interpretability and ablation analysis provides transparent insight into feature contributions and component-wise effects, addressing a common limitation of prior deep learning approaches.
Overall, the comparison highlights that the proposed framework advances the state of the art not merely through architectural complexity, but by explicitly addressing regime shifts, robustness, and deployment-oriented reliability. These characteristics are particularly important for real-world wastewater monitoring and decision-support systems, where non-stationarity and rare but high-impact events are unavoidable.