3. Results
3.1. In-Sample Model Performance
A clear difference is observed between the three nonlinear models (GBR, RF, MLP) and the three linear models (OLS, Ridge, Lasso), as shown in Table 5, which reports in-sample performance metrics for all six models.
Table 5.
In-sample performance metrics (N = 262), sorted by R² descending.
| Model | R² | RMSE (kg) | MAE (kg) | MSE (kg²) | MAPE (%) |
|---|---|---|---|---|---|
| Gradient Boosting | 0.9999 | 26.83 | 22.26 | 720 | 3.74 |
| Random Forest | 0.9949 | 260.93 | 160.32 | 68,084 | 19.55 |
| MLP | 0.9915 | 336.94 | 217.09 | 113,530 | 26.87 |
| Linear Regression | 0.2368 | 3,194.38 | 1,576.88 | 10,204,062 | 194.96 |
| Ridge Regression | 0.2368 | 3,194.40 | 1,574.30 | 10,204,164 | 194.31 |
| Lasso Regression | 0.2368 | 3,194.38 | 1,576.16 | 10,204,067 | 194.82 |
As Table 5 shows, the six algorithms, spanning nonlinear machine learning and classical linear regression, fall into a pronounced performance hierarchy.
Three nonlinear models achieved exceptional in-sample accuracy: Gradient Boosting (R² = 0.9999; MAPE = 3.74%), Random Forest (R² = 0.9949; MAPE = 19.55%), and MLP (R² = 0.9915; MAPE = 26.87%).
By contrast, Linear Regression, Ridge Regression, and Lasso Regression all performed nearly identically and very poorly (R² ≈ 0.2368; MAPE ≈ 195%). The approximately 52-fold higher MAPE of Linear Regression relative to Gradient Boosting (194.96% vs. 3.74%) indicates that nonlinear models better capture clinic-specific heterogeneity and predictor-outcome interactions in this dataset.
The consistency of results across Gradient Boosting, Random Forest, and MLP, which belong to three independent algorithmic families, provides consistent evidence that nonlinearity is crucial to the problem.
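For reference, the five metrics reported in Table 5 can be computed as in the minimal sketch below. The function name `regression_metrics` and the four sample values are hypothetical and serve only as an illustration; they are not the study data. Because MAPE divides each error by the observed value, clinic-years with small waste volumes can weigh heavily on it even when R² is high.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R², RMSE, MAE, MSE and MAPE (%) for paired observations."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
    sse = sum(r * r for r in residuals)              # sum of squared errors
    sst = sum((yt - mean_y) ** 2 for yt in y_true)   # total sum of squares
    mse = sse / n
    return {
        "r2": 1.0 - sse / sst,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "mse": mse,
        "mape": 100.0 * sum(abs(r / yt) for r, yt in zip(residuals, y_true)) / n,
    }

# Hypothetical clinic-year values in kg, for illustration only.
y_true = [100.0, 200.0, 400.0, 800.0]
y_pred = [110.0, 190.0, 420.0, 760.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```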
Figure 4.
Model R²: In-sample (solid), 5-fold CV (hatched), and 10-fold CV (cross-hatched).
3.2. Cross-Validation Results
The cross-validation results reveal critical information about model generalization. The overfitting gaps (ΔR² = in-sample R² − 5-fold CV R²; relative overfitting = ΔR² / in-sample R²) are:
- Gradient Boosting: ΔR² = 0.1811 (18.1% relative overfitting); CV R² = 0.8188.
- Random Forest: ΔR² = 0.1772 (17.8% relative overfitting); CV R² = 0.8177.
- MLP: ΔR² = 1.3451 (135.7% relative overfitting); CV R² = -0.3536.
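The gaps listed above follow directly from the reported R² values. The short sketch below reproduces the arithmetic, assuming that the relative figure is the gap divided by the in-sample R²; the helper name `overfitting_gap` is hypothetical.

```python
# In-sample and 5-fold CV R² values as reported in Tables 5 and 6.
r2_in_sample = {"Gradient Boosting": 0.9999, "Random Forest": 0.9949, "MLP": 0.9915}
r2_cv_5fold = {"Gradient Boosting": 0.8188, "Random Forest": 0.8177, "MLP": -0.3536}

def overfitting_gap(r2_train, r2_cv):
    """Return (absolute gap, gap as a percentage of the in-sample R²)."""
    gap = r2_train - r2_cv
    return gap, 100.0 * gap / r2_train

gaps = {m: overfitting_gap(r2_in_sample[m], r2_cv_5fold[m]) for m in r2_in_sample}
for model, (gap, rel) in gaps.items():
    print(f"{model}: gap = {gap:.4f}, relative = {rel:.1f}%")
```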
Table 6.
5-fold and 10-fold cross-validation results (mean ± SD).
| Model | 5-Fold CV R² | 5-Fold RMSE (kg) | 10-Fold CV R² | 10-Fold RMSE (kg) |
|---|---|---|---|---|
| Gradient Boosting | 0.8188 ± 0.2722 | 722.3 ± 161.7 | 0.7173 ± 0.3169 | 672.9 ± 215.0 |
| Random Forest | 0.8177 ± 0.2473 | 850.7 ± 367.2 | 0.7435 ± 0.2835 | 682.9 ± 267.9 |
| MLP | -0.3536 ± 0.9835 | 2,996.8 ± 1,252.4 | -1.6588 ± 2.8350 | 2,813.1 ± 1,321.8 |
| Linear Regression | -0.6335 ± 1.3954 | 3,109.5 ± 1,201.2 | -1.5843 ± 2.4305 | 3,041.6 ± 1,395.9 |
| Ridge Regression | -0.6239 ± 1.3795 | 3,107.4 ± 1,203.4 | -1.5626 ± 2.4008 | 3,038.5 ± 1,399.4 |
| Lasso Regression | -0.6318 ± 1.3922 | 3,109.3 ± 1,201.8 | -1.5806 ± 2.4255 | 3,041.3 ± 1,396.7 |
Cross-validation revealed important disparities between in-sample and generalized model performance. Gradient Boosting and Random Forest exhibited moderate overfitting (ΔR² = 0.1811 and 0.1772), with cross-validated R² values of 0.8188 and 0.8177, respectively. In contrast, the MLP showed poor generalization (CV R² = -0.3536), indicating instability under the current sample size and augmentation setting.
Accordingly, MLP forecasts are retained only as a high-uncertainty stress scenario and are not used for primary operational recommendations. Primary planning conclusions are based on Random Forest and Gradient Boosting.
For operational planning, the R²-implied cross-validated RMSE of approximately 1,553 kg (derived from CV R² = 0.82 and observed SD = 3,664 kg) provides a more conservative bound on expected forecast accuracy for both Random Forest and Gradient Boosting.
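The link between R², the target's standard deviation, and RMSE can be made explicit. Since R² = 1 − SSE/SST, it follows that RMSE ≈ SD × √(1 − R²) when SD is computed on the same observations. The sketch below applies this identity to the rounded inputs quoted above. It lands within a few kilograms of the stated figure; the unrounded inputs behind that figure are not given here, so an exact match should not be expected. The function name is hypothetical.

```python
import math

def rmse_implied_by_r2(target_sd, r2):
    """RMSE implied by R² when SSE/SST = 1 - R² and SST/n = target_sd**2."""
    return target_sd * math.sqrt(1.0 - r2)

# Rounded inputs quoted in the text: observed SD = 3,664 kg, CV R² = 0.82.
implied_rmse = rmse_implied_by_r2(target_sd=3664.0, r2=0.82)
print(f"R²-implied RMSE: {implied_rmse:.1f} kg")
```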
Figure 5.
Distribution of R² scores across 5 cross-validation folds. Box plots show median, IQR, whiskers (1.5×IQR), and individual fold values.
For Gradient Boosting, ΔR² = 0.1811 means that about 18% of the variance explained in-sample reflects noise or clinic-specific quirks rather than generalizable patterns.
Random Forest shows the smallest overfitting gap among the nonlinear models. It is arguably preferable to GBR in practice because it achieves a nearly identical CV R² (0.8177 vs. 0.8188) with a slightly smaller gap, and its depth limit (max_depth=10) constrains complexity.
The MLP overfits so severely because the data augmentation backfired: 30× Gaussian-noise copies were added during training. The MLP fit the augmented noise patterns closely, but those patterns do not exist in real data, so its predictions on real held-out data are badly off. Gaussian noise does not model the true uncertainty: real clinic-year variation follows the panel structure (clinic effects + year effects + residuals), not isotropic Gaussian noise. The MLP therefore learned the wrong noise distribution.
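One common safeguard when combining noise augmentation with cross-validation is to generate the noisy copies inside each fold, from the training part only, so that validation rows stay untouched real observations. The text does not say how augmentation and fold splitting were ordered in this study, so the sketch below is an illustration under that assumption and not the authors' procedure. All function names and the noise scale `sigma` are hypothetical.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def augment(rows, copies, sigma, rng):
    """Return the original rows plus `copies` Gaussian-jittered versions of each."""
    out = [list(r) for r in rows]
    for r in rows:
        for _ in range(copies):
            out.append([v + rng.gauss(0.0, sigma) for v in r])
    return out

def augmented_cv_splits(rows, k=5, copies=30, sigma=0.05, seed=0):
    """Yield (augmented training rows, untouched validation rows) per fold."""
    rng = random.Random(seed)
    folds = k_fold_indices(len(rows), k, seed)
    for i, val_idx in enumerate(folds):
        train = [rows[j] for f, fold in enumerate(folds) if f != i for j in fold]
        yield augment(train, copies, sigma, rng), [rows[j] for j in val_idx]

# Toy feature rows, for illustration only.
rows = [[float(i), float(2 * i)] for i in range(20)]
splits = list(augmented_cv_splits(rows, k=5, copies=30))
for train_rows, val_rows in splits:
    print(len(train_rows), "training rows,", len(val_rows), "validation rows")
```

With this arrangement the validation score is always measured on real clinic-year rows, which keeps it comparable across models that do and do not use augmentation.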
Table 7 summarizes these considerations.
Table 7.
In-Sample and Cross-Validated Performance of Nonlinear Models with Overfitting Gap (ΔR²).
| Model | In-Sample R² | CV R² | ΔR² | Overfitting Severity |
|---|---|---|---|---|
| Gradient Boosting | 0.9999 | 0.8188 | 0.1811 | Moderate |
| Random Forest | 0.9949 | 0.8177 | 0.1772 | Moderate |
| MLP | 0.9915 | -0.3536 | 1.3451 | Catastrophic |
3.3. Hold-Out Test Set Evaluation (80/20 Split)
The models are trained on 80% of the data and evaluated on the remaining, unseen 20% to measure how well they generalize to new data. The test-set results constitute the strongest evidence of generalization, as these observations were never used during training or cross-validation. The tree-based models maintain strong performance on unseen data, while the MLP and the linear models remain inadequate.
Table 8.
Out-of-Sample (Test) Performance Comparison of Six Models for Infectious Waste Prediction.
| Model | Test R² | Test RMSE (kg) | Test MAE (kg) |
|---|---|---|---|
| Gradient Boosting | 0.9692 | 743.15 | 492.04 |
| Random Forest | 0.9596 | 850.88 | 468.03 |
| MLP | 0.2106 | 3,761.02 | 1,544.00 |
| Linear Regression | 0.1903 | 3,809.24 | 1,799.59 |
| Ridge Regression | 0.1904 | 3,808.93 | 1,795.99 |
| Lasso Regression | 0.1902 | 3,809.43 | 1,798.85 |
The best overall model is Gradient Boosting, with the highest test R² (0.9692) and the lowest test RMSE (743.15 kg), indicating the strongest generalization and the lowest risk of large errors.
Random Forest is second (R² = 0.9596). Its RMSE is somewhat higher (850.88 kg), but it has the best MAE (468.03 kg), meaning a lower average absolute error.
The MLP underperforms markedly: R² = 0.2106, with much larger errors (RMSE = 3,761.02 kg, MAE = 1,544.00 kg), far behind the tree-based models.
The linear family (OLS, Ridge, Lasso) performs uniformly poorly, with R² ≈ 0.19, RMSE ≈ 3,809 kg, and MAE ≈ 1,796–1,800 kg, showing minimal benefit from regularization in this dataset.
Tree-based ensemble models (Gradient Boosting, Random Forest) are clearly superior; the data appear strongly nonlinear, and the linear and MLP models are not competitive for test-set prediction.
3.4. Statistical Significance of Model Differences
The Friedman test indicated significant overall differences among models (χ² = 21.571, p = 6.3149e-04). However, pairwise Wilcoxon tests among nonlinear models did not reach the 0.05 threshold, with two borderline comparisons (p = 0.0625). Therefore, differences between GBR and RF should be interpreted primarily as practical rather than strictly inferential under the current resampling design.
Table 9.
Pairwise Wilcoxon signed-rank tests on 5-fold CV R² between nonlinear models.
| Comparison | W Statistic | p-value | Sig. (p < 0.05) |
|---|---|---|---|
| GBR vs. RF | 7.00 | 1.0000 | No |
| RF vs. MLP | 0.00 | 0.0625 | No |
| GBR vs. MLP | 0.00 | 0.0625 | No |
Wilcoxon signed-rank testing found no statistically significant pairwise differences at α = 0.05: GBR vs. RF (W = 7.00, p = 1.0000), RF vs. MLP (W = 0.00, p = 0.0625), and GBR vs. MLP (W = 0.00, p = 0.0625). Therefore, under the current resampling design, none of the observed model gaps can be claimed as statistically significant. With only five paired folds, p = 0.0625 is the smallest two-sided p-value the exact test can produce, so the two W = 0.00 comparisons are as extreme as this design allows; they suggest a possible trend, but the evidence cannot reach the conventional threshold. This pattern reflects limited test power (small number of folds), so differences should be interpreted as practical rather than inferential.
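The p-values in Table 9 can be understood by enumerating the exact null distribution of the signed-rank statistic for five paired folds. The sketch below is a minimal pure-Python version that assumes no zero differences and no tied absolute differences; the per-fold differences are hypothetical and chosen only to reproduce W = 0 and W = 7. It is not the routine used in the study.

```python
from itertools import product

def exact_wilcoxon_two_sided(diffs):
    """Exact two-sided Wilcoxon signed-rank test.

    Assumes no zero differences and no tied absolute differences.
    Returns (W, p) with W = min(W+, W-).
    """
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = n * (n + 1) // 2
    w_obs = min(w_plus, total - w_plus)
    # Enumerate all 2**n equally likely sign patterns under the null hypothesis.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= w_obs:
            hits += 1
    return w_obs, hits / 2 ** n

# Hypothetical per-fold R² differences (model A minus model B), one per fold.
one_sided_wins = [0.9, 1.4, 0.7, 2.1, 1.1]        # A beats B in every fold
mixed_results = [-0.01, 0.02, -0.03, -0.04, 0.05]  # no consistent winner

print(exact_wilcoxon_two_sided(one_sided_wins))
print(exact_wilcoxon_two_sided(mixed_results))
```

Only 2 of the 32 sign patterns give W = 0, so no five-fold comparison can fall below 2/32 regardless of how large the per-fold differences are; more folds or repeated cross-validation would be needed for a finer-grained test.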
3.5. Actual vs. Predicted Analysis
Actual vs. predicted scatter plots are one of the most informative diagnostic visuals for regression models. They show whether each model reproduces observed values across the full range of outcomes, not just on average.
In each plot, the x-axis shows the actual observed waste values (y), the y-axis shows the model-predicted values (ŷ), each dot is one clinic-year observation, and the red dashed line marks perfect prediction (y = ŷ).
The actual-versus-predicted scatter plots indicate clear performance stratification between model classes. In the top row (nonlinear models), points align closely with the y = ŷ reference, indicating high fidelity across the observed outcome range and improved handling of nonlinear clinic-level patterns.
Figure 6.
Actual vs. predicted scatter plots for all six models. Red dashed line = perfect prediction (y = ŷ). Top row: nonlinear models; Bottom row: linear models.
Gradient Boosting and Random Forest exhibit the tightest concentration around the diagonal, whereas MLP shows comparatively larger dispersion. In the bottom row (linear models), point clouds are substantially wider and display value-range compression, with overprediction at lower outcomes and underprediction at higher outcomes.
This visual pattern corroborates the metric-based evidence that nonlinear models provide superior predictive structure for infectious-waste forecasting in heterogeneous panel data.
3.6. Residual Diagnostics
The residual diagnostics reveal a clear two-tier structure consistent with all prior performance metrics. Gradient Boosting produced residuals closest to the ideal white-noise benchmark: zero mean, SD=26.8 kg, near-symmetric distribution (skewness=0.170), and a Durbin–Watson statistic of 2.027 indicating no serial autocorrelation.
The Jarque–Bera test failed to reject normality (p=0.061), the only model to achieve this. Random Forest and MLP residuals were substantially more dispersed (SD=260.9 and 335.2, respectively) with significant non-normality attributable to the highly skewed target distribution; both nonetheless maintained Durbin–Watson values in the acceptable range (1.841 and 1.767).
Table 10.
Residual diagnostics: Shapiro–Wilk (S–W) and Jarque–Bera (J–B) normality tests, Durbin–Watson (D–W) serial correlation statistic.
| Model | Mean (kg) | SD (kg) | Skewness | Kurtosis | S–W p | J–B p | D–W |
|---|---|---|---|---|---|---|---|
| Gradient Boosting | 0.0 | 26.8 | 0.170 | -0.631 | 2.65e-02 | 6.06e-02 | 2.027 |
| Random Forest | 0.5 | 260.9 | 1.871 | 7.952 | 3.44e-16 | 7.97e-184 | 1.841 |
| MLP | -34.0 | 335.2 | -1.160 | 9.177 | 6.50e-14 | 3.78e-213 | 1.767 |
| Linear Regression | -0.0 | 3,194.4 | 3.649 | 15.923 | 1.87e-24 | 0.00e+00 | 0.235 |
| Ridge Regression | 0.0 | 3,194.4 | 3.658 | 15.955 | 1.69e-24 | 0.00e+00 | 0.235 |
| Lasso Regression | -0.0 | 3,194.4 | 3.652 | 15.935 | 1.82e-24 | 0.00e+00 | 0.235 |
All models reject the normality assumption for residuals under the Shapiro–Wilk test (S–W p < 0.05), consistent with the highly skewed target distribution. Durbin–Watson values near 2.0 indicate the absence of first-order serial correlation, as observed for the three nonlinear models (1.767–2.027). GBR residuals have the smallest standard deviation (26.8 kg) and skewness closest to zero, indicating the most symmetrically distributed errors.
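The Jarque–Bera p-values in Table 10 can be checked from the reported moments alone, because the statistic depends only on sample size, skewness, and kurtosis. The sketch below assumes the kurtosis column is excess kurtosis (it is negative for GBR, which raw kurtosis cannot be) and uses the rounded table values, so small deviations from the tabulated p-values are expected. The function name is hypothetical.

```python
import math

def jarque_bera_from_moments(n, skewness, excess_kurtosis):
    """Jarque–Bera statistic and p-value from sample moments.

    JB = n/6 * (S**2 + K**2 / 4), referred to a chi-square distribution
    with 2 degrees of freedom, whose survival function is exp(-x / 2).
    """
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)

# Rounded values from Table 10, with N = 262 residuals per model.
gbr_stat, gbr_p = jarque_bera_from_moments(262, 0.170, -0.631)
rf_stat, rf_p = jarque_bera_from_moments(262, 1.871, 7.952)
print(f"GBR: JB = {gbr_stat:.2f}, p = {gbr_p:.3f}")
print(f"RF:  JB = {rf_stat:.1f}, p = {rf_p:.2e}")
```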
Figure 7.
Residual distributions for the three nonlinear models. Histograms (density-normalized) with fitted normal curves (dashed). Shapiro–Wilk p-values annotated.
Figure 8.
Normal Q–Q plots for nonlinear model residuals. Departure from the red reference line indicates non-normality. GBR shows the tightest adherence to normality.
The three linear models (OLS, Ridge, and Lasso) exhibited diagnostic failure across all criteria. Most critically, their Durbin–Watson statistics of 0.235 indicate strong positive serial autocorrelation (estimated first-order residual correlation ≈ 0.88), consistent with the linear functional form failing to capture the temporal and clinic-level heterogeneity present in the panel data.
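The quoted residual correlation follows from the large-sample approximation DW ≈ 2(1 − ρ), where ρ is the first-order autocorrelation of the residuals. The sketch below shows both the statistic itself and the conversion; the function names and the toy alternating series are hypothetical, and the DW values are those of Table 10.

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den

def implied_lag1_autocorrelation(dw):
    """Large-sample approximation DW ≈ 2 * (1 - rho), solved for rho."""
    return 1.0 - dw / 2.0

rho_linear = implied_lag1_autocorrelation(0.235)   # linear models
rho_gbr = implied_lag1_autocorrelation(2.027)      # Gradient Boosting
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # toy series with sign flips

print(f"linear models: rho = {rho_linear:.4f}")
print(f"GBR: rho = {rho_gbr:.4f}")
print(f"toy alternating series: DW = {alternating:.1f}")
```

Because the statistic is built from successive differences, its value depends on how the clinic-year residuals are ordered before it is computed.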
Combined with extreme skewness (>3.6) and kurtosis (>15.9), these diagnostics validate the exclusion of linear specifications from forecasting applications in this study.
3.7. Feature Importance Analysis
Permutation-based feature importance, expressed as the mean decrease in predictive performance after perturbing each predictor (ΔR² ± SD), showed strong model-dependent ranking patterns. In the two best-performing nonlinear models, Gradient Boosting and Random Forest, Beds was the dominant predictor by a large margin (GBR: 1.5978 ± 0.1109; RF: 1.5438 ± 0.0991), while Patient Days had a secondary contribution (GBR: 0.2101 ± 0.0227; RF: 0.1357 ± 0.0123), and Patients Treated and Bed Occupancy Percentage had comparatively small effects (all ≈0.02–0.03).
In contrast, the MLP assigned high importance to multiple predictors (Beds: 0.9128 ± 0.1162; Patients: 0.7213 ± 0.1676; Patient Days: 1.1272 ± 0.1592; Occupancy: 0.4628 ± 0.1164), with larger variability indicating reduced stability of attribution.
Linear-model families (OLS, Ridge, Lasso) produced nearly identical profiles, with Patient Days as the strongest predictor (≈0.24–0.25), followed by Patients (≈0.17), Occupancy (≈0.06), and Beds (≈0.03).
Collectively, these findings support the nonlinear structure of the data-generating process and reinforce the use of Gradient Boosting and Random Forest as primary forecasting models.
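Permutation importance can be sketched in a few lines: shuffle one predictor, re-score the fitted model, and record the drop in R². The version below is a minimal pure-Python illustration with a toy stand-in for a fitted model; the names `permutation_importance` and `toy_predict` are hypothetical, and the study's actual implementation is not specified here. It also shows why ΔR² can exceed 1: once a dominant feature is shuffled, the R² of the permuted predictions can fall below zero.

```python
import random

def r2_score(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    sse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    sst = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - sse / sst

def permutation_importance(predict, X, y, column, n_repeats=30, seed=0):
    """Mean and SD of the R² drop after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
        drops.append(baseline - r2_score(y, [predict(row) for row in X_perm]))
    mean = sum(drops) / n_repeats
    sd = (sum((d - mean) ** 2 for d in drops) / n_repeats) ** 0.5
    return mean, sd

# Toy stand-in for a fitted model: the target depends only on feature 0.
def toy_predict(row):
    return 10.0 * row[0]

X = [[float(i), float((7 * i) % 13)] for i in range(40)]
y = [toy_predict(row) for row in X]
imp_used, sd_used = permutation_importance(toy_predict, X, y, column=0)
imp_unused, sd_unused = permutation_importance(toy_predict, X, y, column=1)
print(f"feature 0: {imp_used:.3f} ± {sd_used:.3f}")
print(f"feature 1: {imp_unused:.3f} ± {sd_unused:.3f}")
```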
Table 11.
Permutation importance (mean ΔR² ± SD, 30 repeats) by model and feature.
| Model | Beds (ΔR²) | Patients (ΔR²) | Pat. Days (ΔR²) | Occup. (ΔR²) |
|---|---|---|---|---|
| Gradient Boosting | 1.5978 ± 0.1109 | 0.0241 ± 0.0030 | 0.2101 ± 0.0227 | 0.0232 ± 0.0026 |
| Random Forest | 1.5438 ± 0.0991 | 0.0295 ± 0.0034 | 0.1357 ± 0.0123 | 0.0275 ± 0.0034 |
| MLP | 0.9128 ± 0.1162 | 0.7213 ± 0.1676 | 1.1272 ± 0.1592 | 0.4628 ± 0.1164 |
| Linear Regression | 0.0334 ± 0.0114 | 0.1686 ± 0.0378 | 0.2493 ± 0.0368 | 0.0633 ± 0.0197 |
| Ridge Regression | 0.0340 ± 0.0115 | 0.1661 ± 0.0374 | 0.2436 ± 0.0362 | 0.0649 ± 0.0199 |
| Lasso Regression | 0.0333 ± 0.0114 | 0.1681 ± 0.0377 | 0.2492 ± 0.0368 | 0.0632 ± 0.0197 |
The following table shows relative feature importance for the two tree-based models. Each row is effectively a weight distribution across predictors (values are proportions and each row sums to about 1.00).
Table 12.
Gini importance (mean decrease in impurity) for tree-based models.
| Model | Beds | Patients | Pat. Days | Occupancy |
|---|---|---|---|---|
| Random Forest | 0.6422 | 0.0210 | 0.2485 | 0.0883 |
| Gradient Boosting | 0.6175 | 0.0148 | 0.2893 | 0.0784 |
This table shows a very consistent feature-importance pattern across the two best models. In both Random Forest and Gradient Boosting, Beds is the most influential predictor (64.22% and 61.75%). Patient Days is the second most important variable (24.85% and 28.93%). Occupancy has a smaller, supporting contribution (8.83% and 7.84%). Patients contributes the least (2.10% and 1.48%).
Overall conclusion: both tree models agree that infectious-waste prediction is driven primarily by hospital capacity and utilization intensity (Beds and Patient Days), while raw patient count adds minimal additional information.
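As a quick consistency check, the Gini importances in Table 12 can be verified to sum to 1 per model and converted to ranked percentage shares. The sketch below uses the tabulated values; the helper name `ranked_shares` is hypothetical.

```python
# Gini importances as reported in Table 12.
gini = {
    "Random Forest": {"Beds": 0.6422, "Patients": 0.0210, "Pat. Days": 0.2485, "Occupancy": 0.0883},
    "Gradient Boosting": {"Beds": 0.6175, "Patients": 0.0148, "Pat. Days": 0.2893, "Occupancy": 0.0784},
}

def ranked_shares(importances):
    """Return (feature, percentage share) pairs sorted from largest to smallest."""
    total = sum(importances.values())
    shares = [(name, 100.0 * value / total) for name, value in importances.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)

for model, importances in gini.items():
    ranking = ranked_shares(importances)
    top_two = ranking[0][1] + ranking[1][1]
    print(model, [f"{name}: {share:.2f}%" for name, share in ranking])
    print(f"  Beds + Pat. Days together: {top_two:.2f}%")
```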
Figure 9.
Grouped bar chart of permutation importance (ΔR²) for all six models and four features. Error bars = SD across 30 permutation repeats.
This grouped bar chart shows a clear separation between model families in how they use predictors. Gradient Boosting and Random Forest are strongly dominated by Beds, with Patient Days as a secondary contributor and very small roles for Patients and Occupancy.
MLP assigns substantial importance to all four predictors, but with noticeably larger error bars, indicating higher variability and less stable attribution across repeats.
Linear, Ridge, and Lasso show nearly identical profiles, with modest importance concentrated in Patient Days and Patients, and weak contribution from Beds and Occupancy.
Error bars are generally tight for tree models and linear models, but wider for MLP, consistent with its weaker generalization stability.
Overall, the chart supports the main conclusion: robust models (GBR/RF) rely on a consistent, interpretable structure (Beds + Patient Days), while MLP and linear models either over-distribute importance or capture weaker signal patterns.
Figure 10.
Gini importance (mean decrease in impurity) for Random Forest and Gradient Boosting. Beds is the dominant feature in both tree-based models.
For tree-based models, the number of beds is overwhelmingly the most important feature, accounting for the majority of explained variance. Beds serve as a proxy for clinic identity and medical specialization: clinics can be uniquely identified by their bed count, allowing tree splits to effectively cluster departments with similar waste profiles.
Patient Days is a secondary contributor. Occupancy and Patients have comparatively small roles. Infectious-waste prediction in the tree models is primarily governed by hospital capacity (Beds), with other variables adding incremental but smaller gains.
For linear models, no feature shows strong importance, consistent with their overall failure.
3.8. Ten-Year Forecasting Results (2022–2031)
Real-world panel data from 24 clinics over 2011-2021 (N=262 clinic-year observations) were modeled to forecast infectious waste generation for 2022-2031 using six algorithms: Random Forest, Gradient Boosting, MLP, Linear Regression (OLS), Ridge, and Lasso.
The predictor set included beds, treated patients, patient days, and bed occupancy, with annual infectious waste (kg) as the target.
Descriptive diagnostics showed substantial heterogeneity and right-skewness in outcomes, consistent with the observed concentration of waste generation in high-volume clinics.
Correlation analysis indicated moderate associations between the target and utilization-related variables, while VIF values suggested no severe multicollinearity, with only patient days showing moderate inflation.
We first recall the historical context of annual infectious waste generation at the VMA. The record divides into two phases: Phase 1 (2011–2019) and Phase 2 (2020–2021).
Table 13.
Phase 1—Stable/Declining Period (2011–2019).
Total infectious waste, all 24 clinics.
| Year | Actual Waste (kg) | YoY Change (%) |
|---|---|---|
| 2011 | 35,797 | |
| 2012 | 39,438 | +10.2% |
| 2013 | 40,602 | +3.0% |
| 2014 | 41,752 | +2.8% |
| 2015 | 38,000 | -9.0% |
| 2016 | 37,690 | -0.8% |
| 2017 | 36,242 | -3.8% |
| 2018 | 36,742 | +1.4% |
| 2019 | 36,004 | -2.0% |
| Mean | 38,030 | |
| Min | 35,797 | |
| Max | 41,752 | |
Between 2011 and 2019, total infectious waste generation across clinics remained broadly stable, fluctuating around a pre-pandemic baseline with no sustained long-term growth pattern. This period is characterized by moderate year-to-year variability and a near-flat trend, indicating relative operational equilibrium in waste production.
Table 14.
Phase 2—COVID-Era Spike (2020–2021).
Total infectious waste, all 24 clinics (2019 shown as baseline reference).
| Year | Actual Waste (kg) | YoY Change (%) | Note |
|---|---|---|---|
| 2019 | 36,004 | | Pre-COVID baseline |
| 2020 | 42,865 | +19.1% | COVID-19 onset: infectious waste surge |
| 2021 | 53,287 | +24.3% | COVID peak: highest ever recorded |
| 2019→2021 | +17,283 kg | +48.0% | Cumulative 2-year surge (+48.0%) |
In contrast, 2020 and 2021 mark a clear structural break, with sharp consecutive increases culminating in the highest observed level in the series. The magnitude of this rise suggests that COVID-related pressures substantially altered underlying waste-generation dynamics, implying that post-2019 planning and forecasting should be modeled as a distinct regime rather than a continuation of pre-pandemic behavior.
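The year-over-year changes and period means for the two phases can be recomputed from the annual totals. The sketch below uses the totals listed in Tables 13 and 14; the variable and function names are hypothetical.

```python
# Annual infectious waste totals (kg), all 24 clinics, from Tables 13 and 14.
totals_kg = {
    2011: 35797, 2012: 39438, 2013: 40602, 2014: 41752, 2015: 38000, 2016: 37690,
    2017: 36242, 2018: 36742, 2019: 36004, 2020: 42865, 2021: 53287,
}

def yoy_percent(series):
    """Percentage change of each year relative to the previous year."""
    years = sorted(series)
    return {y: 100.0 * (series[y] / series[p] - 1.0) for p, y in zip(years, years[1:])}

yoy = yoy_percent(totals_kg)
phase1 = [v for y, v in totals_kg.items() if y <= 2019]
phase1_mean = sum(phase1) / len(phase1)
full_mean = sum(totals_kg.values()) / len(totals_kg)
surge_kg = totals_kg[2021] - totals_kg[2019]
surge_pct = 100.0 * surge_kg / totals_kg[2019]

for year in (2020, 2021):
    print(f"{year}: {yoy[year]:+.1f}% vs. previous year")
print(f"2019 to 2021: {surge_kg:+,} kg ({surge_pct:+.1f}%)")
print(f"Phase 1 mean: {phase1_mean:,.0f} kg; 2011-2021 mean: {full_mean:,.0f} kg")
```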
Table 15 presents the aggregate ten-year projection of infectious waste at the Military Medical Academy from the six models.
Aggregate projections for 2022-2031 revealed consistent growth across all models but with distinct trajectories. Random Forest increased from 44,997.7 kg (2022) to 50,857.8 kg (2031), while Gradient Boosting rose from 46,631.9 kg to 53,703.9 kg. MLP projected the highest path overall, including an early peak of 69,034.3 kg in 2024, and ended at 63,378.5 kg in 2031. For reference, the historical mean over 2011-2021 was 39,856 kg/year.
Linear models produced lower and smoother trajectories, ending near 41,000 kg in 2031. At the clinic level, forecasts preserved strong inter-clinic asymmetry, with the largest historical generators remaining dominant contributors throughout the forecast horizon.
This creates a clear scenario corridor: linear models define a lower-growth envelope, Random Forest and Gradient Boosting define central operational scenarios, and MLP defines a higher-risk upper scenario.
Figure 11 shows a clear regime shift from historical observations (2011-2021) to projected trajectories (2022-2031). The historical series rises sharply in 2020-2021 (COVID-19 period), and the dashed vertical line separates these observed values from model-based forecasts. After the boundary, all six models remain above the pre-2020 baseline, sustaining the elevated level established during the COVID-19 pandemic.