Submitted: 23 September 2024
Posted: 23 September 2024
Abstract
Keywords:
1. Introduction
2. Material and Methods
2.1. Study Area
2.2. Data Collection
2.3. Data Preprocessing
2.3.1. Missing Data Imputation
2.3.2. Feature Engineering
2.4. Feature Scaling
2.5. Feature Selection
2.6. Data Splitting
2.7. Data Modeling
2.7.1. Negative Binomial Regression (NBR)
2.7.2. Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors (SARIMAX)
2.7.3. Extreme Gradient Boosting (XGBoost) Regression
2.7.4. Long Short-Term Memory (LSTM)
2.8. Predictive Performance and Model Validation
2.8.1. Evaluation Metrics
2.8.2. Model Validation Approach
2.8.3. Cross-Validation Strategy
2.8.4. Overfitting and Generalization
2.8.5. Model Refinement and Iteration
2.8.6. Final Model Selection
2.9. Statistical Analysis
3. Results
3.1. Trends and Fluctuations in Annual Dengue Incidence (2003-2022)
3.2. Descriptive Statistics
3.3. Correlation Analysis
3.4. Negative Binomial Regression (NBR)
3.5. SARIMAX Model
3.5.1. SARIMAX #1 Baseline Model
3.5.2. SARIMAX #2 Full Multivariate Model
3.6. XGBoost Regression
3.6.1. Model #1
3.6.2. Model #2
3.7. LSTM Neural Network
3.7.1. LSTM Model #1
3.7.2. LSTM Model #2
3.7.3. LSTM Model #3
3.8. Assessment of Predictive Performance
4. Discussion
5. Conclusions
Supplementary Materials
Funding
Compliance with Ethical Standards
Acknowledgments
Conflicts of Interest
References

| No. | Parameter | Symbol | Unit |
|---|---|---|---|
| 1 | Temperature at 2 Meters Range | T2M_RANGE | °C |
| 2 | Temperature at 2 Meters Maximum | T2M_MAX | °C |
| 3 | Temperature at 2 Meters Minimum | T2M_MIN | °C |
| 4 | Temperature at 2 Meters | T2M | °C |
| 5 | Relative Humidity at 2 Meters | RH2M | % |
| 6 | Precipitation Corrected | PRECTOTCORR | mm/day |
| 7 | Surface Pressure | PS | kPa |
| 8 | Wind Speed at 10 Meters | WS10M | m/s |
| 9 | Wind Speed at 10 Meters Maximum | WS10M_MAX | m/s |
| 10 | Wind Speed at 10 Meters Minimum | WS10M_MIN | m/s |
| 11 | Wind Speed at 10 Meters Range | WS10M_RANGE | m/s |
| 12 | Wind Direction at 10 Meters | WD10M | Degrees |
| 13 | Sea Surface Temperature | SST | °C |

| Variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| DF_case | 1044 | 64.943 | 111.994 | 0 | 15 | 32 | 68 | 913 |
| T2M_RANGE | 1044 | 3.998 | 1.166 | 1.86 | 3.057 | 3.769 | 4.926 | 7.263 |
| T2M_MAX | 1044 | 29.63 | 1.082 | 26.57 | 28.852 | 29.427 | 30.242 | 34.044 |
| T2M_MIN | 1044 | 25.633 | 1.404 | 21.071 | 24.863 | 25.826 | 26.527 | 29.596 |
| T2M | 1044 | 27.376 | 1.131 | 24.126 | 26.768 | 27.34 | 28.012 | 31.453 |
| RH2M | 1044 | 79.275 | 5.937 | 58.286 | 74.646 | 80.393 | 84.312 | 89.134 |
| PRECTOTCORR | 1044 | 4.145 | 4.783 | 0 | 0.197 | 2.457 | 6.729 | 42.527 |
| PS | 1044 | 100.538 | 0.184 | 100.08 | 100.407 | 100.509 | 100.661 | 101.144 |
| WS10M | 1044 | 5.764 | 1.538 | 1.897 | 4.631 | 5.656 | 6.936 | 10.234 |
| WS10M_MAX | 1044 | 7.406 | 1.772 | 2.851 | 6.064 | 7.304 | 8.776 | 12.514 |
| WS10M_MIN | 1044 | 4.093 | 1.367 | 0.844 | 3.106 | 4.004 | 5.099 | 8.21 |
| WS10M_RANGE | 1044 | 3.313 | 0.88 | 1.42 | 2.657 | 3.2 | 3.896 | 6.61 |
| WD10M | 1044 | 154.463 | 72.843 | 49.03 | 86.743 | 123.339 | 239.295 | 269.124 |
| SST | 1044 | 26.999 | 0.934 | 24.7 | 26.3 | 27.1 | 27.6 | 29.8 |
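
The DF_case row above gives a rough indication of why a negative binomial model is a natural candidate for these counts. The sketch below uses only the mean and standard deviation from the table to compute a variance-to-mean ratio (which equals 1 under a Poisson model) and a method-of-moments estimate of the dispersion parameter. It assumes the common NB2 variance form Var(Y) = mu + alpha * mu^2. The variable names are illustrative, and the estimate is marginal (it ignores covariates), so it is not the dispersion of any fitted model in the paper.

```python
# Summary values for DF_case copied from the descriptive-statistics table.
mean_cases = 64.943
std_cases = 111.994

variance = std_cases ** 2

# Under a Poisson model the variance equals the mean, so this ratio would be ~1.
dispersion_ratio = variance / mean_cases

# Method-of-moments estimate of the NB2 dispersion parameter alpha,
# assuming Var(Y) = mu + alpha * mu**2 (marginal estimate, no covariates).
alpha_mom = (variance - mean_cases) / mean_cases ** 2

print(f"variance / mean ratio: {dispersion_ratio:.1f}")
print(f"moment estimate of alpha: {alpha_mom:.2f}")
```

A ratio far above 1 points to strong overdispersion in the raw case counts, which a Poisson regression would not accommodate.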

| Variable | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 20.2737 | 10.862 | 1.866 | 0.062 | -1.016 | 41.563 |
| T2M_RANGE | -0.1264 | 0.017 | -7.242 | <0.001 | -0.161 | -0.092 |
| T2M_MAX | -0.2771 | 0.016 | -16.913 | <0.001 | -0.309 | -0.245 |
| RH2M | 0.0563 | 0.005 | 12.386 | <0.001 | 0.047 | 0.065 |
| PRECTOTCORR | -0.0141 | 0.003 | -4.037 | <0.001 | -0.021 | -0.007 |
| PS | -0.1235 | 0.104 | -1.185 | 0.236 | -0.328 | 0.081 |
| WD10M | -0.0004 | 0.000 | -1.615 | 0.106 | -0.001 | 8.00E-05 |
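
Coefficients of a count regression are easier to interpret after exponentiation. The sketch below converts the coefficients and confidence bounds from the table above into incidence rate ratios (IRRs). It assumes the model uses a log link, which is the usual choice for negative binomial regression but is not confirmed by the table itself. The dictionary and function names are illustrative.

```python
import math

# (coef, lower 95% bound, upper 95% bound) copied from the regression table.
nbr_terms = {
    "T2M_RANGE": (-0.1264, -0.161, -0.092),
    "T2M_MAX": (-0.2771, -0.309, -0.245),
    "RH2M": (0.0563, 0.047, 0.065),
    "PRECTOTCORR": (-0.0141, -0.021, -0.007),
    "PS": (-0.1235, -0.328, 0.081),
    "WD10M": (-0.0004, -0.001, 8.00e-05),
}

def incidence_rate_ratio(coef, lower, upper):
    """Exponentiate a log-link coefficient and its confidence bounds."""
    return math.exp(coef), math.exp(lower), math.exp(upper)

irr = {name: incidence_rate_ratio(*vals) for name, vals in nbr_terms.items()}

for name, (est, lo, hi) in irr.items():
    pct_change = (est - 1.0) * 100.0  # % change in expected cases per unit increase
    print(f"{name:12s} IRR={est:.3f} [{lo:.3f}, {hi:.3f}]  ({pct_change:+.1f}% per unit)")
```

Under that assumption, an IRR above 1 (as for RH2M) indicates more expected cases per unit increase in the predictor with the others held fixed, and an IRR below 1 (as for T2M_MAX) indicates fewer. Intervals that include 1 (PS, WD10M) correspond to the non-significant rows.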

| Model | MAE (Train) | MAE (Test) | RMSE (Train) | RMSE (Test) | Key Observations |
|---|---|---|---|---|---|
| NBR #1 | 26.321 | 25.822 | 37.380 | 33.413 | Reasonable generalization but struggles with larger fluctuations. |
| NBR #2 | 25.916 | 25.556 | 37.397 | 33.502 | Slight improvement; still lacks peak detection. |
| NBR #3 | 25.426 | 24.846 | 37.101 | 32.364 | Better generalization, but key outbreak peaks remain undetected. |
| NBR #4 | 20.748 | 21.409 | 29.755 | 26.427 | Significant improvement in predictive accuracy and generalization. |
| SARIMAX #1 | 10.539 | 20.307 | 16.101 | 27.190 | Strong overfitting; poor generalization to test data. |
| SARIMAX #2 | 11.367 | 17.017 | 15.786 | 22.635 | Reduced overfitting; still requires fine-tuning for better test performance. |
| XGBoost #1 | 1.074 | 21.767 | 1.461 | 29.732 | Severe overfitting; excellent train performance but poor test generalization. |
| XGBoost #2 | 6.631 | 24.450 | 13.035 | 30.973 | Minor improvement with lagged variables, overfitting persists. |
| LSTM #1 | 23.731 | 28.856 | 38.031 | 38.650 | Captures seasonality but fails to detect individual peaks or outbreaks. |
| LSTM #2 | 18.704 | 18.143 | 29.922 | 24.368 | Deeper model improves fit, but test performance remains inconsistent. |
| LSTM #3 | 13.890 | 24.859 | 20.544 | 34.920 | Improved training results, but test performance is weak, especially with peaks. |
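
The overfitting remarks in the table can be made concrete with a simple generalization ratio. The sketch below is one possible way to summarize the table, not the authors' procedure. It divides test RMSE by train RMSE for each model and also ranks models by test RMSE. All numbers are copied from the table above, and the variable names are illustrative.

```python
# (MAE train, MAE test, RMSE train, RMSE test) copied from the performance table.
results = {
    "NBR #1": (26.321, 25.822, 37.380, 33.413),
    "NBR #2": (25.916, 25.556, 37.397, 33.502),
    "NBR #3": (25.426, 24.846, 37.101, 32.364),
    "NBR #4": (20.748, 21.409, 29.755, 26.427),
    "SARIMAX #1": (10.539, 20.307, 16.101, 27.190),
    "SARIMAX #2": (11.367, 17.017, 15.786, 22.635),
    "XGBoost #1": (1.074, 21.767, 1.461, 29.732),
    "XGBoost #2": (6.631, 24.450, 13.035, 30.973),
    "LSTM #1": (23.731, 28.856, 38.031, 38.650),
    "LSTM #2": (18.704, 18.143, 29.922, 24.368),
    "LSTM #3": (13.890, 24.859, 20.544, 34.920),
}

# Ratio of test to train RMSE: values far above 1 suggest overfitting.
rmse_ratio = {m: v[3] / v[2] for m, v in results.items()}

# Models ordered from lowest to highest test RMSE.
ranked_by_test_rmse = sorted(results, key=lambda m: results[m][3])

best_model = ranked_by_test_rmse[0]
most_overfit = max(rmse_ratio, key=rmse_ratio.get)

for model in ranked_by_test_rmse:
    print(f"{model:11s} test RMSE={results[model][3]:7.3f}  test/train={rmse_ratio[model]:6.2f}")
```

On these figures, SARIMAX #2 has the lowest test RMSE and XGBoost #1 has by far the largest test-to-train ratio, which matches the observations recorded in the last column.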

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).