Submitted:
23 September 2024
Posted:
23 September 2024
Abstract
Keywords:
1. Introduction
2. Material and Methods
2.1. Study Area
2.2. Data Collection
2.3. Data Preprocessing
2.3.1. Missing Data Imputation
2.3.2. Feature Engineering
2.4. Feature Scaling
2.5. Feature Selection
2.6. Data Splitting
2.7. Data Modeling

2.7.1. Negative Binomial Regression (NBR)
2.7.2. Seasonal AutoRegressive Integrated Moving Average with eXogenous Regressors (SARIMAX)
2.7.3. Extreme Gradient Boosting (XGBoost) Regression
2.7.4. Long Short-Term Memory (LSTM)
2.8. Predictive Performance and Model Validation
2.8.1. Evaluation Metrics
2.8.2. Model Validation Approach
2.8.3. Cross-Validation Strategy
2.8.4. Overfitting and Generalization
2.8.5. Model Refinement and Iteration
2.8.6. Final Model Selection
2.9. Statistical Analysis
3. Results
3.1. Trends and Fluctuations in Annual Dengue Incidence (2003-2022)
3.2. Descriptive Statistics
3.3. Correlation Analysis
3.4. Negative Binomial Regression (NBR)
3.5. SARIMAX Model
3.5.1. SARIMAX #1 Baseline Model
3.5.2. SARIMAX #2 Full Multivariate Model
3.6. XGBoost Regression
3.6.1. Model #1
3.6.2. Model #2
3.7. LSTM Neural Network
3.7.1. LSTM Model #1
3.7.2. LSTM Model #2
3.7.3. LSTM Model #3
3.8. Assessment of Predictive Performance
4. Discussion
5. Conclusion
Supplementary Materials
Funding
Acknowledgments
Conflicts of Interest
References

| No. | Parameter | Symbol | Unit |
|---|---|---|---|
| 1 | Temperature at 2 Meters Range | T2M_RANGE | °C |
| 2 | Temperature at 2 Meters Maximum | T2M_MAX | °C |
| 3 | Temperature at 2 Meters Minimum | T2M_MIN | °C |
| 4 | Temperature at 2 Meters | T2M | °C |
| 5 | Relative Humidity at 2 Meters | RH2M | % |
| 6 | Precipitation Corrected | PRECTOTCORR | mm/day |
| 7 | Surface Pressure | PS | kPa |
| 8 | Wind Speed at 10 Meters | WS10M | m/s |
| 9 | Wind Speed at 10 Meters Maximum | WS10M_MAX | m/s |
| 10 | Wind Speed at 10 Meters Minimum | WS10M_MIN | m/s |
| 11 | Wind Speed at 10 Meters Range | WS10M_RANGE | m/s |
| 12 | Wind Direction at 10 Meters | WD10M | Degrees |
| 13 | Sea Surface Temperature | SST | °C |

| Variables | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| DF_case | 1044 | 64.943 | 111.994 | 0 | 15 | 32 | 68 | 913 |
| T2M_RANGE | 1044 | 3.998 | 1.166 | 1.86 | 3.057 | 3.769 | 4.926 | 7.263 |
| T2M_MAX | 1044 | 29.63 | 1.082 | 26.57 | 28.852 | 29.427 | 30.242 | 34.044 |
| T2M_MIN | 1044 | 25.633 | 1.404 | 21.071 | 24.863 | 25.826 | 26.527 | 29.596 |
| T2M | 1044 | 27.376 | 1.131 | 24.126 | 26.768 | 27.34 | 28.012 | 31.453 |
| RH2M | 1044 | 79.275 | 5.937 | 58.286 | 74.646 | 80.393 | 84.312 | 89.134 |
| PRECTOTCORR | 1044 | 4.145 | 4.783 | 0 | 0.197 | 2.457 | 6.729 | 42.527 |
| PS | 1044 | 100.538 | 0.184 | 100.08 | 100.407 | 100.509 | 100.661 | 101.144 |
| WS10M | 1044 | 5.764 | 1.538 | 1.897 | 4.631 | 5.656 | 6.936 | 10.234 |
| WS10M_MAX | 1044 | 7.406 | 1.772 | 2.851 | 6.064 | 7.304 | 8.776 | 12.514 |
| WS10M_MIN | 1044 | 4.093 | 1.367 | 0.844 | 3.106 | 4.004 | 5.099 | 8.21 |
| WS10M_RANGE | 1044 | 3.313 | 0.88 | 1.42 | 2.657 | 3.2 | 3.896 | 6.61 |
| WD10M | 1044 | 154.463 | 72.843 | 49.03 | 86.743 | 123.339 | 239.295 | 269.124 |
| SST | 1044 | 26.999 | 0.934 | 24.7 | 26.3 | 27.1 | 27.6 | 29.8 |

| Variable | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 20.2737 | 10.862 | 1.866 | 0.062 | -1.016 | 41.563 |
| T2M_RANGE | -0.1264 | 0.017 | -7.242 | 0.000 | -0.161 | -0.092 |
| T2M_MAX | -0.2771 | 0.016 | -16.913 | 0.000 | -0.309 | -0.245 |
| RH2M | 0.0563 | 0.005 | 12.386 | 0.000 | 0.047 | 0.065 |
| PRECTOTCORR | -0.0141 | 0.003 | -4.037 | 0.000 | -0.021 | -0.007 |
| PS | -0.1235 | 0.104 | -1.185 | 0.236 | -0.328 | 0.081 |
| WD10M | -0.0004 | 0.000 | -1.615 | 0.106 | -0.001 | 8.00E-05 |
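
The coefficients above are easier to read as incidence rate ratios (IRRs). The sketch below assumes the negative binomial model uses the usual log link, so that exponentiating a coefficient gives the multiplicative change in expected cases per one-unit increase in the predictor. The dictionary name `coefs` and the helper `incidence_rate_ratio` are illustrative and not taken from the paper; the numbers are copied from the table. If the predictors were rescaled before fitting (the outline lists a feature scaling step), "one unit" refers to the scaled variable rather than the physical unit.

```python
import math

# (coefficient, lower 95% bound, upper 95% bound), copied from the table above.
coefs = {
    "T2M_RANGE": (-0.1264, -0.161, -0.092),
    "T2M_MAX": (-0.2771, -0.309, -0.245),
    "RH2M": (0.0563, 0.047, 0.065),
    "PRECTOTCORR": (-0.0141, -0.021, -0.007),
    "PS": (-0.1235, -0.328, 0.081),
    "WD10M": (-0.0004, -0.001, 8.00e-05),
}


def incidence_rate_ratio(coef, lower, upper):
    """Exponentiate a log-link coefficient and its interval bounds.

    Assumes a log link; this is an illustrative helper, not the authors' code.
    """
    return math.exp(coef), math.exp(lower), math.exp(upper)


irr = {name: incidence_rate_ratio(*vals) for name, vals in coefs.items()}

for name, (est, lo, hi) in irr.items():
    pct = (est - 1.0) * 100.0  # percent change in expected cases per unit
    print(f"{name:12s} IRR={est:.3f} (95% CI {lo:.3f} to {hi:.3f}), {pct:+.1f}% per unit")
```

Under these assumptions, T2M_MAX has an IRR of about 0.76, that is, roughly 24% fewer expected cases per unit increase, while the intervals for PS and WD10M include 1, consistent with their non-significant p-values in the table.
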

| Model | MAE (Train) | MAE (Test) | RMSE (Train) | RMSE (Test) | Key Observations |
|---|---|---|---|---|---|
| NBR #1 | 26.321 | 25.822 | 37.380 | 33.413 | Reasonable generalization but struggles with larger fluctuations. |
| NBR #2 | 25.916 | 25.556 | 37.397 | 33.502 | Slight improvement; still lacks peak detection. |
| NBR #3 | 25.426 | 24.846 | 37.101 | 32.364 | Better generalization, but key outbreak peaks remain undetected. |
| NBR #4 | 20.748 | 21.409 | 29.755 | 26.427 | Significant improvement in predictive accuracy and generalization. |
| SARIMAX #1 | 10.539 | 20.307 | 16.101 | 27.190 | Strong overfitting; poor generalization to test data. |
| SARIMAX #2 | 11.367 | 17.017 | 15.786 | 22.635 | Reduced overfitting; still requires fine-tuning for better test performance. |
| XGBoost #1 | 1.074 | 21.767 | 1.461 | 29.732 | Severe overfitting; excellent train performance but poor test generalization. |
| XGBoost #2 | 6.631 | 24.450 | 13.035 | 30.973 | Minor improvement with lagged variables, overfitting persists. |
| LSTM #1 | 23.731 | 28.856 | 38.031 | 38.650 | Captures seasonality but fails to detect individual peaks or outbreaks. |
| LSTM #2 | 18.704 | 18.143 | 29.922 | 24.368 | Deeper model improves fit, but test performance remains inconsistent. |
| LSTM #3 | 13.890 | 24.859 | 20.544 | 34.920 | Improved training results, but test performance is weak, especially with peaks. |
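
One way to put numbers on the overfitting remarks in the table is to compare test and train error directly. The sketch below ranks the models by test RMSE and computes the ratio of test RMSE to train RMSE as a rough generalization-gap indicator. This ratio is an illustrative heuristic, not a metric reported by the authors, and the names `metrics` and `generalization_gap` are hypothetical; the values are copied from the table.

```python
# (MAE train, MAE test, RMSE train, RMSE test), copied from the table above.
metrics = {
    "NBR #1": (26.321, 25.822, 37.380, 33.413),
    "NBR #2": (25.916, 25.556, 37.397, 33.502),
    "NBR #3": (25.426, 24.846, 37.101, 32.364),
    "NBR #4": (20.748, 21.409, 29.755, 26.427),
    "SARIMAX #1": (10.539, 20.307, 16.101, 27.190),
    "SARIMAX #2": (11.367, 17.017, 15.786, 22.635),
    "XGBoost #1": (1.074, 21.767, 1.461, 29.732),
    "XGBoost #2": (6.631, 24.450, 13.035, 30.973),
    "LSTM #1": (23.731, 28.856, 38.031, 38.650),
    "LSTM #2": (18.704, 18.143, 29.922, 24.368),
    "LSTM #3": (13.890, 24.859, 20.544, 34.920),
}


def generalization_gap(row):
    """Ratio of test RMSE to train RMSE; values far above 1 suggest overfitting."""
    _mae_train, _mae_test, rmse_train, rmse_test = row
    return rmse_test / rmse_train


gaps = {model: generalization_gap(row) for model, row in metrics.items()}

# Sort model names from lowest to highest test RMSE.
ranked_by_test_rmse = sorted(metrics, key=lambda model: metrics[model][3])

for model in ranked_by_test_rmse:
    print(f"{model:11s} test RMSE={metrics[model][3]:7.3f}  test/train ratio={gaps[model]:6.2f}")
```

On these figures, SARIMAX #2 has the lowest test RMSE and XGBoost #1 has by far the largest test/train ratio, which matches the "severe overfitting" note in the table. A ratio below 1, as for NBR #4, only means the test period happened to be easier to fit than the training period, so the ratio should be read alongside the absolute errors.
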

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).