Submitted: 04 November 2023
Posted: 08 November 2023
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Study population and data description

2.2. Data preprocessing
- Particle swarm optimization (PSO) algorithm: This technique searches for the optimal subset of features. It locates the minimum of a function by creating several ‘particles’. Each particle keeps track of its own best position as well as the swarm's global best position. It is this combination of local and global information that gives rise to ‘swarm intelligence’ [21]. In our study, we implemented XGBoost and linear regression algorithms to select the best features.
- Recursive feature elimination: This technique selects the optimal subset of features for an estimator by starting from all N features and iteratively removing the least useful ones [22]. The best subset is then chosen based on the model's accuracy, cross-validation score, or ROC-AUC.
- Univariate feature selection: This approach selects the best features using univariate statistical tests. It can be regarded as a preprocessing step for the estimator [23]. In our study, we implemented the chi-squared statistical test using the SelectKBest method.
- Feature importance: This approach scores each attribute by how useful it is for creating splits. Ensembles of decision trees, such as extra trees and random forests, can be used to rank the relevance of features [24]. In our study, we employed the extra trees classifier for feature selection.
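The three scikit-learn-based selectors described above could be sketched roughly as follows. This is an illustrative sketch only, not the study's actual pipeline: the synthetic data, the placeholder feature names, the logistic-regression estimator inside RFE, and the choice of six features per method are all assumptions made for the example (the PSO-based selector is not shown).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the driving-simulator data (assumption: 10 features).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
feature_names = np.array([f"feature_{i}" for i in range(X.shape[1])])
K = 6  # assumed number of features to keep per method

# Univariate selection: chi2 needs non-negative inputs, so rescale to [0, 1].
X_nonneg = MinMaxScaler().fit_transform(X)
univariate = SelectKBest(score_func=chi2, k=K).fit(X_nonneg, y)
univariate_features = feature_names[univariate.get_support()]

# Recursive feature elimination: drop one feature per iteration until K remain.
# The wrapped estimator (logistic regression) is an assumption of this sketch.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=K)
rfe_features = feature_names[rfe.fit(X, y).get_support()]

# Feature importance: rank features by extra-trees impurity-based importance.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
top_k = np.argsort(forest.feature_importances_)[::-1][:K]
importance_features = feature_names[np.sort(top_k)]

print("Univariate (chi2):", list(univariate_features))
print("RFE:", list(rfe_features))
print("Extra trees:", list(importance_features))
```

Comparing the three printed lists gives a quick view of which features every method agrees on, which is the kind of overlap the per-method feature table summarizes.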
2.3. Building the two-layer ensemble model
2.4. Validation and performance measurement
2.5. Data Oversampling
3. Results
3.1. Results of the classification before SMOTE
3.2. Results of the classification after SMOTE
3.3. Results of the proposed ensemble model
4. Discussion
5. Conclusion
6. Patents
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- “WHO Death On Roads.” [Online]. Available: https://extranet.who.int/roadsafety/death-on-the-roads/#deaths/per_100k.
- “Road traffic injuries.” [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries.
- “NTSA Report on Road Safety 2022.” [Online]. Available: https://www.the-star.co.ke/news/2023-01-18-4690-people-died-in-road-accidents-in-2022-report/.
- Decade of Action for Road Safety. [Online]. Available: https://www.who.int/teams/social-determinants-of-health/safety-and-mobility/decade-of-action-for-road-safety-2021-2030.
- R. E. Al Mamlook, A. Ali, R. A. Hasan, and H. A. Mohamed Kazim, “Machine Learning to Predict the Freeway Traffic Accidents-Based Driving Simulation,” in 2019 IEEE National Aerospace and Electronics Conference (NAECON), Dayton, OH, USA: IEEE, Jul. 2019, pp. 630–634. [CrossRef]
- Z. Li, H. Liao, R. Tang, G. Li, Y. Li, and C. Xu, “Mitigating the impact of outliers in traffic crash analysis: A robust Bayesian regression approach with application to tunnel crash data,” Accid. Anal. Prev., vol. 185, p. 107019, Jun. 2023. [CrossRef]
- Jamal et al., “Injury severity prediction of traffic crashes with ensemble machine learning techniques: a comparative study,” Int. J. Inj. Contr. Saf. Promot., vol. 28, no. 4, pp. 408–427, Oct. 2021. [CrossRef]
- L. Zheng, T. Sayed, and F. Mannering, “Modeling traffic conflicts for use in road safety analysis: A review of analytic methods and future directions,” Anal. Methods Accid. Res., vol. 29, p. 100142, Mar. 2021. [CrossRef]
- T. Bokaba, W. Doorsamy, and B. S. Paul, “Comparative Study of Machine Learning Classifiers for Modelling Road Traffic Accidents,” Appl. Sci., vol. 12, no. 2, p. 828, Jan. 2022. [CrossRef]
- R. E. AlMamlook, K. M. Kwayu, M. R. Alkasisbeh, and A. A. Frefer, “Comparison of Machine Learning Algorithms for Predicting Traffic Accident Severity,” in 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan: IEEE, Apr. 2019, pp. 272–276. [CrossRef]
- Y. Berhanu, E. Alemayehu, and D. Schröder, “Examining Car Accident Prediction Techniques and Road Traffic Congestion: A Comparative Analysis of Road Safety and Prevention of World Challenges in Low-Income and High-Income Countries,” J. Adv. Transp., vol. 2023, pp. 1–18, Jul. 2023. [CrossRef]
- M. Al-Nashashibi, W. Hadi, N. El-Khalili, G. Issa, and A. A. AlBanna, “A New Two-step Ensemble Learning Model for Improving Stress Prediction of Automobile Drivers,” Int. Arab J. Inf. Technol., 2021. [CrossRef]
- M. Ameksa, H. Mousannif, H. Al Moatassime, and Z. Elamrani Abou Elassad, “Crash Prediction using Ensemble Methods,” in Proceedings of the 2nd International Conference on Big Data, Modelling and Machine Learning, Kenitra, Morocco: SCITEPRESS - Science and Technology Publications, 2021, pp. 211–215. [CrossRef]
- P. A. D. Amiri and S. Pierre, “An Ensemble-Based Machine Learning Model for Forecasting Network Traffic in VANET,” IEEE Access, vol. 11, pp. 22855–22870, 2023. [CrossRef]
- K. Yang, C. A. Haddad, G. Yannis, and C. Antoniou, “Classification and Evaluation of Driving Behavior Safety Levels: A Driving Simulation Study,” IEEE Open J. Intell. Transp. Syst., vol. 3, pp. 111–125, 2022. [CrossRef]
- X. Zhang and X. Yan, “Predicting collision cases at unsignalized intersections using EEG metrics and driving simulator platform,” Accid. Anal. Prev., vol. 180, p. 106910, Feb. 2023. [CrossRef]
- W. Xiao, X. Luo, and S. Xie, “Feature semantic space-based sim2real decision model,” Appl. Intell., Jun. 2022. [CrossRef]
- M. J. Crowder, A. C. Kimber, R. L. Smith, and T. J. Sweeting, Statistical Analysis of Reliability Data, 1st ed. Routledge, 2017. [CrossRef]
- F. A. Shakil, S. M. Hossain, R. Hossain, and S. Momen, “Prediction of Road Accidents Using Data Mining Techniques,” in Proceedings of International Conference on Computational Intelligence and Emerging Power System, R. C. Bansal, A. Zemmari, K. G. Sharma, and J. Gajrani, Eds., in Algorithms for Intelligent Systems, Singapore: Springer Singapore, 2022, pp. 25–35. [CrossRef]
- B. Remeseiro and V. Bolon-Canedo, “A review of feature selection methods in medical applications,” Comput. Biol. Med., vol. 112, p. 103375, Sep. 2019. [CrossRef]
- Y. Cao, G. Liu, J. Sun, D. P. Bavirisetti, and G. Xiao, “PSO-Stacking improved ensemble model for campus building energy consumption forecasting based on priority feature selection,” J. Build. Eng., vol. 72, p. 106589, Aug. 2023. [CrossRef]
- Zhang, E. W. Patton, J. M. Swaney, and T. H. Zeng, “A Statistical Analysis of Recent Traffic Crashes in Massachusetts,” 2019. [CrossRef]
- M Ascensión, O. Ibáñez-Solé, I. Inza, A. Izeta, and M. J. Araúzo-Bravo, “Triku: a feature selection method based on nearest neighbors for single-cell data,” GigaScience, vol. 11, p. giac017, Mar. 2022. [CrossRef]
- M. Mittal, S. Gupta, S. Chauhan, and L. K. Saraswat, “Analysis on road crash severity of drivers using machine learning techniques,” Int. J. Eng. Syst. Model. Simul., vol. 13, no. 2, p. 154, 2022. [CrossRef]
- Seraj et al., “Cross-validation,” in Handbook of Hydroinformatics, Elsevier, 2023, pp. 89–105. [CrossRef]
- D. Santos, J. Saias, P. Quaresma, and V. B. Nogueira, “Machine Learning Approaches to Traffic Accident Analysis and Hotspot Prediction,” Computers, vol. 10, no. 12, p. 157, Nov. 2021. [CrossRef]
- J. Xiao, “SVM and KNN ensemble learning for traffic incident detection,” Phys. Stat. Mech. Its Appl., vol. 517, pp. 29–35, Mar. 2019. [CrossRef]
- L. Liu and M. T. Özsu, Eds., “k-Nearest Neighbor Classification,” in Encyclopedia of Database Systems, Boston, MA: Springer US, 2009, pp. 1590–1590. [CrossRef]
- P. Abdullah and T. Sipos, “Drivers’ Behavior and Traffic Accident Analysis Using Decision Tree Method,” Sustainability, vol. 14, no. 18, p. 11339, Sep. 2022. [CrossRef]
- Y. Lu, T. Ye, and J. Zheng, “Decision Tree Algorithm in Machine Learning,” in 2022 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA), Dalian, China: IEEE, Aug. 2022, pp. 1014–1017. [CrossRef]
- Wang, Y. Wang, and X. Zhang, “A Study of Fatigue Driving Detection System Based on AdaBoost Algorithm,” in 2022 4th International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM), Hamburg, Germany: IEEE, Oct. 2022, pp. 32–35. [CrossRef]
- H. Zhao, H. Yu, D. Li, T. Mao, and H. Zhu, “Vehicle Accident Risk Prediction Based on AdaBoost-SO in VANETs,” IEEE Access, vol. 7, pp. 14549–14557, 2019. [CrossRef]
- L. Yang and Q. Zhao, “An aggressive driving state recognition model using EEG based on stacking ensemble learning,” J. Transp. Saf. Secur., pp. 1–22, May 2023. [CrossRef]
- J. Tang, J. Liang, C. Han, Z. Li, and H. Huang, “Crash injury severity analysis using a two-layer Stacking framework,” Accid. Anal. Prev., vol. 122, pp. 226–238, Jan. 2019. [CrossRef]
- P. Wu, X. Meng, and L. Song, “A novel ensemble learning method for crash prediction using road geometric alignments and traffic data,” J. Transp. Saf. Secur., vol. 12, no. 9, pp. 1128–1146, Oct. 2020. [CrossRef]
- Ishaq et al., “Improving the Prediction of Heart Failure Patients’ Survival Using SMOTE and Effective Data Mining Techniques,” IEEE Access, vol. 9, pp. 39707–39716, 2021. [CrossRef]
- Z. Jiang, J. Yang, and Y. Liu, “Imbalanced Learning with Oversampling based on Classification Contribution Degree,” Adv. Theory Simul., vol. 4, no. 5, p. 2100031, May 2021. [CrossRef]
- N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, Jun. 2002. [CrossRef]
- Lee and K. Kim, “An efficient method to determine sample size in oversampling based on classification complexity for imbalanced data,” Expert Syst. Appl., vol. 184, p. 115442, Dec. 2021. [CrossRef]
- Z. Elamrani Abou Elassad, H. Mousannif, and H. Al Moatassime, “Class-imbalanced crash prediction based on real-time traffic and weather data: A driving simulator study,” Traffic Inj. Prev., vol. 21, no. 3, pp. 201–208, Apr. 2020. [CrossRef]
- F. Sağlam and M. A. Cengiz, “A novel SMOTE-based resampling technique trough noise detection and the boosting procedure,” Expert Syst. Appl., vol. 200, p. 117023, Aug. 2022. [CrossRef]
- Theissler, M. Thomas, M. Burch, and F. Gerschner, “ConfusionVis: Comparative evaluation and selection of multi-class classifiers based on confusion matrices,” Knowl.-Based Syst., vol. 247, p. 108651, Jul. 2022. [CrossRef]
- M. Mokoatle, Dr. Vukosi Marivate, and Professor. Michael Esiefarienrhe Bukohwo, “Predicting Road Traffic Accident Severity using Accident Report Data in South Africa,” in Proceedings of the 20th Annual International Conference on Digital Government Research, Dubai United Arab Emirates: ACM, Jun. 2019, pp. 11–17. [CrossRef]
- U. Mansoor, N. T. Ratrout, S. M. Rahman, and K. Assi, “Crash Severity Prediction Using Two-Layer Ensemble Machine Learning Model for Proactive Emergency Management,” IEEE Access, vol. 8, pp. 210750–210762, 2020. [CrossRef]
- Aldhari, M. Almoshaogeh, A. Jamal, F. Alharbi, M. Alinizzi, and H. Haider, “Severity Prediction of Highway Crashes in Saudi Arabia Using Machine Learning Techniques,” Appl. Sci., vol. 13, no. 1, p. 233, Dec. 2022. [CrossRef]
- L. Yang et al., “Comparative Analysis of the Optimized KNN, SVM, and Ensemble DT Models Using Bayesian Optimization for Predicting Pedestrian Fatalities: An Advance towards Realizing the Sustainable Safety of Pedestrians,” Sustainability, vol. 14, no. 17, p. 10467, Aug. 2022. [CrossRef]
- T. Luo, J. Wang, T. Fu, Q. Shangguan, and S. Fang, “Risk prediction for cut-ins using multi-driver simulation data and machine learning algorithms: A comparison among decision tree, GBDT and LSTM,” Int. J. Transp. Sci. Technol., vol. 12, no. 3, pp. 862–877, Sep. 2023. [CrossRef]









| Univariate feature selection | Recursive feature elimination | Feature importance | Particle swarm optimization (PSO) |
|---|---|---|---|
| Lane gap | Lane gap | Lane gap | Lane gap |
| Speed | Speed | Speed | Speed |
| Brake | Brake | Brake | Brake |
| Education level | Education level | Education level | Driver Experience |
| Driver Experience | Driver Experience | Driver Experience | Surface condition |
| Drivers’ Age | Drivers’ Age | Drivers’ Age | Gender |

| Total instances | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | True Negative (TN) | False Positive (FP) |
| Actual positive | False Negative (FN) | True Positive (TP) |
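
The four confusion-matrix cells map onto the accuracy, precision, recall, and F1 score reported in the result tables through the standard definitions. The sketch below shows that arithmetic; the class name and the example counts are hypothetical and are not taken from the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion-matrix counts (hypothetical helper for illustration)."""

    tn: int  # true negatives
    fp: int  # false positives
    fn: int  # false negatives
    tp: int  # true positives

    @staticmethod
    def _ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    @property
    def accuracy(self) -> float:
        total = self.tn + self.fp + self.fn + self.tp
        return self._ratio(self.tp + self.tn, total)

    @property
    def precision(self) -> float:
        return self._ratio(self.tp, self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self._ratio(self.tp, self.tp + self.fn)

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        return self._ratio(2 * self.precision * self.recall,
                           self.precision + self.recall)


# Made-up counts, chosen only to exercise the formulas.
cm = ConfusionMatrix(tn=50, fp=10, fn=5, tp=35)
print(f"accuracy={cm.accuracy:.3f} precision={cm.precision:.3f} "
      f"recall={cm.recall:.3f} f1={cm.f1:.3f}")
```

Because precision and recall use different denominators, a model can score high on one and low on the other, which is why the tables report both alongside the F1 score.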

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| AdaBoost | 0.79±0.11 | 0.76±0.13 | 0.71±0.12 | 0.72±0.14 |
| k-NN | 0.79±0.08 | 0.81±0.41 | 0.66±0.19 | 0.68±0.25 |
| DT | 0.85±0.12 | 0.87±0.27 | 0.77±0.22 | 0.80±0.19 |
| NB | 0.83±0.05 | 0.82±0.20 | 0.76±0.18 | 0.78±0.10 |
| Two-Layer Ensemble | 0.83±0.06 | 0.91±0.25 | 0.75±0.19 | 0.79±0.11 |

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| AdaBoost | 0.79±0.09 | 0.75±0.12 | 0.73±0.13 | 0.74±0.08 |
| k-NN | 0.72±0.13 | 0.66±0.12 | 0.64±0.08 | 0.65±0.06 |
| DT | 0.77±0.08 | 0.69±0.90 | 0.68±0.08 | 0.68±0.10 |
| NB | 0.81±0.06 | 0.72±0.10 | 0.73±0.12 | 0.73±0.07 |
| Two-Layer Ensemble | 0.85±0.08 | 0.86±0.09 | 0.82±0.09 | 0.83±0.08 |

| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| k-NN | 0.79 | 0.97 | 0.87 |
| Decision Trees | 0.83 | 0.86 | 0.85 |
| AdaBoost | 0.83 | 0.86 | 0.85 |
| Naïve Bayes | 0.85 | 0.94 | 0.90 |
| Two-Layer Ensemble | 0.87 | 0.97 | 0.92 |

| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| k-NN | 0.80 | 0.31 | 0.44 |
| Decision Trees | 0.58 | 0.54 | 0.56 |
| AdaBoost | 0.58 | 0.54 | 0.56 |
| Naïve Bayes | 0.80 | 0.62 | 0.70 |
| Two-Layer Ensemble | 0.89 | 0.62 | 0.73 |

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| k-NN | 0.65±0.09 | 0.56±0.12 | 0.56±0.10 | 0.56±0.09 |
| Decision Trees | 0.81±0.74 | 0.83±0.12 | 0.70±0.78 | 0.73±0.72 |
| AdaBoost | 0.79±0.08 | 0.76±0.10 | 0.71±0.10 | 0.72±0.09 |
| Naïve Bayes | 0.81±0.10 | 0.77±0.12 | 0.76±0.11 | 0.77±0.09 |
| Two-Layer Ensemble | 0.88±0.08 | 0.86±0.09 | 0.83±0.11 | 0.84±0.79 |

| Work | Dataset source | Method | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|---|---|
| Aldhari et al. [44] | Collected | Ensemble | | | | |
| | | XGBoost | 94% | 94% | 94% | 94% |
| | | RF | 91% | 90% | 90% | 90% |
| | | LR | 65% | 65% | 65% | 65% |
| Yang et al. [45] | Australia road deaths database (ARDD) | Ensemble | | | | |
| | | SVM | 88% | | | |
| | | k-NN | 87% | | | |
| | | DT | 88% | | | |
| Luo et al. [46] | Driving Simulator | Classification | | | | |
| | | DT | 77% | | | |
| | | Gradient boosting decision tree (GBDT) | 80% | | | |
| | | Long-short term memory (LSTM) | 87% | | | |
| Mansoor et al. [43] | Canadian Dataset | Ensemble | | | | |
| | | k-NN | 62% | 70% | 66% | 67% |
| | | DT | 68% | 70% | 69% | 69% |
| | | AdaBoost | 72% | 72% | 72% | 71% |
| | | FNN | 70% | 70% | 70% | 69% |
| | | SVM | 72% | 69% | 71% | 68% |
| | | Two-Layer Ensemble | 73% | 77% | 75% | 76% |
| Proposed | Driving Simulator | Ensemble | | | | |
| | | k-NN | 56% | 56% | 56% | 65% |
| | | DT | 83% | 70% | 73% | 81% |
| | | AdaBoost | 76% | 71% | 72% | 79% |
| | | NB | 83% | 76% | 77% | 81% |
| | | Two-Layer Ensemble | 86% | 83% | 84% | 88% |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).