Submitted: 17 June 2024
Posted: 19 June 2024
Abstract
Keywords:
1. Introduction
- Development of a predictive model: We present a machine learning model based on the XGBoost algorithm that predicts crime categories in Montreal. Trained on data for crimes committed in Montreal from 2015 to 2023, the XGBoost model outperforms the Decision Tree (DT) and Random Forest (RF) models.
- Classifier performance evaluation: We carry out an in-depth evaluation of the machine learning algorithms, which shows that XGBoost is more effective than its competitors (DT and RF) in terms of precision, accuracy, recall, and F1-score.
- Real-time prediction web application: We provide a web application, developed with Python Flask and Swagger UI, that integrates the XGBoost prediction model and gives users an interactive platform for entering data and receiving crime category predictions.
- Applying supervised learning to a new crime analysis data set: For the first time in Montreal, supervised learning is applied to crime data to develop a tool that helps law enforcement agencies optimize the management of their resources and fight crime more effectively.
2. Related Work
3. Data Overview
3.1. Data Collection
3.2. Data Description
4. Methodology
4.1. Data Preprocessing
4.1.1. Temporal Extraction: Weekday, Day, Month, Year
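As a minimal sketch of this step, the snippet below derives the four temporal features named in the heading from a report date in the YYYY-MM-DD format used by the dataset. The function name `extract_temporal_features` and the output keys are illustrative assumptions, not the authors' code.

```python
from datetime import date

# Weekday names indexed by date.weekday(), where Monday is 0.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def extract_temporal_features(report_date: str) -> dict:
    """Split a YYYY-MM-DD date string into weekday, day, month, and year."""
    d = date.fromisoformat(report_date)
    return {
        "weekday": WEEKDAYS[d.weekday()],
        "day": d.day,
        "month": d.month,
        "year": d.year,
    }


# Example date chosen for illustration only.
features = extract_temporal_features("2024-06-17")
print(features)
```

Once these features exist, the original date column can be dropped, since it carries no further information for the classifier.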
4.1.2. Redundant Data Removal
4.1.3. Categorizing Variables: Numerical and Categorical
4.1.4. Handling Missing Values
4.1.5. Categorical Features Encoding
4.1.6. Dealing with Data Imbalance Issue
4.2. Chi-Square-Based Feature Selection
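The chi-square statistic is given by Equation 1:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \tag{1}$$

- $n$ denotes the total number of observation categories.
- $O_i$ represents the observed frequency in the $i$-th category.
- $E_i$ denotes the expected frequency in the $i$-th category under the null hypothesis that the feature and the target class are independent.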
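The statistic described by these terms can be sketched in plain Python. The example treats each cell of a feature-by-class contingency table as one observation category and assumes every row and column total is positive. The function name `chi_square` and the counts are hypothetical and are not taken from the Montreal dataset.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table.

    Rows are feature values, columns are target classes.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count if the feature and the class were independent.
            e_ij = row_totals[i] * col_totals[j] / total
            statistic += (o_ij - e_ij) ** 2 / e_ij
    return statistic


# Hypothetical counts for illustration.
dependent = [[10, 20], [30, 40]]
independent = [[10, 20], [20, 40]]  # rows are proportional to each other

score_dependent = chi_square(dependent)
score_independent = chi_square(independent)
print(score_dependent, score_independent)
```

A feature whose table looks like `independent` scores zero and would be a candidate for removal, while larger scores indicate a stronger association with the crime category.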
4.3. Exploratory Data Analysis
- Distribution of crime categories over different times of the day.
- Weekly distribution of crime categories.
- Monthly distribution of crime categories.
- Yearly distribution of crime categories.
- Heatmap of crime numbers by time of day and day of the week.
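The heatmap in the last item is, underneath, a matrix of incident counts. A minimal sketch of building that matrix is given below. The records are hypothetical, and the name `count_by_time_and_weekday` is an assumption made for illustration.

```python
from collections import Counter

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
TIMES_OF_DAY = ["Day", "Evening", "Night"]


def count_by_time_and_weekday(records):
    """Return a matrix of counts: one row per time of day, one column per weekday."""
    counts = Counter(records)
    return [[counts[(day, time)] for day in WEEKDAYS] for time in TIMES_OF_DAY]


# Hypothetical (weekday, time of day) pairs; real rows would come from the dataset.
records = [
    ("Monday", "Day"),
    ("Monday", "Night"),
    ("Friday", "Evening"),
    ("Friday", "Evening"),
    ("Sunday", "Day"),
]

matrix = count_by_time_and_weekday(records)
for time, row in zip(TIMES_OF_DAY, matrix):
    print(time, row)
```

The resulting matrix can be passed to any plotting library to draw the heatmap.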
4.4. Development of the Predictive Model
4.4.1. eXtreme Gradient Boosting (XGBoost)
4.4.2. Decision Tree (DT)
4.4.3. Random Forest (RF)
4.4.4. Model Evaluation
- True Negative (TN): cases that were correctly classified as negative.
- False Negative (FN): positive cases that were incorrectly classified as negative.
- True Positive (TP): cases that were correctly classified as positive.
- False Positive (FP): negative cases that were incorrectly classified as positive.
- Accuracy: The proportion of correctly predicted instances out of the total number of instances. Accuracy is calculated using Equation 2:

  $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

- Recall (Sensitivity): The proportion of correctly predicted positive instances out of all instances that actually belong to the positive class. Recall is calculated using Equation 3:

  $$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

- Precision (P): The proportion of correctly predicted positive instances out of all instances predicted as positive. Precision is calculated using Equation 4:

  $$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

- F1 Score: The harmonic mean of precision and recall, which therefore takes both false positives and false negatives into account. It indicates how well a classifier balances recall and precision. The F1 score is calculated using Equation 5:

  $$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$
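The four metrics described above can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions. The function names and the example counts are hypothetical and do not come from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    denominator = precision + recall
    return 2 * precision * recall / denominator if denominator else 0.0


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1_score(precision, recall),
    }


# Hypothetical counts for a single class treated as "positive".
metrics = classification_metrics(tp=80, tn=880, fp=20, fn=20)
print(metrics)
```

As a consistency check against the reported results, `f1_score(0.82, 0.62)` rounds to 0.71, which matches the XGBoost F1-score listed for Theft from/in Motor Vehicle.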
5. Results and Discussion
5.1. Results
5.1.1. Interpretation of Results
6. Deployment of the Montreal Crime Category Predictive Model
7. Conclusions and Future Work
Supplementary Materials
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Abbas, S.; Shouping, L.; Sidra, F.; Sharif, A. Impact of Crime on Socio-Economic Development: A Study of Karachi. Malaysian Journal of Social Sciences and Humanities (MJSSH) 2018, 3, 148–159.
2. Service de Police de la Ville de Montréal. Rapport d’activités 2022. https://spvm.qc.ca/upload/Rapport_activites_2022_SPVM_Final.pdf, 2022. Accessed: 5 March 2024.
3. Dakalbab, F.; Talib, M.A.; Waraga, O.A.; Nassif, A.B.; Abbas, S.; Nasir, Q. Artificial intelligence & crime prediction: A systematic literature review. Social Sciences & Humanities Open 2022, 6, 100342.
4. Jenga, K.; Catal, C.; Kar, G. Machine learning in crime prediction. Journal of Ambient Intelligence and Humanized Computing 2023, 14, 2887–2913.
5. Zaidi, N.A.S.; Mustapha, A.; Mostafa, S.A.; Razali, M.N. A classification approach for crime prediction. Applied Computing to Support Industry: Innovation and Technology: First International Conference, ACRIT 2019, Ramadi, Iraq, September 15–16, 2019, Revised Selected Papers 1. Springer, 2020, pp. 68–78.
6. Aldossari, B.S.; Alqahtani, F.M.; Alshahrani, N.S.; Alhammam, M.M.; Alzamanan, R.M.; Aslam, N.; Irfanullah. A comparative study of decision tree and naive Bayes machine learning model for crime category prediction in Chicago. Proceedings of the 2020 6th International Conference on Computing and Data Engineering, 2020, pp. 34–38.
7. Kshatri, S.S.; Singh, D.; Narain, B.; Bhatia, S.; Quasim, M.T.; Sinha, G.R. An empirical analysis of machine learning algorithms for crime prediction using stacked generalization: an ensemble approach. IEEE Access 2021, 9, 67488–67500.
8. AlAbdouli, S.K.; Alomosh, A.F.; Nassif, A.B.; Nasir, Q. Comparison of Machine Learning Algorithms for Crime Prediction in Dubai. International Journal of Advanced Computer Science and Applications 2023, 14.
9. Tamir, A.; Watson, E.; Willett, B.; Hasan, Q.; Yuan, J.S. Crime Prediction and Forecasting using Machine Learning Algorithms. International Journal of Computer Science and Information Technologies 2021, 12, 26–33.
10. Arslan, R.S.; Dülgeroğlu, B. Crime Classification using Categorical Feature Engineering and Machine Learning. database 2023.
11. Hossain, S.; Abtahee, A.; Kashem, I.; Hoque, M.M.; Sarker, I.H. Crime prediction using spatio-temporal data. Computing Science, Communication and Security: First International Conference, COMS2 2020, Gujarat, India, March 26–27, 2020, Revised Selected Papers 1. Springer, 2020, pp. 277–289.
12. Mahmud, S.; Nuha, M.; Sattar, A. Crime rate prediction using machine learning and data mining. Soft Computing Techniques and Applications: Proceeding of the International Conference on Computing and Communication (IC3 2020). Springer, 2021, pp. 59–69.
13. Safat, W.; Asghar, S.; Gillani, S.A. Empirical analysis for crime prediction and forecasting using machine learning and deep learning techniques. IEEE Access 2021, 9, 70080–70094.
14. Khan, M.; Ali, A.; Alharbi, Y.; et al. Predicting and preventing crime: a crime prediction model using San Francisco crime data by classification techniques. Complexity 2022, 2022.
15. Ville de Montréal. Actes criminels [Dataset]. In Données Québec, 2016. Updated 9 March 2024. Accessed: 19 December 2023.
16. Creative Commons. Attribution 4.0 International (CC BY 4.0). Creative Commons License, 2013. Accessed: 20 December 2023.
17. McKinney, W.; et al. pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing 2011, 14, 1–9.
18. Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A survey on missing data in machine learning. Journal of Big Data 2021, 8, 1–37.
19. Muktar, B.; Fono, V. Towards Safer Roads: Predicting the Severity of Traffic Accident in Montreal using Machine Learning, 2024.
20. Dahouda, M.K.; Joe, I. A deep-learned embedding technique for categorical features encoding. IEEE Access 2021, 9, 114381–114391.
21. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 2002, 16, 321–357.
22. Swana, E.F.; Doorsamy, W.; Bokoro, P. Tomek link and SMOTE approaches for machine fault classification with an imbalanced dataset. Sensors 2022, 22, 3246.
23. Muntasir Nishat, M.; Faisal, F.; Jahan Ratul, I.; Al-Monsur, A.; Ar-Rafi, A.M.; Nasrullah, S.M.; Reza, M.T.; Khan, M.R.H. A comprehensive investigation of the performances of different machine learning classifiers with SMOTE-ENN oversampling technique and hyperparameter optimization for imbalanced heart failure dataset. Scientific Programming 2022, 2022, 1–17.
24. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 2008, pp. 1322–1328.
25. Hairani, H.; Anggrawan, A.; Priyanto, D. Improvement Performance of the Random Forest Method on Unbalanced Diabetes Data Classification Using Smote-Tomek Link. JOIV: International Journal on Informatics Visualization 2023, 7, 258–264.
26. Hu, Z.; Wang, L.; Qi, L.; Li, Y.; Yang, W. A novel wireless network intrusion detection method based on adaptive synthetic sampling and an improved convolutional neural network. IEEE Access 2020, 8, 195741–195751.
27. Mohammed, A.J.; Hassan, M.M.; Kadir, D.H. Improving classification performance for a novel imbalanced medical dataset using SMOTE method. International Journal of Advanced Trends in Computer Science and Engineering 2020, 9, 3161–3172.
28. Nabil, A.; Seyam, M.; Abou-Elfetouh, A. Prediction of students’ academic performance based on courses’ grades using deep neural networks. IEEE Access 2021, 9, 140731–140746.
29. Ray, S.; Alshouiliy, K.; Roy, A.; AlGhamdi, A.; Agrawal, D.P. Chi-squared based feature selection for stroke prediction using AzureML. 2020 Intermountain Engineering, Technology and Computing (IETC). IEEE, 2020, pp. 1–6.
30. Spencer, R.; Thabtah, F.; Abdelhamid, N.; Thompson, M. Exploring feature selection and classification methods for predicting heart disease. Digital Health 2020, 6, 2055207620914777.
31. Thaseen, I.S.; Kumar, C.A. Intrusion detection model using fusion of chi-square feature selection and multi class SVM. Journal of King Saud University-Computer and Information Sciences 2017, 29, 462–472.
32. Ahishakiye, E.; Taremwa, D.; Omulo, E.O.; Niyonzima, I. Crime prediction using decision tree (J48) classification algorithm. International Journal of Computer and Information Technology 2017, 6, 188–195.
33. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
34. Sarveshvar, M.; Gogoi, A.; Chaubey, A.K.; Rohit, S.; Mahesh, T. Performance of different machine learning techniques for the prediction of heart diseases. 2021 International Conference on Forensics, Analytics, Big Data, Security (FABS). IEEE, 2021, Vol. 1, pp. 1–4.
35. Mufid, M.R.; Basofi, A.; Al Rasyid, M.U.H.; Rochimansyah, I.F.; et al. Design an MVC model using Python for Flask framework development. 2019 International Electronics Symposium (IES). IEEE, 2019, pp. 214–219.

| Study | Focus | Data Used | Models Evaluated | Key Findings |
|---|---|---|---|---|
| [5] | Crime category prediction | Crime and Communities dataset, UCI | RF, SVM | RF outperforms SVM with 99% accuracy. |
| [6] | Crime category prediction in Chicago | Chicago crime data | Decision Trees, Naive Bayes | Decision Trees outperform Naive Bayes with 91.68% accuracy. |
| [7] | Crime prediction in India | NCRB Indian criminal records | SVM, J48, SMO, Naïve Bayes, Bagging, Random Forest | Ensemble method (SBCPM) shows superior performance with 99.5% accuracy. |
| [8] | Crime prediction in Dubai | Dubai crime data | Random Forest, KNN, SVM, ANN, Naïve Bayes, Decision Tree | KNN outperforms others with 78.47% accuracy. |
| [9] | Crime prediction in urban areas | Chicago Police Department’s CLEAR system data | Random Forest, KNN, AdaBoost, Neural Networks | Neural Network shows highest accuracy at 90.77%. |
| [10] | Automated crime report analysis | San Francisco Crime dataset | Random Forest | Achieved 86.5% accuracy, precision, recall, F-score, AUC of 0.98. |
| [11] | Criminal activity prediction in San Francisco | San Francisco criminal incidents data | Decision Tree, KNN, Random Forest, AdaBoost | Random Forest with random undersampling achieved 99.16% accuracy. |
| [12] | Crime trend analysis in Bangladesh | Bangladesh crime data | KNN, Naive Bayes, Linear Regression | KNN outperforms with 76.92% accuracy in predicting safe routes. |
| [13] | Crime data prediction | Chicago and Los Angeles crime data | LSTM, ARIMA, Logistic Regression, SVM, Naive Bayes, KNN, Decision Tree, MLP, Random Forest, XGBoost | LSTM effective in time series forecasting; ARIMA predicts crime trends. |
| [14] | Crime prediction with machine learning | San Francisco crime data | Naive Bayes, Random Forest, Gradient Boosting Decision Trees | Gradient Boosting Decision Trees show superior performance with 98.5% accuracy. |
| Current Work | Crime category prediction in Montreal | Montreal crime data (2015-2023) | XGBoost, Decision Trees, Random Forest | XGBoost outperforms others with superior precision, accuracy, recall, and F1 score. Deployed as a web application. |
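
| Attribute | Description |
|---|---|
| Crime category | The nature of the event: theft from/in motor vehicle, break and enter, mischief, motor vehicle theft, robbery, or offenses causing death. |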
| Date | Date of event report to MCPS in YYYY-MM-DD format. |
| Time of Day | Time of day the event was reported to MCPS: day, evening, or night. |
| Police District Number | Police district number for the neighborhood where the incident occurred. |
| X | Geospatial position according to the MTM8 projection (SRID 2950). |
| Y | Geospatial position according to the MTM8 projection (SRID 2950). |
| Latitude | Geographical position of the event at an intersection according to the WGS84 geodetic reference system. |
| Longitude | Geographical position of the event at an intersection according to the WGS84 geodetic reference system. |

| Attribute | Number of Missing Values | Percentage Missing (%) |
|---|---|---|
| X | 47193 | 16.854 |
| Y | 47193 | 16.854 |
| Longitude | 47193 | 16.854 |
| Latitude | 47193 | 16.854 |
| Police District Number | 5 | 0.0018 |
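
The counts and percentages in a table like the one above can be obtained with a short helper. This is a minimal sketch in plain Python. The name `missing_summary` and the example records are hypothetical, and `None` is assumed to mark a missing value.

```python
def missing_summary(rows, attributes):
    """Return {attribute: (missing_count, missing_percentage)} for dict records."""
    total = len(rows)
    summary = {}
    for name in attributes:
        # A value counts as missing when the key is absent or set to None.
        missing = sum(1 for row in rows if row.get(name) is None)
        percentage = 100.0 * missing / total if total else 0.0
        summary[name] = (missing, percentage)
    return summary


# Hypothetical records for illustration only.
rows = [
    {"X": 1.0, "Latitude": 45.5},
    {"X": None, "Latitude": None},
    {"X": 2.0, "Latitude": 45.6},
    {"X": None, "Latitude": 45.7},
]

summary = missing_summary(rows, ["X", "Latitude"])
print(summary)
```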

| Crime Category | Numeric Encoding |
|---|---|
| Theft from/in Motor Vehicle | 0 |
| Break and Enter | 1 |
| Mischief | 2 |
| Motor Vehicle Theft | 3 |
| Robbery | 4 |
| Offenses Causing Death | 5 |

| Time of Day | Numeric Encoding |
|---|---|
| Day | 0 |
| Evening | 1 |
| Night | 2 |
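
The two encodings listed above can be written as plain dictionaries. The sketch below copies the codes from the tables. The names `CRIME_CATEGORY_CODES`, `TIME_OF_DAY_CODES`, and `encode_record` are assumptions made for illustration and are not the authors' identifiers.

```python
# Codes copied from the crime category encoding table.
CRIME_CATEGORY_CODES = {
    "Theft from/in Motor Vehicle": 0,
    "Break and Enter": 1,
    "Mischief": 2,
    "Motor Vehicle Theft": 3,
    "Robbery": 4,
    "Offenses Causing Death": 5,
}

# Codes copied from the time of day encoding table.
TIME_OF_DAY_CODES = {"Day": 0, "Evening": 1, "Night": 2}

# Reverse lookup, useful for turning a predicted code back into a category name.
CODE_TO_CRIME_CATEGORY = {code: name for name, code in CRIME_CATEGORY_CODES.items()}


def encode_record(crime_category: str, time_of_day: str) -> tuple:
    """Map a (crime category, time of day) pair to its numeric codes."""
    return CRIME_CATEGORY_CODES[crime_category], TIME_OF_DAY_CODES[time_of_day]


encoded = encode_record("Robbery", "Night")
print(encoded, CODE_TO_CRIME_CATEGORY[encoded[0]])
```

Raising a `KeyError` on an unknown label is a deliberate choice in this sketch, since silently assigning a default code would hide data quality problems.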

| Balancing Algorithm | Accuracy |
|---|---|
| SMOTE | 0.588510 |
| SMOTE-Tomek | 0.474414 |
| SMOTE-ENN | 0.632867 |
| ADASYN | 0.902227 |

| Class | Precision | Recall | F1-score | Support | Accuracy |
|---|---|---|---|---|---|
| Results for XGBoost | | | | | |
| Theft from/in Motor Vehicle | 0.82 | 0.62 | 0.71 | 1026 | |
| Break and Enter | 0.81 | 0.72 | 0.76 | 1848 | |
| Mischief | 0.81 | 0.65 | 0.72 | 1807 | |
| Motor Vehicle Theft | 0.84 | 0.79 | 0.81 | 2601 | |
| Robbery | 0.88 | 0.96 | 0.91 | 8894 | |
| Offenses Causing Death | 0.99 | 1.00 | 0.99 | 13669 | |
| Weighted Avg | 0.91 | 0.92 | 0.91 | 29845 | 0.92 |
| Results for Decision Tree | | | | | |
| Theft from/in Motor Vehicle | 0.53 | 0.48 | 0.51 | 1026 | |
| Break and Enter | 0.61 | 0.55 | 0.58 | 1848 | |
| Mischief | 0.57 | 0.55 | 0.56 | 1807 | |
| Motor Vehicle Theft | 0.70 | 0.68 | 0.69 | 2601 | |
| Robbery | 0.84 | 0.86 | 0.85 | 8894 | |
| Offenses Causing Death | 0.98 | 0.99 | 0.99 | 13669 | |
| Weighted Avg | 0.85 | 0.86 | 0.85 | 29845 | 0.86 |
| Results for Random Forest | | | | | |
| Theft from/in Motor Vehicle | 0.82 | 0.30 | 0.44 | 1026 | |
| Break and Enter | 0.79 | 0.32 | 0.46 | 1848 | |
| Mischief | 0.88 | 0.37 | 0.52 | 1807 | |
| Motor Vehicle Theft | 0.81 | 0.63 | 0.71 | 2601 | |
| Robbery | 0.72 | 0.92 | 0.81 | 8894 | |
| Offenses Causing Death | 0.93 | 1.00 | 0.96 | 13669 | |
| Weighted Avg | 0.84 | 0.84 | 0.82 | 29845 | 0.84 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).