In this section, we outline the methodology used to model the prediction of criminal activity in Montreal and present it as a multi-class classification challenge. Criminal activity is divided into six different categories: Theft from/in Motor Vehicle, Break and Enter, Mischief, Motor Vehicle Theft, Robbery, and Offenses Causing Death.
Our research is based on a comprehensive evaluation of several classification algorithms: eXtreme Gradient Boosting (XGBoost), Decision Tree (DT), and Random Forest (RF). The goal of this evaluation is to determine which algorithm produces the most accurate results (in terms of precision and F1-score) when applied to a dataset of criminal activity in Montreal.
The remainder of this section details the steps followed to develop the model. These steps comprise several key phases: data preprocessing, feature selection, exploratory data analysis, development of the predictive model, and validation and evaluation of the model.
4.1. Data Preprocessing
This subsection describes the data preprocessing phase that precedes analysis. It includes several steps: extracting new features from the Date column, removing redundant data, grouping variables into numerical and categorical, handling missing data, encoding categorical variables, and dealing with class imbalance.
4.1.1. Temporal Extraction: Weekday, Day, Month, Year
Our dataset contains a Date column in the format YYYY-MM-DD. We use the to_datetime function from the Python Pandas library [17] to convert this column into a format that our prediction model can interpret. This conversion facilitates the extraction of additional temporal features, including day of the week, day, month, and year. Such a transformation improves the usability of the Date column and allows the prediction model to better understand and use temporal information.
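As a minimal sketch of this step, the following assumes a DataFrame with a Date column of YYYY-MM-DD strings. The three sample rows are hypothetical, and the integer weekday encoding (Monday = 0) is an assumption, since the text does not state how Weekday is represented.

```python
import pandas as pd

# Hypothetical three-row sample; the real dataset is not reproduced here.
df = pd.DataFrame({"Date": ["2021-03-15", "2022-07-02", "2023-12-31"]})

# Parse the YYYY-MM-DD strings into datetime64 values.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

# Derive the four temporal features named in the text.
# Assumption: Weekday is stored as an integer, Monday = 0 ... Sunday = 6.
df["Weekday"] = df["Date"].dt.dayofweek
df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year

print(df)
```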
4.1.2. Redundant Data Removal
In this phase, we perform a transformation aimed at removing redundant data from our dataset. After the additional columns Weekday, Day, Month, and Year are successfully extracted from the Date column, retaining this column is no longer necessary. Therefore, the Date column is identified as redundant and is subsequently removed from our dataset using the drop method provided by the Python Pandas library.
4.1.3. Categorizing Variables: Numerical and Categorical
In this phase, we categorized attributes according to their type – numerical or categorical. Specifically, attributes labeled as object type were classified as categorical variables. Conversely, attributes corresponding to the int64 and float64 data types were identified as numeric variables. To facilitate this classification, we used the select_dtypes function from the Python Pandas library.
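The grouping can be sketched as follows. The toy DataFrame is hypothetical (the Longitude column in particular is only an illustrative name), and the dtypes are set explicitly so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical sample with explicit dtypes; column names are illustrative.
df = pd.DataFrame({
    "Crime Category": pd.Series(["Mischief", "Robbery"], dtype="object"),
    "Time": pd.Series(["day", "night"], dtype="object"),
    "Year": pd.Series([2021, 2022], dtype="int64"),
    "Longitude": pd.Series([-73.57, -73.61], dtype="float64"),
})

# object columns are treated as categorical variables.
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()

# int64 and float64 columns are treated as numerical variables.
numerical_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```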
4.1.4. Handling Missing Values
In this phase, we address the issue of missing data in our dataset. For each attribute, we determine both the number and percentage of missing data.
Table 3 provides a summary of the missing data for each attribute, indicating both the amount and percentage of missing values.
Table 3 shows that the proportion of missing values per attribute ranges from a minimum of 0.0018% to a maximum of 16.854%. Since the missing data affects only numerical attributes, we use an imputation strategy in which the missing values in each column are replaced by the mean of that column, as suggested in [18,19].
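A minimal sketch of the missing-value summary and mean imputation is given below. The Longitude and Latitude columns and their values are hypothetical; the actual attributes and figures are those reported in Table 3.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical attributes with one missing value.
df = pd.DataFrame({
    "Longitude": [-73.5, np.nan, -73.7, -73.6],
    "Latitude": [45.5, 45.6, 45.4, 45.5],
})

# Number and percentage of missing values per attribute.
missing_count = df.isna().sum()
missing_pct = df.isna().mean() * 100
print(pd.DataFrame({"count": missing_count, "percent": missing_pct}))

# Replace missing values in each numerical column by that column's mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
```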
4.1.5. Categorical Features Encoding
In this phase, we encode the categorical variables. Specifically, we convert the Crime Category and Time columns into numeric values that can be interpreted by the prediction model. We use the manual label encoding technique [20], which assigns a unique numerical value to each category within the categorical variable. Table 4 and Table 5 show the manual coding for the Crime Category and Time columns.
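Manual label encoding can be sketched with explicit dictionaries, as below. The integer codes and the day/evening/night labels are illustrative assumptions; the actual mappings are the ones given in Table 4 and Table 5.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Crime Category": ["Mischief", "Robbery", "Break and Enter"],
    "Time": ["day", "night", "evening"],
})

# Illustrative codes only; the real values are those in Tables 4 and 5.
crime_codes = {
    "Theft from/in Motor Vehicle": 0,
    "Break and Enter": 1,
    "Mischief": 2,
    "Motor Vehicle Theft": 3,
    "Robbery": 4,
    "Offenses Causing Death": 5,
}
time_codes = {"day": 0, "evening": 1, "night": 2}

# Replace each category by its manually assigned code.
df["Crime Category"] = df["Crime Category"].map(crime_codes)
df["Time"] = df["Time"].map(time_codes)

print(df)
```

Writing the mapping by hand, rather than letting a library infer it, keeps the code assigned to each category fixed and documented.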
4.1.6. Dealing with Data Imbalance Issue
In this phase, we address the data imbalance problem, particularly with regard to the target variable Crime Category. Data imbalance can significantly degrade the performance of a predictive model by biasing it toward the majority classes. This issue is most pronounced in the Offenses Causing Death category within the Crime Category attribute. Although this category is crucial because it refers to crimes that result in loss of life, it is underrepresented compared to other crime categories. The underrepresentation can cause the model to treat the Offenses Causing Death category as an outlier and bias the predictions toward the more predominant categories. To address this imbalance, we evaluated four balancing techniques: the Synthetic Minority Oversampling Technique (SMOTE) [21], SMOTE combined with Tomek Links (SMOTE-Tomek) [22], SMOTE combined with Edited Nearest Neighbours (SMOTE-ENN) [23], and the Adaptive Synthetic Sampling Approach (ADASYN) [24]. These techniques were evaluated using the Random Forest classifier to determine which method best balances performance and data representation. The results of this analysis are summarized in Table 6, which reports the accuracy obtained with the RF classifier after applying each balancing technique.
Previous studies [25,26,27,28] support the selection of these balancing algorithms and show how well they work on classification problems with data imbalance.
The data in Table 6 clearly shows that the SMOTE-ENN algorithm outperforms the others in terms of accuracy. Therefore, we decided to implement this algorithm when developing our prediction model.
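To illustrate the oversampling idea that the four techniques share, the following is a simplified NumPy sketch of SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. It is not the implementation used in this work: it omits the Edited Nearest Neighbours cleaning step of SMOTE-ENN, and the function name, parameters, and toy data are hypothetical. In practice a library implementation would normally be used, for example the SMOTEENN class in imbalanced-learn.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Simplified illustration; requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)  # cannot use more neighbours than other samples

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # factor in [0, 1)

    # Each synthetic point lies on the segment between a sample and its neighbour.
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority class: four points on the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_oversample(X_minority, n_new=6, k=2)
print(synthetic)
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class.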
4.3. Exploratory Data Analysis
In this subsection, we present visual analysis charts that examine the temporal dynamics of the crime category. The following graphics are discussed:
Distribution of crime categories over different times of the day.
Weekly distribution of crime categories.
Monthly distribution of crime categories.
Yearly distribution of crime categories.
Heatmap of crime numbers by time of day and day of the week.
Figure 3 shows the distribution of crime in Montreal in six categories during the day, evening and night. The graph shows that the crime rate is higher during the day than in the evening and at night. This pattern can be explained by the increased likelihood of crimes such as Theft from/in Motor Vehicle and Break and Enter during daylight hours, when more vehicles are parked and unattended and residential properties may be vacant. Notably, the number of motor vehicle thefts remains relatively constant across all time periods, suggesting that the likelihood of this crime is not significantly affected by the time of day.
Understanding these temporal patterns is critical to public safety. The increased frequency of the Mischief and Break and Enter categories during the day could help law enforcement and public safety campaigns allocate resources effectively and inform the public about precautions during these times. Meanwhile, the steady number of motor vehicle thefts suggests that continued vigilance is warranted and that improved security technologies or community surveillance programs may be necessary.
Figure 4 illustrates the distribution of different crime categories in Montreal by day of the week. Criminal activity is more frequent on Monday than on other days. It then decreases gradually throughout the week, with Saturday and Sunday having the lowest crime rates. Crime categories including motor vehicle theft, burglaries and vandalism remain relatively stable throughout the week. However, offenses causing death occur mainly on weekends. Monday's increase in criminal activity could be due to opportunities created by the increased movement of people in Montreal earlier in the week.
Given the information provided in Figure 4, Montreal municipalities should redouble their efforts to improve public safety. For example, they could increase police patrols Monday through Friday, target crimes like vehicle theft and burglary, and pay particular attention to deadly offenses on weekends. Authorities are encouraged to develop awareness programs for Montrealers to inform them about criminal risks and promote preventive measures.
Figure 5 shows the monthly distribution of different crime categories in Montreal. There is a clear seasonal trend, with crime increasing in the warmer months from June to September. This observation suggests a connection between the summer season and the escalation of criminal activities. Among crime categories, motor vehicle theft and mischief contribute significantly to overall crime throughout the year, indicating that these types of crimes are predominant in Montreal. A significant decrease in the total number of crimes is observed in January and February, possibly due to winter weather conditions that are less conducive to committing crimes. Fatal crimes represent the least common category in terms of overall incidence, but their frequency remains relatively stable over months, suggesting that there is minimal monthly variation in these events. Burglaries and break-ins account for a moderate proportion of total annual crime, reflecting a relatively consistent incidence rate for these crime categories.
The data presented in Figure 5 are of great value for strategic planning in crime prevention and law enforcement. They enable crime peaks to be anticipated for optimal resource allocation. The consistent trends observed in specific crime categories also provide an opportunity to develop targeted and seasonally tailored crime prevention strategies.
Figure 6 shows the annual distribution of crime categories in Montreal from 2015 to 2023. From 2015 to 2020, a gradual decline in the total number of crimes is observed. This trend predates the pandemic and could be due to variables unrelated to COVID-19, such as improved safety measures, evolving policing strategies, or changes in crime reporting. The year 2020 also saw a significant decrease in criminal activity, possibly due to movement restrictions imposed at the height of the COVID-19 health crisis. Because these restrictions reduced the opportunities to commit crimes such as burglaries and vehicle thefts, they are likely to have had a particular impact on those categories.
A significant increase in crime is observed between 2021 and 2023, peaking in 2023. This increase could be explained by the easing of health restrictions, an increase in social activities, and the economic impact of the post-lockdown period, which could lead to a resurgence of certain types of crime.
The evolution of the proportions of crime categories over the years deserves particular attention. There was a significant increase in mischief and vehicle theft crimes in 2023. This phenomenon may reflect a change in crime trends in response to the societal consequences of the pandemic, including changing routines and economic constraints. Nevertheless, crimes such as robberies and crimes resulting in death did not show any significant variations compared to the other crime categories during the pandemic years. These observations must be carefully examined and supported by in-depth statistical analysis to elucidate the dynamics of crime in such a fluid context.
Figure 7 shows a heatmap illustrating the distribution of crime numbers based on the time of day and days of the week in Montreal. This visualization shows that the number of crimes committed in Montreal during the daytime, from 8:01 a.m. to 4:00 p.m., is significantly higher throughout the week, with a peak of 22,229 incidents recorded on Monday.
In the evening, from 4:01 p.m. to midnight, moderate crime can be observed, which is significantly lower than during the day. The number of crimes on Wednesdays and Thursdays tends to increase compared to other evenings of the week.
Crime decreases during the night, from 12:01 a.m. to 8:00 a.m., with the lowest count recorded on Saturday (5,422 incidents).
The data highlighted in Figure 7 provides Montreal city authorities with valuable information for planning and optimal resource allocation to prevent and combat crime.
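The counts behind such a heatmap can be sketched as a cross-tabulation of time of day against day of the week. The four sample rows and the day/evening/night labels are hypothetical stand-ins for the actual dataset.

```python
import pandas as pd

# Hypothetical incidents, one row per reported crime.
df = pd.DataFrame({
    "Weekday": ["Monday", "Monday", "Saturday", "Monday"],
    "Time": ["day", "day", "night", "evening"],
})

# Rows: time of day; columns: day of the week; cells: number of incidents.
counts = pd.crosstab(df["Time"], df["Weekday"])
print(counts)
```

The resulting table could then be passed to a plotting function such as seaborn.heatmap to draw a figure of this kind.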
4.4. Development of the Predictive Model
This subsection discusses the approach used in developing a predictive model to estimate crime categories in Montreal. Our model is built with three machine learning algorithms, XGBoost, DT, and RF, which were selected based on their proven success on similar classification tasks, as documented in the literature [9,13,19,32]. Our approach included several key steps. First, we prepared our dataset by splitting it into features and target labels, where the crime_category column was used as the target. The dataset was then divided into training and testing sets, with 20% of the data reserved for testing and a random seed used to ensure reproducibility. Both the training and test sets were standardized while retaining the original feature names.
Next, we initialized the XGBoost classifier, the DT classifier, and the RF classifier. The XGBoost classifier was configured with 1000 estimators, a maximum depth of 20, a learning rate of 0.1, and mlogloss as the evaluation metric. The RF classifier was initialized with 800 estimators and a maximum depth of 20. Specific configurations were set for each classifier to optimize their performance.
We then trained each classifier on the scaled training set and used them to predict crime categories on the scaled testing set. For each prediction set, we calculated the weighted F1 score, recorded the classifier name and the corresponding F1 score, and generated classification reports.
To determine the best-performing model, we conducted a comparative analysis by displaying a table of F1 scores for all classifiers. The model with the highest F1 score was identified as the best. Finally, we serialized the best-performing model by saving it to a file, ensuring it could be used for future predictions.
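The workflow described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this work: the data are synthetic (generated with make_classification), StandardScaler is assumed for the standardization step, the scaler is fitted on the training set only, the number of estimators is reduced so the example runs quickly, and XGBoost is included only if the xgboost package is installed.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the crime dataset (six classes, as in the text).
X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           n_redundant=0, n_classes=6,
                           n_clusters_per_class=1, random_state=42)

# Hold out 20% of the data for testing, with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize features (assumed: StandardScaler fitted on the training set).
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Reduced settings for speed; the text reports 800 (RF) and 1000 (XGBoost) estimators.
models = {
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=50, max_depth=20, random_state=42),
}
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(n_estimators=50, max_depth=5,
                                      learning_rate=0.1, eval_metric="mlogloss")
except ImportError:
    pass  # xgboost not installed; compare DT and RF only

# Train, predict, and record the weighted F1 score of each classifier.
scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    scores[name] = f1_score(y_test, y_pred, average="weighted")
    print(name)
    print(classification_report(y_test, y_pred, zero_division=0))

# Select the model with the highest weighted F1 score and serialize it.
best_name = max(scores, key=scores.get)
serialized = pickle.dumps(models[best_name])
restored = pickle.loads(serialized)
print("Best model:", best_name, round(scores[best_name], 3))
```

In a real run, the serialized bytes would be written to a file so that the selected model can be reloaded for later predictions.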
In the following part of this section, we provide a detailed description of the machine learning algorithms used in our research.
4.4.1. eXtreme Gradient Boosting (XGBoost)
XGBoost [33] uses an ensemble of decision trees and methodically integrates each new model into the existing tree framework. While neural networks often outperform various algorithms in prediction tasks, decision tree-based algorithms offer a viable alternative for predicting tabular data. Furthermore, the effectiveness of gradient boosting models such as XGBoost is significantly influenced by the tuning of numerous hyperparameters, making the tuning of these parameters a critical aspect of their application.
4.4.2. Decision Tree (DT)
Decision trees (DTs) are a core machine learning technique that is widely used in both regression and classification scenarios. This approach uses a tree-shaped framework to map decisions and their potential impacts, incorporating random event outcomes, resource costs, and overall value. Structurally similar to a flowchart, DTs have internal nodes that perform "tests" on specific attributes, branches that represent the test results, and leaf nodes that denote either a class label (for classification tasks) or a continuous number (for regression tasks). A key strength of DTs is their simplicity and interpretability: the path from the root of the tree to each leaf directly describes a classification or regression rule expressed in terms of the input variables. This clarity not only makes the model transparent but also simplifies the interpretation of how inputs affect outputs and highlights the importance of different features. These aspects are particularly valuable for applications that require an explicit explanation of the decision path, positioning DTs as a preferred option for numerous practical applications.
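The root-to-leaf rules mentioned above can be printed directly from a fitted tree. The sketch below uses scikit-learn's built-in Iris dataset purely as a stand-in (it is unrelated to the crime data) and a shallow tree so that the output stays short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset; not the crime data used in this work.
iris = load_iris()

# A shallow tree keeps the printed rules readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented branch is a test on a feature; each leaf ends in a class label.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```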
4.4.3. Random Forest (RF)
The Random Forest algorithm is a robust and flexible machine learning technique that incorporates a large number of decision trees to create a "forest" through an ensemble approach. It utilizes the method of bagging or bootstrap aggregation to improve the accuracy and stability of prediction in both classification and regression tasks as highlighted by [34]. By training individual trees on different subsets of the data and then combining their results, Random Forest effectively minimizes the likelihood of overfitting, making it a reliable method for tackling complex, data-intensive problems. Its efficiency in managing large, feature-rich datasets as well as its built-in feature selection mechanism make it a critical component in the toolkit of modern data scientists and analysts. Additionally, Random Forest is praised for its ability to deliver highly accurate models without compromising on explainability, highlighting its importance in both academic research and real-world applications.
4.4.4. Model Evaluation
Evaluation of the performance of each algorithm was done by analyzing the data encapsulated in the confusion matrix (CM). This matrix captures the actual and predicted categorizations determined by the classification mechanism. The components of the CM are defined as follows:
True Negative (TN) means cases that were accurately classified as negative.
False Negative (FN) are cases that are incorrectly classified as negative even though they are positive.
True Positive (TP) represents cases that were accurately classified as positive.
False Positive (FP) are cases that were incorrectly classified as positive when they are actually negative.
For each algorithm, we used the data derived from the confusion matrix to calculate specific performance indicators. These indicators were then used to measure the effectiveness of the models:
Accuracy: Defined as the proportion of correctly predicted instances out of the total number of instances. The accuracy is calculated using Equation 2.
Recall (Sensitivity): Identifies the proportion of correctly predicted positive instances over all instances in the true class. Recall is calculated using Equation 3.
Precision (P): Describes the proportion of correctly predicted positive instances out of all instances predicted as positive. Precision is calculated using Equation 4.
F1 Score: This metric is the harmonic mean of precision and recall and therefore takes both false positives and false negatives into account in its calculation. It serves as an indicator of a classifier's balanced performance between recall and precision. The F1 score is calculated using Equation 5.
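The indicators listed above can be computed from a multi-class confusion matrix as sketched below, treating each class in turn as the positive class. The 3x3 matrix is hypothetical, and the layout (rows are actual classes, columns are predicted classes) is an assumption that follows the scikit-learn convention.

```python
import numpy as np

def per_class_metrics(cm):
    """Return overall accuracy and per-class precision, recall, and F1.

    Assumes rows are actual classes, columns are predicted classes, and
    every class is predicted and present at least once (no zero division).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)              # correctly predicted instances of each class
    fp = cm.sum(axis=0) - tp      # predicted as the class but actually another
    fn = cm.sum(axis=1) - tp      # actually the class but predicted as another

    accuracy = tp.sum() / cm.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical confusion matrix for three classes.
cm = [[8, 2, 0],
      [1, 7, 2],
      [0, 1, 9]]

accuracy, precision, recall, f1 = per_class_metrics(cm)
print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print("F1:", f1)
```

Averaging the per-class F1 values with weights proportional to the number of actual instances of each class gives the weighted F1 score.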