RoadLytics: Road Accidents Analytics Using Artificial Intelligence to Support Deaths’ Prevention on Highways

Every day, thousands of people and goods move along Brazilian federal highways. Traffic accidents on these highways are numerous and have a significant impact on both the economy and the health system. Identifying predictor variables, the probability of an accident occurring and ways to mitigate it is of paramount importance for the transit authorities that manage these roads. The main contribution of this study is a predictive machine learning model that uses open data to show the critical points on the highways graphically. The model is fully reproducible and can be applied to any region worldwide, helping to reduce the number of accidents and to prevent deaths from automotive collisions. For this study, 43 variables were analyzed to support the identification of the causes of fatal accidents on the main highways in the south of Brazil. RoadLytics is proposed as a supervised machine learning model that uses the Random Forest algorithm to analyze about 33 thousand occurrences between 2017 and 2020. An exploratory analysis of the data was carried out to support the modeling and to facilitate data visualization, and heat maps were developed to support the analysis and identification of potential risk areas. The results show that the BR386 highway registers the highest number of fatal occurrences, regardless of the season. Concerning weather conditions, the analysis shows that 52% of accidents occurred in favorable conditions, such as clear skies, victimizing 501 people. The driver's lack of attention is the main cause of the accidents. Applying the developed model, an accuracy of 77% was achieved for the classification of fatal accidents.


Introduction
According to the World Health Organization (WHO) [1], around 1.35 million people die every year around the world as a result of road traffic crashes. These accidents cost countries more than 3% of their Gross Domestic Product (GDP). Besides, road traffic injuries are the leading cause of death for children and young adults aged 5-29 years. Concern about traffic accidents has grown worldwide as the phenomenon has emerged as a public health problem [2].
In Brazil, road accidents are a direct cause of high costs for the Unified Health System, since they account for a significant portion of care and prolonged treatments. Traffic accidents also generate other costs with direct and indirect social and economic impacts [3].
Inada et al. [4] describe that, even during the COVID-19 pandemic, Japan identified an increase in the number of deaths caused by traffic accidents. Empty roads possibly triggered speed-related traffic violations that caused fatal Motor Vehicle Collisions (MVCs). In the same way, in São Paulo, Brazil, this increase may be related to issues such as the low number of available professionals, medications, oxygen and Intensive Care Units (ICU) Sul (RS), and to analyze the variables that may contribute to the occurrence of this type of event. This state was selected because understanding traffic patterns is particularly relevant in a region that is a major producer of soy [29] and rice [30] in Brazil; both products are exported to other countries, which adds complexity to the case study scenario.
This article is organized as follows. Section 2 presents the related works. Section 3 describes the applied methodology, while Section 4 shows the exploratory analysis. Section 5 details the construction of the predictive algorithm and the training applied. Finally, Section 6 concludes the research with final considerations and future directions.

Related Works
This section presents related works selected by searching for the terms "big data", "data mining", "traffic accidents" and "machine learning" in the MDPI, IEEE Xplore, Research Gate and Scielo repositories. At the end of this section, a summary of the comparison between the works is presented.
Barroso Junior et al. [31] analyzed the lethality of traffic accidents on Brazilian federal highways in 2016, considering, in addition to the characteristics of the victims, information about the context in which these events occurred. A binomial logistic regression model was fitted in the R language, and its goodness of fit was subsequently assessed with the Hosmer and Lemeshow test. The study found that, on average, the likelihood of an accident being lethal increases by 44% for males, pedestrians, locations in the northeast region, on Sundays and in rural areas.
Chang and Park [32] led a study to identify possibly fatigued drivers and find patterns related to crashes. The study considered GPS information about the vehicles through a method based on the distribution of the driving duration and the boundary condition of the driving duration between the fatigued and non-fatigued states. The authors found that the measured fatigue data had strong explanatory power with regard to the traffic accident rate, with a statistical correlation of at least 0.86.
von Buxhoeveden and Becker [33] evaluated data on accidents involving trains, buses, ships, cars and planes from sources in different countries: public transport data from Switzerland, car accident data from Germany and air accident data from the United States. The work used R as the platform for developing a web application for data exploration through graphs and heat maps. The application can be used to cross-check information on public and private transport accidents and may reveal points with more occurrences and higher risk of future accidents.
Lamr [34] analyzed data collected by the Czech police, searching for patterns in the accidents in order to generate information to warn government agencies. The APRIORI algorithm was used to analyze the data, and clusters were gathered in heat maps. An exploratory analysis was also carried out.
Wang et al. [35] developed a neural network to analyze traffic issues related to trucks using their GPS signals. The work was developed in two phases: first the expansion phase, in which errors and omissions were treated, and then the prediction phase, in which the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) neural network methods were applied to improve accuracy. The data came from the city of Zhengzhou, China. The authors conclude that the LSTM approach achieved better accuracy than GRU.
Zhang and Hassan [36] evaluated accidents on Egyptian highways at night, assessing the severity of the injuries caused by the crashes with a multinomial logit model, together with an exploratory analysis. From the model, the authors conclude that rainy conditions tend to increase the fatality rate in night crashes; they also identified a relationship between male drivers and accidents related to high speed. Age is also an important factor, with young drivers having a high probability of being involved.
Mokoatle and Marivate [37] used road traffic accident data from Soshanguve, South Africa, applying an exploratory analysis and a cluster analysis to identify the main reasons for the occurrences. The authors identified that the most serious injuries are related to heavy vehicles, such as trucks and buses.
Mazouri et al. [38] applied a different approach to the analysis of road accidents in France, using the FP-growth algorithm combined with the Spark framework to help identify and extract association rules. Once identified, these association rules can help decision makers choose the best approach and strategy for road administration.
The study by Chen [39] evaluated public data on accidents in the city of Shanghai, China, between 2015 and 2016 using R. The author first treated the data by cleaning non-useful records, formatting the data and converting accident locations to latitude and longitude, which facilitated the creation of heat maps showing the distribution of accidents across the city. The author then developed three models to predict the probability of a certain accident category occurring; these models were validated on a dataset divided into training and testing sets. The models were built with linear discriminant analysis, random forest and decision trees. Linear discriminant analysis produced the best accuracy, random forest was better in the integration of different categories, and decision trees obtained the best kappa coefficient, which is used to measure the reliability of the model.
The last related work analyzed was the article by Zůvala et al. [40]. The authors considered accident data from the Czech Republic, using a sample from the Czech In-depth Accident Study (CzIDAS) and a sample from the data collected by the Police of the Czech Republic. The main objective was to generate information about the location and patterns of drivers, which was possible using both datasets. The secondary objective was to validate both datasets; the authors concluded that they were comparable with each other, which can indicate an accurate collection of the information.
Table 1 presents a comparative analysis of the works related to this article. Each selected article was analyzed considering the features most relevant to this study: the use of heat maps to facilitate the visualization of information, the execution of Exploratory Data Analysis (EDA) on the analyzed dataset, the application of any machine learning technique, the application of restriction techniques to the dataset, and public access to the data. These items are marked as yes when they are present and as no when they are not.
The analysis of the related works led to the identification of topics for this research, such as the use of public data, which allows new research to continue from the same data source and also supports the authenticity of the analysis. Among the main points, restricting the data to fatal occurrences is one of the improvements identified, since it can provide more detailed information about these occurrences. In addition, heat maps present information visually, and data balancing techniques aim to minimize the discrepancy in the analyzed data.

Methods and Procedures
This section details the procedures adopted for conducting the research, from data collection to the restriction for further analysis and presentation of results.

Work Delimitation
Traffic problems are part of people's daily lives and have a significant impact on the economy and public health. To conduct this study, the experiments used public accident data on RS federal highways, filtered to the years 2017 to 2020. These datasets are made available by the FHP of Brazil [6]. Figure 1 shows a snippet of the dataset, which is fully available at the FHP Open Data Repository. Figure 2 shows the steps performed during this work. The following list details the actions carried out in each step of the pipeline:

• Bibliographic survey: Selection of methods to be applied to the database selected for this research;
• Data collection: Loading of Federal Highway Police records for the period from 2017 to 2020;
• Data restriction: Cleaning of data that were erroneous or that would not be used during the analysis, as described in section 3.4;
• Algorithm selection and application: Evaluation and study of machine learning algorithms to perform the analysis of the previously treated data;
• Results presentation: Results of the tests performed and the knowledge obtained, presented through graphs, tables and heat maps, in addition to opportunities and recommendations for improvement in future analyses.

Data Collection
Data collection considered the public data between 2017 and 2020, since from 2017 onward the information provided contains more attributes about the occurrences. The data were grouped by person, cause and type of accident. After the data were selected, a filter was applied in the R language to generate a new dataset containing only records from RS/Brazil. This state was chosen for the relevance that the study can have for a region that is a major producer of soy and rice in Brazil and where the research group is located.

Restriction of Data
The raw data were treated using the R language, with the data loaded from CSV files. This step removes inconsistent or duplicate data, as well as data that would not be useful during the analysis.
First, the files for the years 2017, 2018 and 2020 were concatenated into a single dataset in R. Altogether, the original dataset had 37 variables. After the creation of the new data structure, a filter was applied to the UF variable to return only the records related to the state of Rio Grande do Sul, which corresponds to the scope of this work.
In the day_week field, which stores the day on which the occurrence was recorded, the values were transformed from text to a sequential integer, from Sunday to Saturday. A filter on the gender field was also used to eliminate records where the value entered was "Not Informed" or "Ignored".
In the latitude and longitude fields, decimal commas were replaced with decimal points to facilitate the identification of the coordinates of the occurrences.
A new column called AgeRange was added. This feature corresponds to the age group of the people involved in the accidents, with three intervals defined according to WHO definitions:
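The filtering and normalization steps described so far can be sketched as follows. The study itself used R; this is a minimal illustration in Python using only the standard library. The lowercase column name `uf`, the English weekday labels, the choice to start the sequence at 1 and the sample rows are all assumptions made for the example, not details taken from the original dataset.

```python
# Hypothetical English weekday labels; the raw files may use other spellings.
DAY_ORDER = ["sunday", "monday", "tuesday", "wednesday",
             "thursday", "friday", "saturday"]
# Assumption: the sequential integer starts at 1 for Sunday.
DAY_TO_INT = {name: i + 1 for i, name in enumerate(DAY_ORDER)}
EXCLUDED_GENDERS = {"Not Informed", "Ignored"}


def clean_record(row):
    """Return a cleaned copy of one record, or None if it is filtered out."""
    if row["uf"] != "RS":                     # keep Rio Grande do Sul only
        return None
    if row["gender"] in EXCLUDED_GENDERS:     # drop uninformative gender values
        return None
    cleaned = dict(row)
    cleaned["day_week"] = DAY_TO_INT[row["day_week"].strip().lower()]
    for field in ("latitude", "longitude"):
        # Replace the decimal comma with a decimal point, then parse.
        cleaned[field] = float(row[field].replace(",", "."))
    return cleaned


# Made-up sample rows, for illustration only.
rows = [
    {"uf": "RS", "gender": "Male", "day_week": "Sunday",
     "latitude": "-29,6842", "longitude": "-51,4586"},
    {"uf": "SC", "gender": "Female", "day_week": "Monday",
     "latitude": "-27,5954", "longitude": "-48,5480"},
    {"uf": "RS", "gender": "Ignored", "day_week": "Friday",
     "latitude": "-30,0346", "longitude": "-51,2177"},
]
cleaned_rows = [c for c in map(clean_record, rows) if c is not None]
```

Only the first sample row survives: the second is outside RS and the third has an excluded gender value.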

• Value 1: It corresponds to the young age group, people between 0 and 19 years old;
• Value 2: It corresponds to the adult age group, people between 20 and 64 years old;
• Value 3: It corresponds to the elderly age group, people aged 65 or over.
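The three intervals can be expressed as a small binning function. This is a sketch in Python rather than the R code used in the study; the function name `age_range` and the handling of negative ages are assumptions.

```python
def age_range(age):
    """Map an age in years to the three groups listed above."""
    if age < 0:
        # Assumption: negative ages are treated as invalid input.
        raise ValueError("age must be non-negative")
    if age <= 19:
        return 1   # young: 0 to 19 years old
    if age <= 64:
        return 2   # adult: 20 to 64 years old
    return 3       # elderly: 65 or over


ages = [0, 19, 20, 64, 65, 90]
groups = [age_range(a) for a in ages]
```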
Through the vehicle_manufacture_year field, a filter returns occurrence records only when the vehicle's year of manufacture is 1960 or later. This filter was applied because vehicles with an earlier year of manufacture were found to have incomplete information, such as name and model.
To facilitate the classification of the types of vehicles involved in the accidents, a new field, group_vehicle, was created from the type_vehicle field, following the rules described below:
• Passenger Transport: All vehicles of the automobile, van, utility, bus, minibus, scooter, motorcycle, tricycle and moped types were classified as passenger transport;
• Cargo Transportation: All vehicles of the truck, pickup, trailer and semi-trailer types were classified as cargo transportation;
• Traction: All wheeled tractor and tractor truck vehicles were classified as traction transport.
The added health_condition_seq field is a sequential value that reflects the severity of the condition of the person involved in the accident. This field acts as a filter throughout the development of the research, following the logic below:
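The grouping rules above amount to a lookup table from type_vehicle to group_vehicle. The sketch below is in Python, not the R used in the study; the lowercase English type labels and the `"Uninformed"` fallback for unlisted types are assumptions for illustration.

```python
# Rules taken from the list above; the labels are assumed to be lowercase English.
RULES = {
    "Passenger Transport": ["automobile", "van", "utility", "bus", "minibus",
                            "scooter", "motorcycle", "tricycle", "moped"],
    "Cargo Transportation": ["truck", "pickup", "trailer", "semi-trailer"],
    "Traction": ["wheeled tractor", "tractor truck"],
}
# Invert the rules into a direct type -> group lookup.
GROUP_BY_TYPE = {t: group for group, types in RULES.items() for t in types}


def group_vehicle(type_vehicle):
    """Return the vehicle group for a type_vehicle value."""
    # Assumption: any type not covered by the rules is labeled "Uninformed".
    return GROUP_BY_TYPE.get(type_vehicle.strip().lower(), "Uninformed")
```

For example, `group_vehicle("Bus")` returns `"Passenger Transport"`, and a type that is not in the rules falls back to `"Uninformed"`.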

• Value 1: When the person's condition is classified as unharmed;
• Value 2: When the person's condition is classified as minor injury;
• Value 3: When the person's condition is classified as a serious injury;
• Value 4: When the person's condition is classified as death.
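A minimal sketch of this encoding, and of its use as a filter for fatal cases, is shown below in Python. The source column name `health_condition`, the English condition labels and the sample records are hypothetical.

```python
# Severity scale from the list above; the text labels are assumed.
SEVERITY = {
    "unharmed": 1,
    "minor injury": 2,
    "serious injury": 3,
    "death": 4,
}


def health_condition_seq(condition):
    """Encode a textual health condition as its severity value (1 to 4)."""
    return SEVERITY[condition.strip().lower()]


# Made-up records, for illustration only.
people = [
    {"person_id": 1, "health_condition": "Unharmed"},
    {"person_id": 2, "health_condition": "Death"},
    {"person_id": 3, "health_condition": "Serious injury"},
]
for person in people:
    person["health_condition_seq"] = health_condition_seq(person["health_condition"])

# Using the field as a filter: keep only fatal cases.
fatal = [p for p in people if p["health_condition_seq"] == 4]
```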
Two date fields, month_occurrence and year_month_occurrence, were created to facilitate later filtering by dates. They were extracted from the month and year, in numeric format, of the data_inversa field.
Finally, the variable season was created so that the data could be grouped and analyzed by season. This field was derived from the month of the accident.
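The derived date fields can be sketched as below, in Python. The text does not give the exact formats or the month-to-season rule, so three things here are assumptions: that data_inversa is an ISO date string (YYYY-MM-DD), that year_month_occurrence is encoded as the integer YYYYMM, and that seasons follow whole-month Southern Hemisphere boundaries (December to February as summer, and so on).

```python
from datetime import date

# Assumption: whole-month Southern Hemisphere seasons.
SEASON_BY_MONTH = {
    12: "summer", 1: "summer", 2: "summer",
    3: "fall", 4: "fall", 5: "fall",
    6: "winter", 7: "winter", 8: "winter",
    9: "spring", 10: "spring", 11: "spring",
}


def derive_date_fields(data_inversa):
    """Derive month, year-month and season from an ISO date string."""
    d = date.fromisoformat(data_inversa)
    return {
        "month_occurrence": d.month,
        # Assumption: year and month packed as YYYYMM, e.g. 201907.
        "year_month_occurrence": d.year * 100 + d.month,
        "season": SEASON_BY_MONTH[d.month],
    }


fields = derive_date_fields("2019-07-14")
```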
Each accident may have one or more events associated with the same person, so, to avoid duplication of information, only the first accident event, which is marked as the main cause, was considered. As a practical example, consider a person who loses control, hits the curb and dies. In the original dataset there are two records of the same accident, which may lead one to believe that two people died, when in fact it was just one. Keeping only the first event reduced the number of records by 44%.
Originally the database had 101,293 records. After the above filters were applied, a total of 43,945 records and 43 columns remained in the sanitized dataset.
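The deduplication rule can be illustrated with the curb example. This Python sketch assumes hypothetical columns `accident_id`, `person_id` and `event_order` to identify a person within an accident and to rank that person's events; the real dataset may identify the first event differently.

```python
def keep_first_event(records):
    """Keep one record per (accident, person): the one with the lowest event order."""
    first = {}
    for rec in records:
        key = (rec["accident_id"], rec["person_id"])
        if key not in first or rec["event_order"] < first[key]["event_order"]:
            first[key] = rec
    return list(first.values())


# One person, one accident, two recorded events (made-up values).
records = [
    {"accident_id": 10, "person_id": 1, "event_order": 2,
     "event": "collision with curb", "health_condition": "death"},
    {"accident_id": 10, "person_id": 1, "event_order": 1,
     "event": "loss of control", "health_condition": "death"},
]
deduplicated = keep_first_event(records)
deaths = sum(1 for r in deduplicated if r["health_condition"] == "death")
```

After the filter, the single death is counted once instead of twice.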

Exploratory Analysis Results
This section presents the results obtained through the exploratory analysis of the data, regarding recorded accidents, fatal cases and the heat maps.

Accident Analysis
The graphs and observations recorded in this subsection refer to all accident records, regardless of whether they were fatal or not, which represents a total of 33,941 records. The colors of the graphs change according to the volume of occurrences: darker blue tones indicate a greater volume of occurrences.
Grouping the data by gender shows that practically three quarters of the occurrences (73.6%) are linked to male drivers or passengers.
The evaluation by day of the week shows that the number of records increases around the weekend, especially on Friday and Sunday, days commonly used for travel, as shown in Figure 3. Figure 4 shows that the main victims of accidents are in the age groups between 20 and 40 years old, with the highest rate in the interval between 20 and 30. This value may be related to the beginning of adulthood, when it is common to obtain a driver's license; accident risk in this age group is also commonly considered higher, often because of recklessness. Figure 5 presents the data in a similar way to the previous graph, but lists only the cases in which the person involved in the occurrence was the driver. The distribution of accidents remains the same, with the highest volume of occurrences in the age group of 20 to 40 years old.

Analysis of Fatal Occurrences
This section presents the events that resulted in fatalities. The analysis of the weather conditions related to fatal accidents, illustrated in Figure 6, shows that more than half of the occurrences took place during stable weather conditions. It should be noted that the weather condition is described by the police officer who recorded the occurrence. Figure 7 presents all causes linked to the occurrences, highlighting the driver's lack of attention in traffic as the leading cause, followed by the driver's disobedience to traffic rules. More than one cause may be linked to the same occurrence.

Heat Maps For Analysis of Fatal Events
Fatal accidents are grouped by the season in which they occurred and displayed using heat maps. Figure 9 shows the distribution of fatal accidents across the state for each season of the year. The distribution is similar in the 4 seasons. For each season, a table was added with the number of accidents and the percentage for the 5 highways with the highest number of occurrences.
A detailed analysis of the fatal accidents that occurred in the summer shows that 173 occurrences are related to passenger vehicles, while cargo transport is linked to 39 occurrences and traction vehicles to 12. Table 1 shows the distribution of the 5 highways that recorded the most fatalities in the summer. In the fall, passenger vehicles are again the most represented in fatal accidents, with a total of 189 cases, against 32 for cargo vehicles and 12 for traction vehicles. Table 2 presents the distribution among the highways with the highest number of fatalities in the fall. The map of occurrences in winter shows that this is the season in which accidents are most distributed throughout the state, with 194 occurrences involving passenger vehicles, 33 involving cargo vehicles, 18 involving traction vehicles and one that was not registered. The distribution by the 5 main highways is presented in Table 3. Table 4 presents the highways with the highest record of fatalities.
The heat maps of Figures 10, 11 and 12 show the distribution of fatal accidents across the state, organized into 3 maps, one for each vehicle group. One record was not categorized because it was listed as an 'uninformed' vehicle type. Figure 10 shows that occurrences in the passenger vehicles category are distributed throughout RS, which is explained by the higher number of accidents involving this category.
For cargo-type vehicles, illustrated in Figure 11, the number of occurrences is clearly greater in the northern region of the state. This category includes trucks, pickups, trailers and semi-trailers.
Lastly, Figure 12 shows the distribution of fatal accidents involving vehicles catego-

RoadLytics Model
The predictive model, which aims to classify whether a person involved in an accident died or survived, uses the random forest algorithm with the variables type_accident, origin_accident, type_involved, day_phase, track_layout, age, track_type, gender and soil_use. Other algorithms, such as Naïve Bayes and decision tree, were also evaluated. However, after initial validations on the dataset, random forest was chosen because of the algorithm's own characteristics, such as a decreased risk of overfitting, since it works with multiple decision trees.

Data Balancing
The analysis of the original dataset showed that most of the records belong to the class of people who survived the accident, and only 921 died. With such a dataset, a model built with any algorithm can reach an accuracy close to 100% simply by favoring the majority class, which can lead to a misinterpretation of the model when it is applied to new datasets. Because of this problem, it was first necessary to rebalance the data so that there were enough samples and less discrepancy between the people who survived and those who died. To create the new dataset, a set of survivor samples 3 times greater than the number of deaths was selected. This proportion was chosen to avoid excessive manipulation of the data while remaining sufficient to validate the algorithms. This technique is known as undersampling.
As indicated by Cartus et al. [41], undersampling involves randomly selecting examples from the majority class to delete from the dataset, which reduces the number of majority-class examples to the desired distribution. The rebalanced data were divided into two new datasets: 70% of the data for the training set and the remaining 30% for the test set. In addition, a validation base was created containing 1/20 of the data in the original dataset.
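The rebalancing and the 70/30 division can be sketched as follows. This is an illustrative Python version, not the R code used in the study: the helper names, the fixed random seeds, the synthetic records and the use of a plain (non-stratified) random split are all assumptions. Only the 3:1 ratio, the 70/30 proportion and the mortos label come from the text.

```python
import random


def undersample(records, label_key, ratio=3, seed=42):
    """Keep every minority record and a random subset of the majority class."""
    rng = random.Random(seed)
    minority = [r for r in records if r[label_key] == 1]   # deaths
    majority = [r for r in records if r[label_key] == 0]   # survivors
    keep = min(len(majority), ratio * len(minority))
    balanced = minority + rng.sample(majority, keep)
    rng.shuffle(balanced)
    return balanced


def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and test parts."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Synthetic data: 50 deaths among 1000 records (made-up proportions).
records = [{"id": i, "mortos": 1 if i < 50 else 0} for i in range(1000)]
balanced = undersample(records, "mortos")
train, test = train_test_split(balanced)
```

With 50 deaths, the balanced set holds 150 survivors and 200 records in total, split into 140 for training and 60 for testing.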
The model was created from the training dataset with the following random forest parameters. The ntree parameter, which corresponds to the number of decision trees to be created, was set to 300; this value was defined after creating and evaluating the model with different values. The mtry parameter, which corresponds to the number of variables considered as candidates at each split of a tree, was set to 3. The strata parameter was set to the variable being analyzed, in this case the mortos variable, and the sampsize parameter was set to 100 samples of survivors and 50 of deaths. Combined, these two parameters facilitate the classification of data created from a rebalancing of the original dataset. Finally, the importance and prox parameters were set to true; with these parameters enabled, the importance of each predictor variable for the model can be extracted. Figure 13 shows a snippet of the source code, which is fully available at the GitHub repository.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 28 May 2021 doi:10.20944/preprints202105.0698.v1
Figure 13. Code sample of RoadLytics model
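To make the effect of the strata and sampsize settings concrete, the sketch below draws the per-tree samples that such a configuration implies: 100 survivors and 50 deaths for each of 300 trees. It is written in Python with the standard library and is not the authors' R code. It assumes that each stratum is sampled with replacement, and it illustrates only the sampling step, not tree construction or the mtry selection. The helper name and the synthetic training set are hypothetical.

```python
import random


def stratified_bootstrap(records, label_key, sizes, rng):
    """Draw sizes[label] records from each class, with replacement (assumed)."""
    sample = []
    for label, n in sizes.items():
        stratum = [r for r in records if r[label_key] == label]
        sample.extend(rng.choices(stratum, k=n))
    return sample


rng = random.Random(0)
# Synthetic training set: 200 deaths and 600 survivors (made-up counts).
train = [{"id": i, "mortos": 1 if i < 200 else 0} for i in range(800)]

N_TREES = 300                      # plays the role of ntree
SAMPLE_SIZES = {0: 100, 1: 50}     # plays the role of sampsize per stratum
tree_samples = [
    stratified_bootstrap(train, "mortos", SAMPLE_SIZES, rng)
    for _ in range(N_TREES)
]
```

Every tree would then be grown on 150 records with a fixed 2:1 ratio of survivors to deaths, regardless of the class proportions in the training set.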

Application Results
In this section, the results achieved by applying the previously created model will be described.

Test Dataset Application
After the model was applied to the test dataset, which has a total of 1658 records, the confusion matrix shown in Figure 14 was obtained. As can be seen, the RoadLytics model correctly classified 645 survivors and 232 fatal victims within a universe of 1,165 occurrences. Running the predictive model on the test dataset also yields other metrics. Accuracy, the proportion of instances correctly classified by the model, was 0.77; the closer to 1, the better. Specificity, the model's ability to identify negative instances, was 0.80. Sensitivity, the model's ability to identify positive instances, was 0.67. From these data, an F-Score of 0.61 is obtained; this metric combines precision and sensitivity into a single value.
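The metrics named above all derive from the four cells of a confusion matrix. The sketch below shows the standard formulas in Python. The counts passed in are hypothetical and chosen only to make the arithmetic easy to follow; they are not the values from Figure 14. It also assumes that death is treated as the positive class.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    # F-Score: harmonic mean of precision and sensitivity.
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=60, fp=40, tn=160, fn=20)
```

With these counts, sensitivity is 60/80 = 0.75, specificity is 160/200 = 0.80, precision is 60/100 = 0.60, and the F-Score is 2 × 0.60 × 0.75 / 1.35, about 0.67.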

Validation Dataset Application
Finally, the same model was applied to the validation dataset, which was extracted from the raw data before any rebalancing technique was applied. The confusion matrix presented in Figure 15 shows that the model correctly classified 1745 survivors and 38 fatal victims.

Importance of Predictor Variables
Model creation considered the analysis of the selected predictor variables. Figures 16 and 17 show the information gain from each of the variables in the model. This selection of variables followed tests with different combinations. With these variables, the model reached an Out-Of-Bag (OOB) estimate of the error rate of 24.98%, which measures the random forest prediction error on samples not used to build each tree. Mean Decrease Accuracy shows how representative a variable is for the model. The variable type_accident stands out, which leads to the conclusion that without this predictor it would be difficult to obtain the result achieved. The variables origin_accident and type_involved are also important.

Figure 17. Predictor Variables - Mean Decrease Gini
The Mean Decrease Gini values identify the predictor variables that yield the best results when used as splitting nodes by the decision trees. The variable type_accident is highlighted, followed by origin_accident and age.

Final Considerations
The present study aimed to collect, analyze and present information about traffic accidents that occurred on federal highways in the state of RS, between the years 2017 and 2020. The rationale for the development of this work was based on the need to analyze and understand the main factors that can explain the probability of accidents on the highways, especially fatal accidents. This model can be reused by related projects and can be applied to any region helping to minimize the number of accidents and to prevent deaths by automotive collisions.
This work proposed a predictive model based on the random forest algorithm, implemented in the R programming language, after the data were collected and pre-processed. The results were presented in an exploratory and predictive way, using graphs and heat maps to contribute to the visualization and understanding of accidents, as well as of driver behavior, track and climatic conditions. Some variables were selected from the dataset to create the model; different combinations were tested to arrive at the group of variables used to build RoadLytics.
This study enables the analysis of the days with the highest incidence of accidents and the profile of drivers, as well as the assessment of the main causes of accidents and the highways with the highest incidence. The heat maps presented the distribution by season and type of vehicle, supporting the analysis with more technical elements. The predictive model showed satisfactory results, even with a 3 year limitation. A longer period and a greater number of variables eligible for the predictive model could contribute to greater accuracy.
Studies of scenarios like this can inform public institutions, insurance companies and the population as a whole about the behavior of drivers and the main risks of accidents on federal highways in the state of RS. They can serve as a basis for public policies aimed at mitigating the risk to the population and bring benefits in different areas, such as health, logistics and safety. As future work, other machine learning algorithms, both supervised and unsupervised, can be evaluated, in addition to the use of a longer period. Depending on the data made available by the Federal Highway Police, crossing these data with the open health data provided by the health ministry, called DATASUS [42], can be evaluated, providing extra information about deaths and adding new variables for model training. Furthermore, the model can be deployed and made available as an App to help end-users evaluate highway risks according to their profile [43], while respecting the data privacy of users [44]. The model could also be published as an open API to be integrated with map services, helping users to identify critical points dynamically during their trips [45].
Finally, future work will explore the use of Context Histories [46][47][48] to organize the data, allowing pattern analysis [49], context prediction [50] and similarity analysis [51]. These strategies for handling context histories will improve the analysis of the data, mainly allowing the prediction and recommendation oriented to the safety of drivers on the highways.