The data-driven pattern for healthy behaviors of car drivers based on daily records of Traffic Count data

Risk factors for road traffic injuries, such as driving offenses and average speed, are a concern for health organizations seeking to reduce the number of injuries. Without a comprehensive view of each road, one cannot decide on an effective policy. Data-driven policy therefore helps to improve and assess decisions. Traffic count data from roads near two airports are examined to investigate time-varying speed zones. Descriptive statistics, ANOVA, and functional data analysis were used. Hourly traffic count data for four locations at the entrances of the two airports, one international and one domestic, were collected for one year, from 2018 to 2019. The hourly pattern of driving offenses was assessed for each road, and the roads to and from the airports had different peaks (p < 0.05). The hour, weekday, type of airport, direction, and their interactions were statistically significant (p < 0.05) for the chance of driving offenses. The average speed during the day differed statistically (p < 0.5) by the number of vehicles of each type. Traffic count data are a valuable resource for decision making on safe-driving topics such as driving offenses, and functional data analysis lets us exploit most of the characteristics of such data. Airports are public places with high traffic demand in all countries, which yields distinct traffic patterns; we therefore extract the factors that affect driving offenses. Finally, we conclude that introducing a time-varying speed zone near airports seems vital.


Introduction
Road traffic deaths, which reach 1.35 million each year globally, are the 8th leading cause of death across all ages (2.5% of total deaths), and high-risk driving is one of the most important causes of death and major injury in developing countries. (1) The association between crash risk and travel speed seems obvious. (2,3) Speeding, overtaking on the wrong side, and rapid lane changes are high-risk driving behaviors; they count as driving offenses and are the leading causes of road traffic crashes (RTC) in Tehran. (4) The prevalence of some risky driving behaviors in Iran was estimated nationally from surveys as follows: 0.5 (0.4-0.6) for drink driving and 3.5 (3.2-3.8) for being an occupant in a car with a drunk driver. (5) Other risky behaviors include lack of attention, getting trapped in the car, listening to music, fatigue and sleepiness, trip duration and distance, and neglecting to wear a seatbelt while driving. (6) On the other hand, safe driving is related to demographic and health status. (7,8) Modifying traffic laws, enhancing police controls, improving transport infrastructure, and educating drivers were mentioned as policies for reducing fatal road traffic injuries in Iran. (6,9) Penalties for risky behavior such as using a cell phone while driving are protective. (10) Therefore, identifying the main risk factors for RTC, such as situations, places, and behaviors, helps us to control, manage, and prevent crashes and to set health policies.
Many meta-analyses have shown that maintaining speed cameras is effective for speed management, for reducing fatal and serious injury crashes (11,12), and for a safe system (13). Speed camera programs differ between countries; in Iran, for example, the number of speed cameras on rural roads improves road safety. (14,15) While speed cameras record many frames of data subjects in each time interval and use big data and AI for automatic decision making (16), their privacy implications and GDPR compliance are under discussion. (17) Functional data analysis is a field of statistics that deals with high-dimensional data generated by an underlying function; for example, the hourly records of a speed camera can be considered a function. (18) The hourly records are high dimensional, and multivariate statistical methods do not capture the underlying functions in these datasets. Therefore, we use functional canonical correlation to study the relationship between two functions, and functional regression to estimate the effects of covariates on whole functions. (19,20) Other methods, such as multi-level mixed models, have been used to model traffic count data. (21) Controlling road speed, managing hourly travel patterns, and decreasing traffic congestion are examples of data-driven policy for the sustainable smart city, derived from analyzing official statistics (22), computer vision (23), and speed camera data. (24,25) The implementation of data-driven policy has three phases: prediction and problem definition, design and experimentation, and evaluation and implementation. (26) Many countries (32) impose speed limits in special areas such as school and playground zones (27,28), residential areas (29), rural areas (30), and urban areas (31) to protect children, cyclists, and others from traffic collisions. In this study, we first show the temporal pattern of speeding offenses in airport zones, both domestic and international; these patterns change at particular hours. Airports are places that people often want to reach or leave very quickly, and we therefore see speeding offenses near them. Secondly, the average speed of different types of vehicles is studied as a behavioral factor in airport zones. Many passengers want to reach international airports in a short time; it therefore seems that high-risk driving, including high-speed driving, appears on the roads to the airport.

Population and Sample
According to the Iranian Airports Holding Company, Iran has 51 airports. We chose two airports for this study. Imam Khomeini International Airport, OIIE (IKA), is located outside the capital city of Tehran and has 8,647 domestic and 8,843,585 international passengers; it is used almost exclusively for international flights. Isfahan International Airport (Esfahan Shahid Beheshti International), OIFM (IFN), is located near Isfahan, one of the most important cities in Iran, and has 2,258,806 domestic and 525,810 international passengers; most of its flights are domestic. (33)

Dataset
To analyze the traffic pattern near the airports, we selected the nearest traffic cameras beside the highways to and from these airports (Figure 1). Each airport has two traffic cameras: one near the road to the airport and the other near the road from the airport. We then collected the hourly traffic data to reveal, first, the difference between the to-airport and from-airport patterns; second, the difference between the patterns of the international and the domestic airport; and third, the difference in traffic patterns between weekdays.

Figure 1 - Information about the counter stations (panels: Imam Khomeini International Airport; Isfahan Shahid Beheshti International Airport).
The Traffic Count dataset, with station characteristics such as latitude, longitude, number of lanes, and capacity, is available from the Ministry of Roads and Urban Development website. (34) Each dataset has 15 columns: the road code (a six-digit code for go/return), the road name, the start time of capturing, the end time of capturing, the functional time of the Traffic Count (60 minutes in a usual capture), total vehicles, total vehicles of type 1 to type 5 (type 1 is cars and pick-ups; type 2 is trucks, small trucks, and minibuses; type 3 is ordinary lorries less than 4 m and three axles; type 4 is buses; and type 5 is trailers and freighters with more than three axles), average speed (calculated over all types of vehicles), number of speeding offenses (the number of vehicles on the road in a time interval whose speed exceeds the speed limit), and number of failures to follow a vehicle at a safe distance (a gap of less than two seconds to the preceding vehicle). These datasets are available hourly and daily for each month, with a precision of at least 90% to 95%.

Data Quality and Data Cleaning
The hourly dataset for each Traffic Count was captured for one year, from 2018-10-23 to 2019-10-22. The data for 2019-07-01 to 2019-07-31 are not available. There are 31,282 observations in total (15,714 for Tehran and 15,568 for Isfahan), but about 6% of the observations (about 9% for Isfahan) are excluded because the counter station captured less than 60 minutes in the hour, so only 29,502 observations (15,307 for Tehran and 14,195 for Isfahan) are used in this analysis. There are also some missing values in the datasets, which we impute with the station-, weekday-, and hour-specific mean.
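The cleaning steps described here can be illustrated with a minimal Python sketch. It is not the authors' code: the record layout and the field names (`station`, `weekday`, `hour`, `functional_minutes`, `total_vehicles`) are hypothetical, and the toy records are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean


def clean_and_impute(records, value_key="total_vehicles"):
    """Drop short captures, then fill missing values with group means."""
    # Keep only records captured for a full 60 minutes (copy to avoid mutation).
    kept = [dict(r) for r in records if r["functional_minutes"] == 60]

    # Collect observed values per (station, weekday, hour) group.
    groups = defaultdict(list)
    for r in kept:
        if r[value_key] is not None:
            groups[(r["station"], r["weekday"], r["hour"])].append(r[value_key])

    # Replace each missing value with the mean of its group, when one exists.
    for r in kept:
        key = (r["station"], r["weekday"], r["hour"])
        if r[value_key] is None and groups[key]:
            r[value_key] = mean(groups[key])
    return kept


# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"station": "IKA-to", "weekday": 0, "hour": 8, "functional_minutes": 60, "total_vehicles": 100},
    {"station": "IKA-to", "weekday": 0, "hour": 8, "functional_minutes": 60, "total_vehicles": 140},
    {"station": "IKA-to", "weekday": 0, "hour": 8, "functional_minutes": 60, "total_vehicles": None},
    {"station": "IKA-to", "weekday": 0, "hour": 8, "functional_minutes": 45, "total_vehicles": 30},
]
cleaned = clean_and_impute(records)
```

In this sketch the 45-minute record is dropped before the group means are computed, so short captures do not influence the imputed values; whether the original analysis used the same ordering is an assumption.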

Indices Defining
We created two new indices. The first is total vehicles per minute, obtained by dividing the total vehicles by the functional time of the counter station, which is 60 minutes; records whose functional time differs from 60 minutes are therefore omitted. The second is total driving offenses per minute, obtained by summing the two types of offenses and dividing by 60 minutes.
We then calculate the occurrence probability of any offense per minute, P(x), by dividing total driving offenses per minute by total vehicles, and transform this probability with the logit function, log(P(x)/(1-P(x))). The logit is simple to interpret and ranges from negative infinity to positive infinity. The log-odds of the occurrence of an offense per minute is thus calculated for each hour, at each station, on each day of the week.
The analysis data structure consists of the functional offense logit over hours 0-23, replicated over weekdays, for each counter station.
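The indices defined in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: P(x) is taken as the ratio of the two per-minute indices, and the handling of hours where P(x) is 0 or 1 (where the logit is undefined) is a choice made here, since the text does not say how such hours are treated. The function name and the example counts are hypothetical.

```python
import math


def offense_logit(speeding, unsafe_distance, total_vehicles, minutes=60):
    """Return (P, logit(P)) for one hourly record; logit is None when undefined."""
    # Index 2: total driving offenses per minute (sum of both offense types).
    offenses_per_minute = (speeding + unsafe_distance) / minutes
    # Index 1: total vehicles per minute.
    vehicles_per_minute = total_vehicles / minutes
    if vehicles_per_minute == 0:
        return None, None
    # Assumption: P is the ratio of the two per-minute indices.
    p = offenses_per_minute / vehicles_per_minute
    # The logit is undefined at 0 and 1; this sketch simply flags those cases.
    if p <= 0 or p >= 1:
        return p, None
    return p, math.log(p / (1 - p))


# Hypothetical hour: 30 speeding and 20 safe-distance offenses among 200 vehicles.
p, logit = offense_logit(speeding=30, unsafe_distance=20, total_vehicles=200)
```

A negative logit, as in this example, means that fewer than half of the vehicles in the hour committed an offense; a logit of zero corresponds to P(x) = 0.5.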

Descriptive Statistics
We calculate descriptive statistics, including the mean and standard deviation of total vehicles and total offenses per hour, separately for stations and directions.

Full Factorial Analysis of Variance
We use a three-way analysis of variance (ANOVA) to investigate the effects of City, Direct, and Days on the chance of offenses. The anova() function in R (35) is used. The adjusted station-weekday-hour mean is used for imputation.

Functional Data Analysis
Functional data analysis (FDA) deals with high-dimensional data recorded on continuous domains such as time or space. The hourly pattern of the chance of offenses is functional data, so we use a B-spline basis to estimate the functions. The number of basis functions and the regularization parameter are obtained by generalized cross-validation. (36)
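To make the B-spline basis concrete, the sketch below evaluates a clamped B-spline basis with the Cox-de Boor recursion in plain Python. It is an illustration only: the analysis itself relies on R packages, the equally spaced interior knots are an assumption, and the smoothing and generalized cross-validation steps are not reproduced. The settings of 26 basis functions of order 6 on the hours 0 to 23 are those reported in the results.

```python
def clamped_knots(lower, upper, n_basis, order):
    """Knot vector with `order` repeated end knots and equally spaced interior knots."""
    n_interior = n_basis - order
    step = (upper - lower) / (n_interior + 1)
    interior = [lower + step * j for j in range(1, n_interior + 1)]
    return [lower] * order + interior + [upper] * order


def bspline_basis(x, knots, order):
    """Evaluate all B-spline basis functions of the given order at x (Cox-de Boor)."""
    # Order-1 (piecewise constant) functions: indicator of each knot interval.
    values = []
    for i in range(len(knots) - 1):
        inside = knots[i] <= x < knots[i + 1]
        # Close the last non-empty interval on the right so x == knots[-1] is covered.
        if x == knots[-1] and knots[i] < knots[i + 1] == knots[-1]:
            inside = True
        values.append(1.0 if inside else 0.0)
    # Raise the order one step at a time; zero-width spans contribute nothing.
    for k in range(2, order + 1):
        new_values = []
        for i in range(len(knots) - k):
            left_den = knots[i + k - 1] - knots[i]
            right_den = knots[i + k] - knots[i + 1]
            left = (x - knots[i]) / left_den * values[i] if left_den > 0 else 0.0
            right = (knots[i + k] - x) / right_den * values[i + 1] if right_den > 0 else 0.0
            new_values.append(left + right)
        values = new_values
    return values


# 26 basis functions of order 6 over the hours of the day (0 to 23).
knots = clamped_knots(0.0, 23.0, n_basis=26, order=6)
basis_at_8 = bspline_basis(8.0, knots, order=6)
basis_at_end = bspline_basis(23.0, knots, order=6)
```

At any hour at most six of the 26 basis functions are nonzero and they sum to one, which is why a smooth hourly curve can be written as a weighted sum of these functions with locally acting coefficients.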

Functional Canonical Correlation
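We calculate the relationship between two random functions $X$ and $Y$ with the sample functional canonical correlation, which is based on inner products of the form $\langle \xi, X \rangle = \int \xi(t)\, X(t)\, dt$, where $\xi$ is a canonical weight function.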

Function-on-Scalar Regression
Function-on-scalar regression (FOSR) is a class of FDA methods in which the response is a function and the covariates are vectors. We use a fully Bayesian FOSR (39), whose formula is given in part 1.1.1 of the supplement, because it addresses the computational issues of high-dimensional data, provides regularization and factor selection, and quantifies the importance of each factor.
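For orientation, a generic function-on-scalar model can be written as below. This is the textbook form, given here as an assumption; it is not necessarily the exact Bayesian specification in the supplement, and the symbols $Y_i$, $x_{ij}$, $\beta_j$, and $\varepsilon_i$ are notation introduced only for this illustration.

```latex
Y_i(t) = \beta_0(t) + \sum_{j=1}^{p} x_{ij}\, \beta_j(t) + \varepsilon_i(t),
\qquad t \in [0, 23]
```

Here $Y_i(t)$ would be the offense logit curve of replicate $i$ over the hours of the day, $x_{ij}$ the scalar covariates (for example indicators for airport, direction, and weekend, and their products for interactions), $\beta_j(t)$ the coefficient functions, and $\varepsilon_i(t)$ the error process.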

Generalized Additive Models
Generalized additive models (GAM) are used to capture non-linear covariate effects with cubic regression splines. (40)

The Descriptive Statistics
The descriptive statistics, including the mean and standard deviation of total vehicles and total offenses by hour, airport, and direction, are presented in Table 1. There are two types of driving offenses: speeding and failing to follow a vehicle at a safe distance. According to our analysis, the probability of speeding offenses at the international airport in each hour/direction/weekday is higher than at the domestic airport. In contrast, the probability of failing to follow a vehicle at a safe distance is higher at the domestic airport in each hour/direction/weekday than at the international airport. (Supplement Part 3)

The Full Factorial ANOVA
The full factorial three-way ANOVA was used to assess the effects of city, direction, and weekday, and all their interactions, with the aov() function in R. The dependent variable is the average chance of offense occurrence in an hour (Table 2). According to this table, the main effects of City and Days and the interaction effects City:Direct and Direct:Days are statistically significant (p-value < 0.05). Therefore, the adjusted average by City, Direct, Days, and month was used to impute missing values in the dataset, and the adjusted average by City, Direct, and Days was then used to impute the remaining missing values.

The Functional Canonical Correlation
The hourly pattern of the chance of offenses on the road to the airport against the road from the airport is investigated with functional canonical correlation. We therefore represent the data with a B-spline basis of 26 basis functions of order 6, using functions from the fda package. The generalized cross-validation (GCV) method was used to find the smoothing parameter, lambda. This gives six pairs for studying the relationship; we describe only the two most important here (Figure 2). The canonical correlation coefficient is estimated at 0.7, and a pair of canonical weight functions is plotted for the six situations. We observe that the weight pattern of "To the IKA airport" differs from that of "From the IKA airport": "To the IKA airport" has a bump from 19 to 20, whereas "From the IKA airport" has two bumps, one near 10 and the other near 18 and 19.
In contrast, the weight patterns of "To the IFN airport" and "From the IFN airport" differ only slightly.
The hourly pattern of the chance of offenses differs by both direction and airport city. To investigate these factors, we use Bayesian function-on-scalar regression.

The Function-on-Scalar Regression
To understand the effects of the airport city, direction, and weekend on the chance of driving offenses, we use Bayesian function-on-scalar regression with the fosr R package. We also study the interaction effects of these factors. We use 23 basis functions to model the response curve. The model took about 16 minutes to run on a regular laptop.
All variables are significant (p-value less than 0.05). The most important variable is the interaction between station and direction, followed by the station, and then by the direction and the interaction between station and weekend (Table 3). The estimated coefficient functions and 95% simultaneous credible bands for the proposed regression model are presented in the supplement. The station effect, with IKA as the baseline, is positive from 6:00 to 10:00 and from 20:00 to 24:00, which are the busy times on the roads. The direction effect, with To as the baseline, is largest between 6:00 and 12:00. The pattern changes at the weekend: with non-weekend as the baseline, the weekend effect is positive from 7:00 to 19:00, whereas the non-weekend effect is negative, with the most negative value at 12:00.
There are three estimated two-way functional interaction effects. The first is the interaction between station and direction, which is the most important effect. The interaction between station and weekend is negative from 5:00 to 17:00 and positive in the other hours. The interaction between direction and weekend is negative from 5:00 to 10:00.
Finally, the three-way interaction between station, direction, and weekend is mostly positive from 1:00 to 12:00, with the highest value at 7:00, and negative in the remaining hours, with the largest magnitude at 17:00.

The Behavioral Factor
The Descriptive Statistics
The average and standard deviation of speed per hour are calculated for all types of vehicles. The total vehicles, with the maximum number for types 1 to 5 per hour, are presented in Table 4. The average speed is highest at 8:00, at 94 km/h with a standard deviation of 8.3. The total number of vehicles of types 1, 2, 3, 4, and 5 has two peaks: at 8:00 and 20:00, at 8:00 and 17:00, at 11:00 and 17:00, at 8:00 and 16:00, and at 11:00 and 16:00, respectively. (Further analysis is given in part 2 of the supplement.)

Two Types of Driving Offenses
There are two types of driving offenses, and we study each with a GAM, estimating nonlinear effects with cubic regression splines alongside linear effects, fitted by restricted maximum likelihood (REML). Penalty-based model selection is applied by setting select = TRUE in the gam() function. The AIC, BIC, adjusted R-squared, and percentage of deviance explained were used to select the best model. The degree of each cubic regression spline was determined from its effective degrees of freedom.

Speeding
The probability of speeding offenses is near zero below an average speed of 90 km/h for all airports and directions, except for the international airport in the "from" direction. On the other hand, the estimated coefficients are nonzero for average speeds of 90-120 km/h in all hours (supplementary part 3.1).

Failing to follow a vehicle at a safe distance
Only the domestic airport has a positive effect on failing to follow a vehicle at a safe distance (supplementary part 3.2).

Discussion
The average hourly traffic volume near the airports in 2018-2019 is compared across 4 different countries. Traffic count data for Cork and Shannon airports in Ireland, Heathrow and Gatwick airports in the UK, Hartsfield-Jackson Atlanta International Airport in the USA, and IKA and IFN airports in Iran were gathered and averaged over all traffic counts in the airport zones. (41)(42)(43) The figure in part 4 of the supplement shows two peaks, the first between 5 and 10 am and the second between 15 and 20, and the same pattern is observed in all 4 countries. However, this pattern differs by road direction, airport type (international vs domestic), and weekday. The full list of traffic sites with their information is available in the supporting file. The chance of traffic offenses increases with the average hourly volume.
The global rate of road traffic deaths is 18.2 per 100,000 population, with the highest rate in Africa (26.6), the lowest in Europe (9.3), and a rate of 18 in the Eastern Mediterranean. According to the Global Status Report on Road Safety, about 132 countries have funded national road safety goals, targets, and strategies, and 109 countries also have targets for the reduction of road traffic deaths. There are 12 global road safety performance targets for 2020 and 2030. Among them, target number 6 states that "By 2030, halve the proportion of vehicles traveling over the posted speed limit and achieve a reduction in speed-related injuries and fatalities". The key risk factors for preventing road traffic deaths are speed, drink-driving, motorcycle helmet use, use of seat-belts, and child restraint systems. Fewer than 10 countries have laws covering all of these risk factors. The number of countries with a speed law is 46, with a drink-driving law 45, with a seat-belt law 105, with a helmet law 49, and with a child restraint law 33. Speed management decreases the number of fatalities and serious injuries in traveling vehicles. There are three best-practice criteria for assessing legislation on speed: the existence of a national speed limit law, urban speed limits not exceeding 50 km/h, and the power of local authorities to modify speed limits. According to "Safe road users", Iran has national laws on speed limits, drink-driving, motorcycle helmets, seat-belts, mobile phone use while driving, and drug-driving. The national speed limit law states that the maximum urban speed limit is 60 km/h (which is higher than 50 km/h), the maximum rural speed limit is 95 km/h, and the maximum motorway speed limit is 120 km/h. (1,44)
This study shows that the chance of driving offenses on airport roads is not constant during the day and has several peaks, that this pattern differs between the international and the domestic airport, and that the direction to the airport differs from the direction from the airport, again with different peaks. Therefore, the chance of driving offenses on the roads to the airports varies with temporal effects (the time of day and the day of the week), the type of airport (domestic or international), and the direction (to or from the airport).
In addition to the effects on driving offenses mentioned above, we also study behavioral effects: the average speed and the vehicle types. The hourly pattern of average speed does not vary widely because we consider the average, but each type of vehicle behaves differently. Finally, other behavioral risk factors, such as driver status and skills, are not considered in this study.
This study suggests making data-driven policies, in which local authorities modify speed limits, for managing, controlling, and preventing driving offenses, especially in specific areas such as airports. Meanwhile, the privacy of data subjects, vehicles, and driver behaviors must be respected when their data are used, in line with regulations such as GDPR. In the near future, such policies will become vital with the wide use of self-driving and automated vehicles, which communicate with each other and with police stations (45), and of intelligent speed adaptation (ISA). (46)

Conclusions
Data-driven policy in which local authorities modify speed limits will help to control, manage, and prevent driving offenses, a high-risk behavioral health index. The time of day, day of the week, type of airport (international or domestic), direction (to or from the airport), and type of vehicle are factors that influence this policy.