Statistical Study on Crash Frequency Model Using GNB Models of Freeway Sharp Horizontal Curve Based on Interactive Influence of 3 Explanatory Variables

Crash prediction for the sharp horizontal curve segment (SHCS) of a freeway is an important tool for analyzing the safety of SHCSs and for building a crash prediction model (CPM). The design and crash report data of 88 SHCSs were collected from different institutions, and three negative binomial (NB) regression models and three generalized negative binomial (GNB) regression models were built to show that the interactive influence of explanatory variables plays an important role in goodness of fit. The study demonstrates the effective use of the GNB model in analyzing the interactive influence of explanatory variables and in crash prediction for freeway basic segments. Traffic volume, horizontal curve radius, and curve length were used as explanatory variables. We then performed statistical analysis to determine the model parameters and conducted a sensitivity analysis. Among the six models, model 6, which considers the interactive influence, fits much better than the other models according to the fitting criteria. We also compared the actual crash counts of the 88 SHCSs with those predicted by models 1, 3, and 6. The results show that model 6 is considerably more reasonable than models 1 and 3.


Introduction
Compared with other highways, a freeway is usually designed to provide a relatively good driving environment, characterized by high alignment standards, good pavement, full enclosure, absence of pedestrians, no low-speed interference, complete traffic safety devices, and so on. Thus, in developed countries the crash rate and death toll of freeways average 30% to 51% and 43% to 76%, respectively, of those of ordinary highways. In China, however, the average crash number, death toll, injury toll, and direct property loss of freeways are 3.2, 8.4, 7.2, and 24.3 times those of ordinary highways, respectively. Therefore, it is important to determine, based on reliable databases, how crashes actually occur on freeways and how different types of freeway environment influence the crash number.
Over the past several decades, historical surveys covering the features and frequencies of freeway crashes have been actively pursued (Durduran, 2010 [1]; White Jules et al., 2011 [2]). However, for freeway crashes within China, specialized crash databases and highway design databases are not available at present. Similarly, investigations that could clarify China's current situation have not been performed. Thus, Zhong Liande et al. (2009) [3], Ma Zhuanglin et al. (2012) [4], and other researchers developed crash prediction models with a relatively small number of samples. To improve on this effort, this paper attempts to establish a model with a larger sample. Mathematical statistics and regression analysis are common methods for predicting highway crashes. Other methods, such as fuzzy mathematics, grey theory, neural networks, and clustering analysis, have also been used to establish prediction models. The American HSM2010 provides established prediction models based on statistical regression. IHSDM simulates crash prediction for American two-lane highways well (U.S. federal highway data). Tang Chengcheng et al. (2009) [5] carried out research on a two-lane highway crash prediction model that focused on low-grade highways in China.
Freeway crashes result from the combined influence of multiple factors, such as alignment, traffic volume, and the presence of interchanges or other structures. The abovementioned methods explain how a single factor influences crashes but fail to explain how these factors jointly, and the interactions among them, influence crashes. For this reason, studying crash prediction models requires dividing the freeway into several segment types, namely, basic segments, general segments, and special segments. Since we discussed the crash prediction model of basic segments in a paper published in the Journal of Southeast University (Xiaofei Wang et al., 2014), we take the freeway sharp horizontal curve segment (SHCS) as the research object in this paper. In the crash prediction model, segment length, curve radius, and traffic flow are selected as explanatory variables, and the crash number is the dependent variable.
The remainder of this paper is organized as follows. In the next section, we review the relevant literature. In Section 3, we present the collected data and the glossary, the basic model, and a discussion of the variables. In Section 4, we discuss the statistical analysis used to determine the model parameters and the sensitivity analysis performed, compare the actual crashes with the predicted ones, and present the final results. In Section 5, we present the conclusions.

Literature Review
At present, the commonly used method of building a highway traffic crash prediction model is the general linear model, or the log-linear model obtained by logarithmic transformation into a linear equation. Many of the crash prediction models of HSM2010 are analyzed through the log-linear model. An analysis of the common traffic crash prediction models shows that, in building the model, the usual basic assumption that all explanatory variables are mutually independent ignores the influence of the variables on one another. As a result, the modeled relationship between the explanatory variables and traffic crashes is not fully in accordance with the actual situation. Although a considerable number of recent highway safety studies (Yannis et al., 2005 [6]; Hill et al., 2006 [7]; Dominique Lord, 2006 [8]; Liu, Bor-Shong, 2007 [9]; N.N. Sze et al., 2007 [10]; Rhodes et al., 2011 [11]; [12]) considered the interaction among explanatory variables, most are based on the analysis of the relationship among driver, vehicle, highway (Miaou, S. and Lum, H., 1993 [13]), and environment (Fridstrom et al., 1995 [14]). The results of these studies show, among other things, the different dangers of driving on highways and the effect of division on traffic flow. Moreover, the results show that when the lengths of the analyzed segments differ, the effect of traffic flow on the predicted crashes also differs.
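In the present economics and transport logistics industries, the translog (super logarithmic) function model is frequently used (Bozdogan, 1987 [15]; Christensen et al., 1973 [16]). Its basic form is

$$\ln Y = \beta_0 + \beta_K \ln K + \beta_L \ln L + \beta_{KK} (\ln K)^2 + \beta_{LL} (\ln L)^2 + \beta_{KL} \ln K \ln L \qquad (1)$$

where Y is the dependent variable, K and L are the explanatory variables, and $\beta_0$, $\beta_K$, $\beta_L$, $\beta_{KK}$, $\beta_{LL}$, and $\beta_{KL}$ are the estimated parameters.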
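As an illustration of the two-variable translog form described above, the following minimal Python sketch builds the six regressors (constant, two log terms, two squared log terms, one cross term) and evaluates ln Y. The function names and the parameter ordering are assumptions made for this sketch; some formulations place a factor of 1/2 on the second-order terms, which is treated here as absorbed into the parameters.

```python
import math


def translog_terms(k, l):
    """Return the six regressors of a two-variable translog form.

    Order (an assumption of this sketch): constant, ln K, ln L,
    (ln K)^2, (ln L)^2, ln K * ln L.
    """
    ln_k, ln_l = math.log(k), math.log(l)
    return [1.0, ln_k, ln_l, ln_k ** 2, ln_l ** 2, ln_k * ln_l]


def translog_log_y(k, l, params):
    """Evaluate ln Y as the dot product of six parameters and the terms."""
    terms = translog_terms(k, l)
    if len(params) != len(terms):
        raise ValueError("expected six parameters")
    return sum(p * t for p, t in zip(params, terms))
```

Setting the last three parameters to zero recovers the log-linear form without interaction, which is the comparison the paper draws between models with and without cross variables.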
Wei Huang (2007) [17] and Li Li (2011) [18] studied the generalized translog cost function (GTCF). António (2011) [19], Lurong Wu (2010) [20], Juan Zeng (2010) [21], Rong Li (2013) [22], and Xiang Liu (2012) [23] introduced the translog function to analyze the loss and frequency of traffic crashes. In the translog model established by António et al., $\mu_{it}$ is the mean number of accidents per year; $F_{it}$, $Leng_i$, $Den_i$, and $T_t$ are the explanatory variables, namely AADT, segment length, density of access, and a time trend variable; and $\beta_k$ (k = 0 to 9) and $\gamma$ are the estimated parameters. Because the time trend variable does not take the form of a "cross variable", no interaction is expected between it and the other explanatory variables in relation to accident frequency.
Using the logarithmic function NB model, Xiang Liu (2012) and Rong Li (2013) established a frequency forecast model for highway traffic crashes in Ontario, Canada. Compared with the log-linear NB model, it proved to be more credible.
To deal with the combined influence of multiple factors, we introduced flexibility into our research.
Flexibility is often used in the manufacturing industry to describe the ability to respond to a varying environment or to the uncertainty arising from variation. The Cobb-Douglas production function, linear production function, Leontief production function, variable elasticity of substitution (VES) production function, and translog production function are often used to analyze flexibility [24]. Among these, the translog production function is the most widely used in analyzing traffic problems. Thus, the translog function was adopted in this paper to study the difference between considering and not considering the combined influence of multiple factors. The model with the better fit was chosen as the CPM of SHCS. Then, the CPM was checked against the real traffic crash data.

Materials
To acquire enough samples for a meaningful statistical analysis, four major sources were used:

Definition
In this study, the following segments are defined as sharp horizontal curve prediction segments:

Results
We built the freeway crash prediction model by selecting AADT, length of sharp horizontal curve segments, and curve radius as explanatory variables.
We set up the NB crash prediction model based on the constant elasticity and flexibility of variables (see formulas 3 and 4),
where $\mu_i$ = the estimated number of crashes for a specific year on segment i;

Excessive dispersion coefficient
First, we extended the models and obtained six main models. Each main model also contains several specific models. Among the six main models, three are generalized negative binomial models and three are negative binomial models (see Table 3). The difference between the generalized negative binomial model and the negative binomial model lies in the excessive dispersion coefficient. We also used AIC, BIC, and Pseudo R-2 to select, for each of the six main models, the specific model with the best goodness of fit. The excessive dispersion coefficient of each model and its AIC, BIC, and Pseudo R-2 values are listed in Table 4.
Tab. 4 AIC, BIC, and Pseudo R-2 of the 6 main models and their specific models

We used the following standards to examine and verify the goodness of fit for the candidate parameters of β. (1) The Pseudo R-2 statistic is used to test the goodness of fit of the models: the larger it is, the better the model. The information criteria, which rest on the decision rule of the likelihood ratio test, work in the opposite direction: the smaller the value, the better the model.
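The three selection criteria can be computed from the maximized log-likelihood of a fitted model. The sketch below uses the standard definitions of AIC and BIC; the paper does not state which Pseudo R-2 it uses, so McFadden's version (one minus the ratio of the model log-likelihood to that of an intercept-only model) is an assumption here, as are the function names.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: smaller is better."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: smaller is better."""
    return n_params * math.log(n_obs) - 2 * log_lik


def mcfadden_pseudo_r2(log_lik_model, log_lik_null):
    """McFadden's pseudo R-squared (assumed variant): larger is better."""
    return 1.0 - log_lik_model / log_lik_null
```

Because BIC penalizes each parameter by ln(n) rather than 2, it favors the smaller model more strongly than AIC once the sample exceeds about eight observations, which matters when comparing three-term and ten-term specifications on 88 segments.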
As shown in Table 4, although the values are quite close for some models (models 1, 2.1, 2.2, and 2.3), when T is selected as the excessive dispersion coefficient parameter in the remaining models, the AIC and BIC values tend to be smaller and the Pseudo R-2 value tends to be larger than with the other candidate parameters.
These observations indicate that the models fit better with T than with the other candidates.
Thus, we selected T as the parameter of β; that is, β is expressed as an exponential function of T.
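To make the role of the excessive dispersion coefficient concrete, the following sketch evaluates a negative binomial log-likelihood in which the dispersion varies with T. The exact expression used in the paper is not recoverable from the text, so the log-linear form exp(gamma0 + gamma1 ln T) and the names gamma0 and gamma1 are assumptions of this sketch, following the usual generalized negative binomial parameterization.

```python
import math


def nb_log_pmf(y, mu, alpha):
    """Log-probability of count y under a negative binomial distribution
    with mean mu and dispersion alpha (variance = mu + alpha * mu**2)."""
    inv = 1.0 / alpha
    return (math.lgamma(y + inv) - math.lgamma(inv) - math.lgamma(y + 1)
            + inv * math.log(inv / (inv + mu))
            + y * math.log(mu / (inv + mu)))


def gnb_dispersion(t, gamma0, gamma1):
    """Dispersion as an exponential function of ln T (assumed form)."""
    return math.exp(gamma0 + gamma1 * math.log(t))


def gnb_log_likelihood(counts, means, traffic, gamma0, gamma1):
    """Sum of NB log-probabilities with segment-specific dispersion."""
    return sum(
        nb_log_pmf(y, mu, gnb_dispersion(t, gamma0, gamma1))
        for y, mu, t in zip(counts, means, traffic)
    )
```

With gamma1 set to zero the dispersion is the same for every segment and the expression reduces to the ordinary NB log-likelihood, which is the sense in which the two model families differ only in the dispersion term.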

Model result
Based on the collected data described in Section 3, we calibrated the estimated parameters of the six main models and the specific models cited above. The goodness of fit was also calculated. The results are shown in Table 5. Based on the above analysis, we selected model 6 as the CPM and expressed it as follows:

$$N = \exp\big(3.18 + 0.60\ln T - 11.70\ln L + 7.85\ln R + 0.015[\ln T]^2 + 8.49[\ln L]^2 - 2.41[\ln R]^2 + 0.64\ln T\ln L - 1.09\ln T\ln R - 5.80\ln L\ln R\big)$$

The excessive dispersion coefficient is expressed as a function of T. Here N is the estimated number of crashes per year on the basic segment; T is the annual average daily traffic of the basic segment; L is the length of the basic segment; and R is the radius of the basic segment.
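The calibrated model 6 can be evaluated numerically along the following lines. The coefficients are those reported above; the assignment of each coefficient to T (AADT), L (segment length), and R (curve radius), the order of the cross terms, and the exponential link are assumptions of this sketch, inferred from the order in which the variables are defined. The units of T, L, and R are not stated in the text, so no numerical prediction is shown.

```python
import math

# Coefficients of model 6; the variable assigned to each key is an assumption.
MODEL6_COEF = {
    "const": 3.18, "T": 0.60, "L": -11.70, "R": 7.85,
    "TT": 0.015, "LL": 8.49, "RR": -2.41,
    "TL": 0.64, "TR": -1.09, "LR": -5.80,
}


def predict_crashes(t, l, r, coef=MODEL6_COEF):
    """Expected yearly crash count from a three-variable translog form."""
    ln_t, ln_l, ln_r = math.log(t), math.log(l), math.log(r)
    eta = (
        coef["const"]
        + coef["T"] * ln_t + coef["L"] * ln_l + coef["R"] * ln_r
        + coef["TT"] * ln_t ** 2 + coef["LL"] * ln_l ** 2 + coef["RR"] * ln_r ** 2
        + coef["TL"] * ln_t * ln_l
        + coef["TR"] * ln_t * ln_r
        + coef["LR"] * ln_l * ln_r
    )
    return math.exp(eta)
```

Because of the cross terms, the sensitivity of the prediction to one variable depends on the values of the other two, which is the interactive influence that the models without cross variables cannot represent.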

Prediction analysis with real data
To demonstrate the effectiveness of the prediction, we predicted the crashes of a certain freeway with model 1, model 3, and model 6. Then, we compared the results with the real crash data collected from the institutions. The results are shown in Table 6.
The predictions of the models are close to the actual casualties. When the standard deviation, which measures the dispersion of a data distribution, is used as the reference, the result of model 6 is much closer to that of the real crash data than those of the other two models. For the maximum and minimum values, the forecast range of model 6 is very close to the actual situation.
Based on the above discussion, model 6 is the best among the six models.
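The comparison described in this section (mean, standard deviation, minimum, and maximum of predicted versus observed crashes) can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation is an assumption; the paper does not say which estimator it applies.

```python
import statistics


def summarize(values):
    """Summary statistics used to compare predicted and observed crashes."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumed)
        "min": min(values),
        "max": max(values),
    }


def closest_model(actual, predictions, key="std"):
    """Name of the model whose summary statistic is nearest the observed one.

    `predictions` maps a model name to its list of predicted crash counts.
    """
    target = summarize(actual)[key]
    return min(
        predictions,
        key=lambda name: abs(summarize(predictions[name])[key] - target),
    )
```

Comparing the spread and the range in addition to the mean guards against a model that matches the average crash count while compressing the differences between safe and hazardous segments.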

Conclusion
The analysis sheds light on crash prediction for the SHCS of a freeway. The interactive influence among the explanatory variables of freeway traffic crashes was analyzed with the translog (super logarithmic) production function. Six main models, comprising 10 models in total, were compared using the AIC, BIC, and Pseudo R-2 criteria. Among them, model 6, in which the interactive influence is considered, is much better than the other models. From the detailed analysis, the following conclusions have been drawn.
(1) With sufficient samples and data, the GNB model can be used effectively to analyze the interactive influence of explanatory variables and to predict crashes on freeway basic segments.
(2) When T is selected as the excessive dispersion coefficient parameter, the AIC and BIC values of the models tend to be smaller and the Pseudo R-2 value tends to be larger, which indicates that the models using this parameter fit better than the others.
(3) When the interactive influence among traffic volume, horizontal curve radius, and curve length is taken into consideration, the goodness of fit of the crash prediction is much better.
(4) Further, the predictions of the relatively good models (model 1, model 3, and model 6) were compared with the real data. In summary, sufficient samples were surveyed to establish the CPM of SHCS. Thus, the result is reliable, as shown by an example.
The findings of this study can help enhance understanding of the relationship among traffic volume, horizontal curve radius, and curve length. Such understanding is important in developing crash prevention strategies for specific conditions. For example, the findings can guide designers in choosing the horizontal radius and curve length. Moreover, the results could serve as a basis for implementing a variable speed limit on curves to reduce crash risk on a hazardous roadway segment. However, further efforts should be made to demonstrate the differences between the NB and GNB models. The experimental data were limited; thus, the model fit still falls short of the ideal.
Nevertheless, highway traffic crashes are a universal concern, and this article adds new ideas. Furthermore, the model offers reference value for crash prediction in general.
Previous studies applied the translog (super logarithmic) function to both the loss and the frequency of traffic crashes. All the basic formulas are built on formula (1) by introducing second-order cross variables and using the translog cost function (TCF) form to extend the NB and Poisson models. Consequently, the interaction between the variables can be reflected. António et al. established the logarithmic function model based on AADT, segment length, density of access, and time trend variables. A criterion was introduced to validate the translog NB model. In this translog functional form, the explanatory variables enter through their logarithms, their squared logarithms, and the pairwise products of their logarithms.

$Q_i$ = AADT for a specific year of segment i; $L_i$ = the length of segment i; and $\alpha_k$ (k = 0, 1, 2, ..., 5) = estimated parameters.

The Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the Pseudo R-2 test were used to evaluate the goodness of fit of the CPM of SHCS. The excessive dispersion coefficient parameter was selected, and its expression was determined by analyzing the goodness of fit obtained with different candidate parameters. Then, we compared the goodness of fit of the six models below and ultimately determined the CPM of SHCS. The corresponding forms of each model and the estimated parameters are shown in Table 3.

Table 3. Forms of the models and estimated parameters (surviving rows: models 4.2 and 4.3 are first-order in the logarithms of three explanatory variables; model 5 adds the squared logarithms and the pairwise products of the logarithms).

β: excessive dispersion coefficient. The higher β is, the more scattered the distribution. $V_i$: can represent T, L, R, TR, TL, RL, or TRL of segment i. The determining method is discussed in the following section.
In our study, 88 SHCSs from eight four-lane highways in Guangdong Province were selected for analysis, together with their crash data for the five years from 2008 to 2012. The statistics are shown in Table 2.
Freeway Administration and Maintenance Centers (FAMC; 7 freeways, 593.099 km in total), and additional results provided by other scholars. Table 1 presents the sample size.
The negative binomial model 2 contains three specific models (model 2.1, model 2.2, and model 2.3). The generalized negative binomial model 4 also contains three specific models (model 4.1, model 4.2, and model 4.3).

www.preprints.org) | NOT PEER-REVIEWED | Posted: 15 August 2016 doi:10.20944/preprints201608.0144.v1
(2) AIC is used to evaluate whether the model is useful. The smaller it is, the better the model. (3) BIC rests on the principle that, for any given problem, the decision rule of the likelihood ratio test attains the smallest error probability; the smaller it is, the better the model.

Table 5 indicates an obvious interactive influence between two variables. Model 5 and model 6, which take the interactive influence into consideration, fit better, particularly when compared with model 2 and model 4. Thus, we ignored models 2 and 4 directly. For the models with three parameters, model 6 is better than model 5. In addition, we found that the Pseudo R-2 of model 6 is larger than those of model 1 and model 3, indicating that model 6 is much better than model 1 and model 3 with regard to goodness of fit.