Hybrid Machine Learning and Geographic Information Systems Approach - A Case For Grade Crossing Crash Data Analysis
∗ Corresponding author.
July 2, 2020 8:18 WSPC/2424-922X 244-ADSAA 2050003
1. Introduction
Highway-rail grade crossings (HRGCs) are critical spatial locations for transportation
safety where the highway and railroad tracks meet at the same elevation. They are
an important source of safety concern to railway and highway authorities and the public
at large. Accidents at HRGCs are often catastrophic, with serious consequences
such as fatalities, injuries, environmental disasters and extensive property
damage [Raub (2007)]. From 2008 to 2017, 80.2% of fatal railroad accidents and
by Ahmed Lasisi on 08/26/20. Re-use and distribution is strictly not permitted, except for Open Access articles.
43.4% of injured railroad casualties in the US were attributed to HRGCs [FRA (2019)].
In California, there were over 1,260 HRGC accidents within the same
period, with the number of casualties per accident shown in Fig. 1. Due to safety
Adv. Data Sci. Adapt. Data Anal. 2020.12. Downloaded from www.worldscientific.com
reasons, HRGC accident analysis has drawn considerable attention for decades.
Effective accident analysis and prediction methodologies are strongly desired to help
transportation agencies and other stakeholders accurately predict and significantly
reduce the casualties due to HRGC accidents.
The study of HRGC accidents has a considerably rich history. The earliest
studies date back to the early 1940s [Peabody and Dimmick (1941)]. Recent works
have considered developing prediction models for HRGC accidents to estimate the
expected number of accidents based on historical data [Lu and Zheng (2017)]. The
model used only three factors: average annual daily traffic (AADT), average daily
train traffic and the presence of warning devices, to predict the accident rate in
a specific time interval. Because only a few important predictors were considered
in this study, there is significant room for improvement. The U.S. Department of
Transportation (USDOT) previously developed an improved HRGC accident
prediction model by taking crossing design factors into account, such as types of
gate [Ogden (2007)], but it is difficult to quantify the contribution of each factor
to the HRGC accident rate. In recent years, researchers have developed various models
to predict HRGC accident frequencies and probabilities depending on a number
of factors (i.e. explanatory variables). Regression-based models were widely
Fig. 1. Summary of casualties due to HRGC accidents in California from 2008 to 2017 [Caltrans
(2019)].
investigated due to the random, discrete and nonnegative nature of the accident
data [Miaou et al. (2005)].
Austin and Carson [2002] preferred the negative binomial regression model to
analyze HRGC accident frequency. The study attempted to quantify the relative
magnitudes of accident attributes as significant factors for HRGC accident
frequency. Lu and Tolliver [2016] compared six different regression models based
on data obtained from the North Dakota Department of Transportation (NDDOT),
and they concluded that Bernoulli, Hurdle Poisson and Conway–Maxwell–Poisson
regression are more effective than other regression-based models. However,
regression-based models are potentially erroneous because they may assume incorrect
statistical distributions for accident/casualty data [Lord and Mannering (2010)]; and
because accident data is not guaranteed to follow the specified distribution, the
developed model may not perform well on an unseen test set.
Furthermore, a decision tree approach was implemented by Lu and Zheng [2017]
for HRGC accidents, and the authors found that train speed and highway and railroad
traffic volumes are significant variables. The study also revealed that advance train
detecting devices and warning systems are potentially useful for reducing HRGC
accident probability. More recently, Khan and Lee [2018] proposed a logit regression
model to predict HRGC accident probabilities. The authors postulated that
the population located within a five-mile radius of an HRGC was a significant
factor worth considering in HRGC accident analysis.
Fig. 2. Ten-year outlook of HRGC accidents in California. (a) Distribution of train accidents
by time of day and (b) grade crossing accidents by month from 2008 to 2017.
The combination of GIS spatial data and other databases allows engineers to visually
analyze and consider more explanatory variables, with the expectation of improving
the performance and understanding of HRGC accident models.
In this study, safety data is obtained from the Federal Railroad Administration
(FRA) and the California Department of Transportation (Caltrans). The data includes
railway inventory data, highway Annual Average Daily Traffic (AADT), highway
speed limits, 10-year accident records and Geographic Information System (GIS)
shape files. A brief outlook of the collected accident data is provided in Fig. 2 while
copious details and description are later discussed in the case study. Figure 2(a)
shows that most HRGC accidents occurred around peak periods in the morning
and especially in the evening. This finding is not necessarily true for all HRGC
accidents in the United States [Hao et al. (2016)]. The chart also shows that the
effect of visibility due to daylight may not be significant as later verified by advanced
statistical methods employed in this paper. However, as far as time of the year is
concerned, there is insufficient statistical evidence to suggest that HRGC accidents
are peculiar to a particular month of the year (Fig. 2(b)).
This study will help both transportation agencies and other stakeholders to understand
HRGC safety by identifying the key contributing factors, high-risk HRGCs and
expected casualty numbers in GIS. The final result will provide valuable information
and assistance for safety researchers and professionals to make informed decisions
for improving safety at HRGCs.
1 | Peabody and Dimmick [1941] | Highway | Predict expected HRGC accident number | Empirical statistics | Developed HRGC accident model and identified AADT and presence of warning devices as significant factors
2 | USDOT [1986] and Ogden [2007] | Highway-Railway | Predict expected casualties in HRGC accident | Regression-based formulae | Improved Peabody–Dimmick model by including design information
3 | Austin and Carson … | HRGC | Predict HRGC accident … | Negative binomial … | Build a simplified regression …
4 | Oh et al. [2006] | HRGC | Predict HRGC crashes | Poisson, Gamma regression | Overcome under-dispersion problem of data
5 | Zhao and Khattak [2015] | HRGC | Investigates drivers injury severity in … | Probit model, multi-nomial logit … | Random parameter logit model and multinominal model are …
Table 1. (Continued )
… model
8 | Ghomi et al. [2016] | Highway | Compared injury severity risk factors | Association rules, classification and regression trees | Identified train speed, vulnerable road users and gender as most influential accident factors
9 | Lu and Zheng … | HRGC | Predict HRGC crash … | Decision trees | Overcome the limitation of …
attempted to build on existing studies and modify certain parameters without sig-
nificant difference in scope [Faghri and Demetsky (1980); Skinner et al. (1997)].
One of the earliest applications of GIS focused mainly on information safety at
HRGCs with little or no deliberate attempt to make accident or casualty predic-
tions [Panchanathan and Faghri (1995)]. The state-of-the-art has been consistently
advanced by implementing several regression-based models, statistical significance
tests and decision trees to make HRGC accident predictions. In recent times, multi-
purpose GIS products have made concerted efforts to improve railway inventory to
enable HRGC safety analytics [Wright (2016)]. While these advances have been
multi-faceted and multi-dimensional, this study attempts to bridge some of the
gaps in HRGC accident prediction by introducing machine-learning geospatial data
science.
The role of machine learning in most engineering disciplines is becoming increasingly
relevant, and transportation safety engineering is no exception.
Attoh-Okine [2017] is a comprehensive resource for machine learning applications
in railway engineering specifically. The idea of machine learning developed from the
evolution of rational, repeatable techniques that learn from data and improve their
performance through an iterative process that comprises training, validation and
testing [Lasisi and Attoh-Okine (2018)].
In very broad terms, regression modeling can be regarded as the simplest form
of supervised machine learning, especially if the model performance is evaluated on
a data set that is different from that on which the regression model was developed
[Hastie et al. (2009)]. The same can also be said of decision trees and binary
logit models built (see Table 1) for either regression or classification purposes
[James et al. (2013)]. Therefore, it is not hyperbolic to say that there have been
hints of machine learning in HRGC accident analysis, except that the application
has been very limited. In addition, different machine learning techniques have not
been critically evaluated on the subject matter. This analysis is
important because it should provide the basis on which a machine learning technique is
applied, since different methods are better suited to different tasks. In this study,
the authors have improved the decision tree approach to HRGC accidents [Yan et al.
(2010); Lu and Zheng (2017); Ghomi et al. (2016)] by aggregating predictions from
hundreds of trees or more. Another improvement includes spatial visualization of
predicted casualties using proportional symbol maps. Lastly, a parameter importance
plot is also provided to rank the importance of predictor variables from the top
performing selected models. Moving forward, the popular classes of machine learning
relevant to this study are introduced in Sec. 2.1.
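shown as
\[ y_i = f(X), \tag{1} \]
where
\[ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \tag{2} \]
\[ X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}. \tag{3} \]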
This study focuses mainly on supervised learning, with both classification and
regression problems discussed in Sec. 2.1.1.
3. Research Framework
3.1. Formulation
This research aims to simplify the multi-step analytical process required to
make sense of HRGC accident data by employing the framework in Fig. 3. The
analysis starts by collecting grade crossing inventory data from Caltrans, ten-year
HRGC accident data from FRA, and highway traffic and geometry data from
FHWA and Caltrans, respectively. Collecting these data is not enough; a merging
field common to all databases must be identified, which in this case is the
HRGC identifier (GXID). Using the GXID field, over 1200 crossings were identified
with at least one accident over the past 10 years in California. The merged data
is then spatially presented in GIS to identify hot-spots and possible leads.
The next step is geospatial data analytics. In this phase, there are
four sub-analyses: exploratory data analysis, classification and regression problems
on merged accident data, and prediction or forecasting of casualties on future
network data. The last phase of the analysis examines what the new predictions
portend for relevant stakeholders through visual analytics in GIS.
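The merging step on the common GXID field can be sketched as a simple inner join. The records, the field names other than GXID (`track_class`, `aadt`, `casualties`) and the helper functions below are hypothetical and only illustrate the idea; they are not the study's actual data or code.

```python
# Hypothetical records; only the GXID key comes from the text.
inventory = [
    {"GXID": "A001", "track_class": 4},
    {"GXID": "A002", "track_class": 3},
    {"GXID": "A003", "track_class": 5},
]
traffic = [
    {"GXID": "A001", "aadt": 12000},
    {"GXID": "A002", "aadt": 800},
    {"GXID": "A003", "aadt": 4500},
]
accidents = [
    {"GXID": "A001", "casualties": 2},
    {"GXID": "A003", "casualties": 0},
    {"GXID": "A001", "casualties": 1},
]


def index_by_gxid(rows):
    """Map each GXID to its record (assumes one record per crossing)."""
    return {row["GXID"]: row for row in rows}


def merge_on_gxid(accidents, inventory, traffic):
    """Inner-join each accident with the inventory and traffic rows sharing its GXID."""
    inv, trf = index_by_gxid(inventory), index_by_gxid(traffic)
    merged = []
    for acc in accidents:
        key = acc["GXID"]
        if key in inv and key in trf:
            merged.append({**inv[key], **trf[key], **acc})
    return merged


merged = merge_on_gxid(accidents, inventory, traffic)
# Crossings with at least one recorded accident
crossings_with_accidents = {row["GXID"] for row in merged}
```

Each merged row carries the crossing, traffic and accident attributes together, which is the form a learning algorithm would need as input.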
Next, a detailed description of the classification and regression problems
considered in this study is presented.
3.1.1. Classification
Let $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ be a set of n HRGC accidents from some network or
municipality, and let $y = \{y^{(1)}, y^{(2)}, \ldots, y^{(n)}\}$ be the related response. Given that each
Fig. 3. A step-wise framework for hybrid GIS-machine learning analysis and prediction of HRGC
accidents.
\[ y = \begin{pmatrix} \vdots \\ \text{0-no casualties} \end{pmatrix}. \tag{4} \]
Examples of machine learning tools used for classification in this study include
logistic or binary regression, support vector classifier, gradient boosting classifier,
etc.
3.1.2. Regression
A regression problem, on the other hand, is very similar to classification in that
both follow the structure of Eq. (4), except that $y_i$ is a “count-continuous”
variable, i.e.
\[ y = \begin{pmatrix} \text{0-casualties} \\ \text{1-casualties} \\ \text{2-casualties} \\ \vdots \\ \text{n-casualties} \end{pmatrix}. \tag{5} \]
The main idea behind regression is to find an optimal vector of coefficients, w*
that minimizes the sum of squared errors over the independent variables X. In
other words,
\[ w^{*} = \arg\min_{w} \sum_{i \in x} \left( y^{(i)} - w^{T} x^{(i)} \right)^{2}. \tag{6} \]
If a constraint $\sum_i w_i = 1$ is specified, then the coefficients are simply
standardized. This gives an idea of feature or parameter importance at its most
elementary level.
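As a minimal sketch of Eq. (6), the least-squares coefficients can be computed with NumPy on synthetic data. The data, the "true" weights and the rescaling step are assumptions for illustration: the rescaling reads the normalization constraint as making the coefficient magnitudes sum to one, which may differ from the authors' exact procedure.

```python
import numpy as np

# Synthetic design matrix and response (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=200)

# Eq. (6): w* minimizes the sum of squared errors (y_i - w^T x_i)^2
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Assumed reading of the constraint: rescale so the magnitudes sum to one,
# giving a crude relative importance for each predictor
importance = np.abs(w_star) / np.abs(w_star).sum()
```

With standardized predictors, the largest rescaled magnitude points to the predictor with the strongest linear effect on the response.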
The term “count-continuous” has been employed because the number of casualties
is neither strictly continuous nor an arbitrary discrete label; it takes natural-number
values. One way to treat count data is as nominal categories, but that loses the idea
that five casualties may be worse than three. Ordinal categories, however, can be
disguised as regression target responses if passed as qualitative responses in a
multi-class classification problem. This way, methods like the cumulative logit model,
polytomous logistic model and adjacent-category logistic model may be applicable
[Ananth and Kleinbaum
(1997)]. Classically, regression has been performed on HRGC casualty count data
using Poisson, binomial, and negative binomial regression (Table 1). The shortcoming
often overlooked in this approach is that count data is restricted to a given
distribution type, which does not take cognizance of unseen data. To eliminate the
problem of distribution biases, this study treats HRGC casualty counts as continuous
but with a coding clause that restricts the number of casualties to integers.
Details are provided in the case study. The regression techniques implemented
in this study include tree-based RF regression, bagging, SVR, gradient-boosting
regression, etc., without having to assume any distribution for accident
data. The best of these models is eventually selected for predicting the number of
casualties if any HRGC accident occurs.
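One possible form of such a coding clause is sketched below: continuous model outputs are rounded to the nearest integer and floored at zero. The function name `to_casualty_counts` and the round-then-clip rule are assumptions for illustration, not the authors' stated implementation.

```python
import numpy as np


def to_casualty_counts(raw_predictions):
    """Round continuous predictions to the nearest integer and floor them at zero."""
    rounded = np.rint(np.asarray(raw_predictions, dtype=float))
    return np.clip(rounded, 0, None).astype(int)


# Raw regression outputs may be fractional or slightly negative
counts = to_casualty_counts([-0.3, 0.46, 1.6, 2.49, 7.9])
```

Here a raw prediction of 0.46 becomes 0 casualties and 1.6 becomes 2, so the reported values stay valid counts while the regressor itself is trained on a continuous target.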
3.1.4. Cross-validation
Although most of the analyses listed in Table 1 involved models trained and
tested on the same set, a simple improvement is to split the data set into
training and testing sets. Instead of splitting the data into two,
this study employs training with cross-validation, wherein the data is split into k folds
[James et al. (2013)]. During training, each regression or classification technique is
fitted on k − 1 folds and the held-out fold is predicted. The choice of k is usually 5
or 10; the former was used in this study. The performance over all folds is then
averaged. The performance metric could be accuracy, ROC score, etc. (for classification)
or mean absolute error, root mean squared error, etc. (for regression).
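The 5-fold procedure can be sketched with scikit-learn as follows. The synthetic data, the choice of `RandomForestRegressor` and all parameter values are assumptions for illustration rather than the study's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic non-negative integer target standing in for casualty counts
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4))
y = np.clip(np.rint(1.5 * X[:, 0] + rng.normal(scale=0.5, size=150) + 1), 0, None)

model = RandomForestRegressor(n_estimators=50, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Fit on k - 1 folds, score the held-out fold, repeat for each of the 5 folds
scores = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_absolute_error")
mean_score = scores.mean()
```

scikit-learn reports error metrics in negated form so that larger values are always better; this is one plausible reading of the negative error values listed in Table 4.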
were obtained from USDOT as part of the National Transportation Atlas Database
(NTAD) geospatial files [USDOT (2018a)]. A summary of the variables used in
this analysis is provided in Table 2. The table indicates that some parameters have
not been considered in the machine learning model building. This is because these
parameters are difficult to specify for any future HRGC accident. The HRGC acci-
dents in California from 2008 to 2017 can also be spatially examined from Fig. 4.
While California has been selected for this study, similar analysis can be done for
different states or any specific rail network of interest.
• All types of trains (freight/passenger) were assumed to run within the maximum
allowed speed on tracks.
• For crossings that do not have recorded speed limits, a speed of 40 mph
was assumed, as most of the HRGCs seem to be on rural roads. Also, the speed
limit data for each crossing was taken from the closest road segment that has a
speed limit recorded. This assumption was necessary due to the lack of up-to-date,
accurately recorded data.
Fig. 4. California HRGC accidents from 2008 to 2017 mapped based on casualty levels.
Fig. 5. Bar plots of HRGC accident features. (a) Summary of HRGC accidents by track class
and (b) overview of HRGC accidents’ control device.
can be observed that most parameters are uncorrelated except for a few whose
correlation is quite obvious (e.g. AADT, Track Class and Freight/Passenger Speed).
AADT (point). The same coordinate system was used in order to match and collate
the geospatial information. Based on the location of HRGCs and 10-year accidents,
variables which can be obtained from other layers are spatially joined in a table for
each of HRGCs and 10-year accidents. To ensure the highest match of geographic
information, the joining criterion was the intersection of the matching field (HRGC or
GXID), which implies that the output layer will only join data from the exact
same location. Where the “intersects” criterion could not be met, the
“closest” location was used to maintain reasonable accuracy. As a result, two
tables containing all variables for 10-year accidents and HRGCs were integrated
from ArcMap into the Python Scikit-learn machine learning module [Perrot (2011)].
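The "intersects, otherwise closest" joining rule can be sketched in plain Python. This is not the ArcMap workflow used in the study: the record layout, the planar `x`/`y` coordinates and the helper `join_attribute` are hypothetical, and the nearest point is found by straight-line distance in an assumed projected coordinate system.

```python
import math

# Hypothetical point layers in a shared projected coordinate system
crossings = [
    {"GXID": "A001", "x": 0.0, "y": 0.0},
    {"GXID": "A002", "x": 10.0, "y": 10.0},
]
aadt_points = [
    {"GXID": "A001", "x": 0.0, "y": 0.0, "aadt": 12000},
    {"GXID": None, "x": 9.0, "y": 10.5, "aadt": 800},
    {"GXID": None, "x": 50.0, "y": 50.0, "aadt": 300},
]


def join_attribute(crossings, aadt_points):
    """Attach AADT to each crossing: exact GXID match first, else the nearest point."""
    by_id = {p["GXID"]: p for p in aadt_points if p.get("GXID")}
    joined = []
    for c in crossings:
        match, rule = by_id.get(c["GXID"]), "intersects"
        if match is None:
            # Fall back to the closest traffic point by planar distance
            match = min(
                aadt_points,
                key=lambda p: math.hypot(p["x"] - c["x"], p["y"] - c["y"]),
            )
            rule = "closest"
        joined.append({**c, "aadt": match["aadt"], "join_rule": rule})
    return joined


joined = join_attribute(crossings, aadt_points)
```

Recording which rule produced each match makes it easy to audit how many crossings relied on the fallback.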
• RF
• KNN
• Gaussian Naive Bayes Classifier (NB)
• Multi-Layer Perceptron-Neural Network (MLP-NN)
• Support Vector Machine (SVM)
• ET
• Bagging
• Gradient Boosting Machine
• Logistic Regression
Due to the nature of this paper, it is difficult to introduce all of the above
machine learning techniques in sufficient detail. However, Marsland [2015] provides
a thorough introduction to these methods, and the Scikit-learn Python implementation
is well detailed in Perrot [2011]. Nonetheless, the hyperparameters for the selected
classification method are given in Table 3 alongside performance results for
reproducibility. All of these techniques were trained on our HRGC accident data
with the aim of predicting whether or not there will be at least one casualty in
an HRGC incident. The performance metrics selected are mean accuracy score and
ROC score [Perrot (2011)]. These scores were averaged over all test sets from 5-fold
cross-validation as initially introduced.
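A comparison of this kind can be sketched with scikit-learn's `cross_validate`, scoring each candidate on mean accuracy and ROC AUC over 5 folds. The synthetic data, the two candidate models, their settings and the use of stratified folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic binary target: 1 = at least one casualty, 0 = none
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300) > 0).astype(int)

candidates = {
    "SVC": SVC(C=100, kernel="rbf", gamma="scale"),
    "ET": ExtraTreesClassifier(n_estimators=100, random_state=0),
}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, clf in candidates.items():
    cv = cross_validate(clf, X, y, cv=folds, scoring=["accuracy", "roc_auc"])
    # Average each metric over the five held-out folds
    results[name] = (cv["test_accuracy"].mean(), cv["test_roc_auc"].mean())
```

Each entry of `results` holds the pair (mean accuracy, mean ROC AUC), which is the kind of summary a table of classifier performance would report.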
Table 3 shows that the support vector classifier (SVC) had the best
performance for both scores, although it was tied by ET on the mean accuracy score. This
implies that we have a model that can almost tell if there will be a casualty based
on the features of an HRGC accident. Which features are more crucial than others?
This question is answered by a feature ranking or parameter importance plot from
a node impurity measure in Fig. 8. The feature importance can be evaluated from
most tree methods through the relative depth or rank of the feature used as a
node in decision trees when predicting a target variable (casualty or not).
Features used near the top of a tree contribute to the final decision for a larger
fraction of the samples. The expected fraction of the samples a particular feature
contributes to is therefore used as a measure of parameter importance [Gilles (2014)].
The feature importance
8 ET 0.9890 0.9814
9 Bagging 0.7247 0.5330
Notes: *C = 100, cache_size = 200, class_weight = None, coef0 = 0.0,
was built from the RF model in Table 3 and identifies train speed as the most
influential cause of casualty. Other important attributes include AADT in both
directions, followed by speed of freight train and track class.
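An impurity-based importance ranking of this kind can be sketched with a random forest in scikit-learn. The feature names and the synthetic data-generating rule are hypothetical; only the general technique (ranking predictors by `feature_importances_`) is illustrated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature names; the synthetic target depends mostly on the first one
feature_names = ["train_speed", "aadt", "track_class", "noise"]
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4))
y = (2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to one across features
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Because the importances sum to one, the combined share of the top few predictors can be read directly from the ranking.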
After identifying SVM as the best classifier, the analysis proceeded to predict
the number of casualties over the next 10 years. The average number of casualties
per HRGC accident is estimated as 0.56 for the past 10 years (i.e. roughly 1 casualty in
2 accidents). The SVC model is then used to predict the HRGCs that are likely
to have accident casualties based on the attributes discussed in Fig. 8. This was
• $N_1 = E(N_{\text{casualties}}) = \frac{N_{CT} \times F_1}{T_1} \times P_1$,
• $N_2 = E(N_{\text{casualties}} \mid \text{CasualtyAccidents}) = \frac{N_{CT} \times F_2}{T_2} \times P_2$,
It turned out that P1 = 2,756 and P2 = 3,920. This was one way to approach the
problem through classification. However, a regression approach was also employed,
albeit with an assumption: all HRGCs in California will experience at least one
accident over the next 10 years. While this assumption may be too conservative, it
provides an infrastructure manager with the tools to prepare for the worst possible
scenario and plan accordingly. This assumption is also mathematically convenient:
without it, most machine learning models would make predictions solely
to minimize regression errors, leading to unreasonable conclusions
about casualties at all grade crossings (e.g. predicting 0.46 casualties everywhere
just to minimize regression errors).
The top performing models from the classification problem (Table 3) were
selected for the regression task because they fit the merged data better
than the others. These models include:
• Gradient Boosting
• RF
• SVR
• Extra Trees
Appropriate performance metrics were selected for the regression
problem, and the results are presented in Table 4.
Ideally, this is a regression problem to estimate the number of casualties over the
next 10 years and the best models are selected based on the negative performance
Table 4. Model performance for K = 5 cross-validation and new network casualty prediction details.
Gradient boosting
Statistic | Mean absolute error | Mean squared logarithm error | Mean squared error | Variance score
Count | 5 | 5 | 5 | 5
Mean | −1.873924 | −0.435331 | −1.068924 | −1.05956
Standard deviation | 0.797066 | 0.185745 | 1.153132 | 1.140142
Minimum | −2.654431 | −0.62665 | −2.832591 | −2.784307
25% | −2.561478 | −0.574176 | −1.627836 | −1.640404
50% | −2.08149 | −0.499451 | −0.465734 | −0.464299
75% | −1.11481 | −0.250066 | −0.373498 | −0.392295
Maximum | −0.957412 | −0.226311 | −0.044963 | −0.016495

RF
Statistic | Mean absolute error | Mean squared logarithm error | Mean squared error | Variance score
Count | 5 | 5 | 5 | 5
Mean | −1.814649 | −0.405415 | −0.690739 | −0.653648
Standard deviation | 0.543695 | 0.118665 | 0.787275 | 0.740251
Minimum | −2.58639 | −0.577813 | −1.664205 | −1.549561
25% | −2.008705 | −0.437551 | −1.429086 | −1.369506
50% | −1.85721 | −0.41412 | −0.192118 | −0.19197
75% | −1.450058 | −0.335951 | −0.114745 | −0.112291
Maximum | −1.170882 | −0.261642 | −0.053541 | −0.04491

A. Lasisi, P. Li & J. Chen
SVR
Statistic | Mean absolute error | Mean squared logarithm error | Mean squared error | Variance score
Count | 5 | 5 | 5 | 5
Mean | −1.233071 | −0.235713 | −0.022048 | −0.000894
Standard deviation | 0.570092 | 0.108938 | 0.009974 | 0.006424
Minimum | −2.064331 | −0.36167 | −0.035684 | −0.011303
25% | −1.549041 | −0.332196 | −0.026637 | −0.000635
50% | −1.024874 | −0.213381 | −0.023034 | 0.000118
75% | −0.875248 | −0.165449 | −0.014092 | 0.001041
Maximum | −0.651859 | −0.105868 | −0.010793 | 0.006308

ET
Statistic | Mean absolute error | Mean squared logarithm error | Mean squared error | Variance score
Count | … | … | 5 | 5
Mean | … | … | −1.193771 | −1.140999
Standard deviation | … | … | 1.473711 | 1.406793
Minimum | −1.989131 | −0.492459 | −3.158127 | −2.970808
25% | −1.942624 | −0.409286 | −2.392738 | −2.342679
50% | −1.827451 | −0.38607 | −0.329957 | −0.329593
75% | −1.599188 | −0.363111 | −0.066808 | −0.056177
Maximum | −1.311221 | −0.314384 | −0.021227 | −0.005741
Table 4. (Continued )

Bagging
Statistic | Mean absolute error | Mean squared logarithm error | Mean squared error | Variance score
Count | 5 | 5 | 5 | 5
Mean | −1.811455 | −0.404935 | −0.692795 | −0.656378
Standard deviation | 0.544439 | 0.118774 | 0.792393 | 0.746444
Minimum | −2.575482 | −0.575287 | −1.675227 | −1.56524
25% | −2.01307 | −0.441124 | −1.433314 | −1.372525
50% | −1.864968 | −0.414289 | −0.185118 | −0.184999
75% | −1.441645 | −0.333777 | −0.115333 | −0.112652
Maximum | −1.162111 | −0.260199 | −0.054981 | −0.046477

*SVR (Prediction)
Statistic | Mean absolute error | Mean squared logarithm error | Mean squared error | Variance score
Count | 5 | 5 | 5 | 5
Mean | −1.324953 | −0.289896 | −0.62813 | −0.607173
Standard deviation | 0.480072 | 0.083338 | 0.814063 | 0.831534
Minimum | −1.969159 | −0.377479 | −1.679185 | −1.676488
25% | −1.579628 | −0.34306 | −1.341049 | −1.341002
50% | −1.348536 | −0.317237 | −0.050375 | −0.013265
75% | −0.953141 | −0.240563 | −0.050259 | −0.003564
Maximum | −0.774304 | −0.171142 | −0.019784 | −0.001545

*SVR (C = 1000.0, cache_size = 200, class_weight = None, coef0 = 0.0, decision_function_shape = ‘ovr’, degree = 3, gamma = 0.1,
kernel = ‘rbf’, max_iter = −1, probability = True, random_state = None, shrinking = True, tol = 0.001, verbose = False)
Hybrid Machine Learning and GIS Approach
Fig. 9. California HRGC casualty prediction for the next 10 years (2017–2026).
metrics in Table 4 according to Urbanowicz and Moore [2015]. The best model
during the 5-fold cross-validation turned out to be SVR; and it was then selected
to predict the number of casualties at every HRGC over the next 10 years (SVR
(Prediction)) in Table 4. These predictions are presented in Fig. 9.
Based on the regression learning, the total number of casualties for the next 10
years was estimated to be P3 = 1,791, involving a total of 1,372 HRGC accidents
with casualties. Recall the assumption that every HRGC
will experience at least one accident. Therefore, an estimated 20% of HRGC accidents
would involve casualties, while 1 casualty is anticipated in every four
HRGC accidents.
Figure 10 shows the regression parameter importance initially discussed, and
there is a consensus on highway traffic (AADT) being the most important factor.
Before any major conclusions and recommendations were made, a systems-
action-management (SAM) approach was conducted with text analytics of high-
casualty HRGC accidents in California within the aforestated period (2008 to 2017).
Fig. 10. Regression learning variable importance plots. (a) Random forest variable importance
and (b) Gradient boosting variable importance.
been applied in several fields including but not limited to offshore safety, aerospace,
maritime engineering, etc. [Aven (2015)]. In order to make decision recommenda-
tions in HRGC accidents, a qualitative SAM study was conducted and results are
also presented to complement the quantitative analysis initially presented. It is a
three-step process that involves:
• Most severe HRGC incidents involve trucks and trailers. Are the highway vehi-
cles too slow to maneuver or make decisions at Highway Crossings? Should long
vehicles be mandated to stop at every grade crossing just like buses?
• The railroads report may be biased against the highway users who are often
blamed for either inattentiveness or deliberately violating traffic signs. Were there
obscure traffic lights or limited sight distances?
• What should be done to address highway user inattentiveness and how can a
traffic sign violator be stopped? Create additional physical barriers or grade sep-
aration at hot spots?
Table 5. Summary of California HRGC accidents with double-digit casualties from 2008 to 2017.
Contra Costa
125592 Oct-12 BNSF Passenger (AMTRAK) Hanford/Kings 47(0) 79 mph 30 mph/42 4[60, 80]
1282012 Jan-12 SCRT Transit (SCRT) FruitRidge/Sacramento 16(3) 41 mph 15 mph/62 3[40, 60]
40613 Apr-13 SCAX Commuter (SCRT) San Fernando/Los Angeles 22(0) 76 mph 5 mph/26 4[60, 80]
90616 Sep-16 SCAX Commuter (SCRT) Burbank/Los Angeles 16(0) 15 mph 0 mph/– 4[60, 80]
• Report analysis shows that highway users are often traveling at low speeds
while trains are at high speeds during the examined HRGC accidents. There-
fore, should HRGCs be considered in positive train control (PTC) for automatic
slowdown?
• Is work ethic (long shifts/hours) a major cause of truck driver inattentiveness at
HRGCs? What should truck companies and stakeholders be doing?
• How can automated trucks improve this process?
While the above provides a qualitative outlook to severe HRGC accidents in this
case study, a holistic synthesis has been provided in the following discussion.
4.4. Discussion
The machine learning algorithms considered in this study are able to accurately
predict HRGC accidents and the corresponding number of casualties if any. The
accuracy of the prediction can be as good as 98.9% with an ROC score of 0.98.
A total of 15 explanatory variables, which includes crossing attributes, highway
attributes as well as both train and motor traffic features were considered. The
analysis clearly identified train speed, AheadAADT and BackAADT as the most
important predictors for HRGC accidents based on considered accident data. The
total contribution of these three factors is over 55% as illustrated in Fig. 8. The
accident prediction results are presented in a GIS map (Fig. 9). HRGC accidents
are marked by colored solid circles on the map. The severity of the accidents is
classified into low risk, moderate risk and high risk, which are represented by green,
yellow and red, respectively. The GIS map with prediction results provides an easy
and visually appealing method to identify HRGC accident locations, hot-spots and
corresponding severities for transportation authorities or other stakeholders. The
results from casualty predictions over the next 10 years can assist infrastructure or
welfare managers to prepare insurance plans and safety/capital investment programs
based on well-thought-out numbers. How much should be spent to reduce casualties?
Should speed limits be reduced at hot-spots or should long vehicles be mandated to
stop at HRGCs? These are follow-up questions that can be addressed based on the
results of this study. Such information allows stakeholders to evaluate the safety of
each HRGC and implement appropriate plans to reduce future occurrences.
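The two bookkeeping steps described above (summing the contribution of the leading predictors and mapping a predicted casualty count to a map color) can be sketched as follows. This is a minimal illustration, not the study's implementation: the importance scores are hypothetical placeholders chosen only to be consistent with the "over 55%" statement, the class thresholds `moderate_from` and `high_from` are assumptions, and the function names `combined_share`, `risk_level` and `marker_color` are invented for this sketch.

```python
from typing import Dict, Iterable

# Hypothetical importance scores, for illustration only. The actual values
# are the ones reported in Fig. 8; these merely sum to 1.0 and keep the
# three named predictors above a 55% combined share.
ILLUSTRATIVE_IMPORTANCES: Dict[str, float] = {
    "TrainSpeed": 0.25,
    "AheadAADT": 0.17,
    "BackAADT": 0.15,
    "AllOtherVariables": 0.43,  # the remaining 12 variables, lumped together
}

# Color scheme described for the GIS map (Fig. 9).
RISK_COLORS: Dict[str, str] = {"low": "green", "moderate": "yellow", "high": "red"}


def combined_share(importances: Dict[str, float], names: Iterable[str]) -> float:
    """Fraction of total importance carried by the named features."""
    total = sum(importances.values())
    return sum(importances[name] for name in names) / total


def risk_level(predicted_casualties: float,
               moderate_from: float = 1.0,
               high_from: float = 10.0) -> str:
    """Bin a predicted casualty count into low / moderate / high risk.

    The two thresholds are assumptions made for this sketch; the study's
    own class boundaries should be substituted.
    """
    if predicted_casualties >= high_from:
        return "high"
    if predicted_casualties >= moderate_from:
        return "moderate"
    return "low"


def marker_color(predicted_casualties: float) -> str:
    """Map a predicted casualty count to the marker color used on the map."""
    return RISK_COLORS[risk_level(predicted_casualties)]


if __name__ == "__main__":
    top_three = ["TrainSpeed", "AheadAADT", "BackAADT"]
    share = combined_share(ILLUSTRATIVE_IMPORTANCES, top_three)
    print(f"Combined share of top three predictors: {share:.0%}")
    for casualties in (0, 4, 16):
        print(casualties, marker_color(casualties))
```

Keeping the thresholds as function parameters makes it easy to test how sensitive the hot-spot map is to the chosen class boundaries before the results are handed to stakeholders.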
By comparing the past 10 years of HRGC accident data with the prediction results,
a few conclusions can be drawn. Northern California HRGC casualties are likely to
decrease significantly if train speeds are reduced at crossings (Fig. 9). One way this
can be achieved is through PTC [Zhang et al. (2018)]. However, the situation at a few
HRGCs did not improve, perhaps due to increased traffic. In these locations, grade
separation is an option that can be closely examined. In some parts of Southern
California, however, predicted casualties are also lower, especially north of
Santa Ana. The attributes that work well in these locations can be studied and
applied in other locations. In Central California, e.g. in the San Joaquin Valley,
accidents and casualties are reduced in general. However, more high-risk
HRGC accidents are predicted for Coalinga and Delano. This can be attributed
to heavy truck activity around the region. These and similar regions present
opportunities for automated or self-driving trucks, whose safety promises are
likely to surpass human capabilities. In the meantime, however, truck-driver
welfare should be critically examined if the recurrent “highway-user inattentiveness”
cause cited in FRA reports, as evident from the qualitative SAM study, is to be
addressed.
A closer look at Fig. 9 also shows that some locations without HRGC casualties
from 2008 to 2017 are not exempt from future casualty accidents. For example, the
line from Edwards Air Force Base to Ridgecrest in Southern California is predicted
to have four low-severity HRGC casualties over the next 10 years. This prognosis
contradicts the frequentist assumption that future accidents are only due to past
occurrences. Unfortunately, the current FRA HRGC accident prediction is based
on this assumption [FRA (2018c)]. With the approach implemented in this study,
this Web Accident Prediction System (WBAPS) can be improved.
Lastly, visual inspection (Figs. 4 and 9) shows that both the 10-year
accident history and the predicted casualties occur more frequently in densely
populated areas, where AADT is higher. However, high-casualty locations
differ between the two maps because of the stochastic nature of the attributes
involved. The results of this analysis could be greatly improved by adding heavy
vehicle percentages for each road, because the SAM analysis shows that vehicles like
buses and heavy trucks have a higher likelihood of high-severity casualties than
others.
5. Concluding Remarks
In this research, the authors present a hybrid approach to analyzing and predicting
HRGC accidents. This quantitative approach is complemented by a qualitative SAM study
before decision-making recommendations are provided. California is used as the case
study because it showcases a large rail infrastructure with a diverse mix of heavy
lines, rural and urban freight/passenger train services, as well as an ample mix of
highway congestion and remote locations.
A major shortcoming of past works is that they assume a distribution for casualty
count data when predicting HRGC accidents. This assumption often restricts performance
testing to the training set, with very little guarantee of high test performance
[Sellers et al. (2017)]. This study examines the use of different machine learning
predictions without relying on this assumption. Moreover, the embedded
implementation of these techniques to predict future casualties in GIS provides an
Acknowledgments
The authors appreciate the contributions of Professor Rachel Davidson
towards the success of this study.
References
Ananth, C. V. and Kleinbaum, D. G. (1997). Regression models for ordinal responses:
A review of methods and applications. Technical Report 6.
Available at: https://faculty.washington.edu/heagerty/Courses/b571/homework/Ananth-Kleinbaum-1997.pdf.
ArcMap, E. (2018). ArcMap — ArcGIS Desktop. Available at: http://desktop.arcgis.
com/en/arcmap/.
Attoh-Okine (2017). Big Data and Differential Privacy. Wiley Series in Operations
Research and Management Science.
Austin, R. D. and Carson, J. L. (2002). An alternative accident prediction
model for highway-rail interfaces. Technical Report. Available at: www.elsevier.
com/locate/aap.
Aven, T. (2015). Risk assessment and risk management: Review of recent advances on
their foundation, Eur. J. Operat. Res. 253: 1–13. Available at: http://dx.doi.org/
10.1016/j.ejor.2015.12.023.
Caltrans (2019). Caltrans GIS Data Library. Available at: http://www.dot.ca.gov/
hq/tsip/gis/datalibrary/.
Erdogan, S., Yilmaz, I., Baybura, T. and Gullu, M. (2008). Geographical information
systems aided traffic accident analysis system case study: City of Afyonkarahisar,
Accid. Anal. Prev. 40: 174–181. doi:10.1016/j.aap.2007.05.004.
Query/inccaus.aspx.
Ghomi, H., Bagheri, M., Fu, L. and Miranda-Moreno, L. F. (2016). Analyzing injury
severity factors at highway railway grade crossing accidents involving vulnerable
road users: A comparative study. Traffic Inj. Prev., 17: 833–841. Available at:
https://www.tandfonline.com/action/journalInformation?journalCode=gcpi20,
doi:10.1080/15389588.2016.1151011.
Gilles, L. (2014). Understanding random forests from theory to practice. Ph.D.
Thesis, University of Liege. Available at: https://arxiv.org/pdf/1407.7502.pdf,
arXiv:1407.7502v3.
Hao, W., Kamga, C. and Wan, D. (2016). The effect of time of day on driver’s injury sever-
ity at highway-rail grade crossings in the United States. J. Traffic Transp. Eng.
(Engl. Ed.), 3: 37–50. Available at: http://dx.doi.org/10.1016/j.jtte.2015.10.006,
doi:10.1016/j.jtte.2015.10.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning.
Vol. 1. Springer. Available at: http://www.springerlink.com/index/10.1007/b94608,
arXiv:1010.3003.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to
Statistical Learning, 1st edn. Springer Texts in Statistics. Available at:
http://www-bcf.usc.edu/~gareth/ISL/ISLR First Printing.pdf,
doi:10.1007/978-1-4614-7138-7, arXiv:1011.1669v3.
Khan, I. U. and Lee, E. (2018). Developing a highway rail grade crossing accident
probability prediction model: A North Dakota case study. MDPI Open Access J. Safety,
1–12. doi:10.3390/safety4020022.
Lasisi, A. and Attoh-Okine, N. (2018). Principal components analysis and track qual-
ity index: A machine learning approach. Transport. Res. Part C, Emerg. Tech-
nol., 91: 230–248. Available at: https://www.sciencedirect.com/science/article/
pii/S0968090X18304303, doi:10.1016/J.TRC.2018.04.001.
Lauren, B. (2017). Machine Learning in ArcGIS. Available at: https://www.esri.
com/arcgis-blog/products/arcgis-pro/analytics/machine-learning-in-arcgis/.
Lord, D. and Mannering, F. (2010). The statistical analysis of crash-frequency data: A
review and assessment of methodological alternatives. Transport. Res. Part A. Available at:
https://ac.els-cdn.com/S0965856410000376/1-s2.0-S0965856410000376-main.pdf? tid=6bb8c0b9-b68c-4a2b-bd8d-e21f9d2cedb8&acdnat=1549919900 2d69ff7617f08ffaa19bb2c32aac5993,
doi:10.1016/j.tra.2010.02.001.
Lu, P. and Tolliver, D. (2016). Accident prediction model for public highway-rail
grade crossings. Accid. Anal. Prev. 90: 73–81. Available at: http://dx.doi.org/
10.1016/j.aap.2016.02.012, doi:10.1016/j.aap.2016.02.012.
Lu, P. and Zheng, Z. (2017). Accident Prediction for Highway-Rail Grade Crossings: A
Model Comparison of Decision Tree and Neural Network.
Marsland, S. (2015). Machine Learning: An Algorithmic Perspective. 2nd edn. Chapman
& Hall/CRC, CRC Press, Boca Raton.
Mennecke, B. E. and Crossland, M. D. (1996). Geographic Information Systems:
Applications and Research Opportunities for Information Systems Researchers. Proc. 29th
Annual Hawaii Int. Conf. System Sciences, Wailea, HI, USA.
Miaou, S.-P. and Song, J. J. (2005). Bayesian ranking of sites for engineering
safety improvements: Decision parameter, treatability concept, statistical criterion,
and spatial dependence. Accid. Anal. Prev., 37: 699–720. Available at:
https://ac.els-cdn.com/S0001457505000497/1-s2.0-S0001457505000497-main.pdf? tid=0f6b2960-e04a-434b-af0b-545b81ee2322&acdnat=1549661509 3e64e28a8fcf71266cb8718529b77bd6,
doi:10.1016/j.aap.2005.03.012.
Ogden, B. D. (2007). Railroad Highway Grade Crossing Handbook. US Department of
Transportation, Federal Highway Administration.
Oh, J., Washington, S. P. and Nam, D. (2006). Accident prediction model for railway-
highway interfaces. Accid. Anal. Prev., 38: 346–356. doi:10.1016/j.aap.2005.10.004.
Panchanathan, S. and Faghri, A. (1995). Artificial Intelligence and Geographical
Information. Transport. Res. Rec. J. Transp. Res. Board, 1114. Available at:
http://onlinepubs.trb.org/Onlinepubs/trr/1995/1497/1497-012.pdf.
Pate-Cornell, M. E. (1990). Organizational Aspects of Engineering Systems Safety: The
case of Offshore Platforms. J. Risk Anal.
Peabody, L. and Dimmick, T. (1941). Accident Hazards at Grade Crossings. Public Roads,
22: 12–130.
Perrot, É. D. F. P. G. V. A. (2011). Scikit-learn: Machine Learning in Python. Available
at: http://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf.
Raub, R. A. (2007). Examination of Highway–Rail Grade Crossing Collisions Nation-
ally from 1998 to 2007. Transport. Res. Rec. J. Transport. Res. Board, 63–71.
doi:10.3141/2122-08.
Rohit Singh (2018). How we did it: Integrating ArcGIS and deep learning at
UC 2018. Available at: https://www.esri.com/arcgis-blog/products/api-python/
analytics/how-we-did-it-integrating-arcgis-and-machine-learning-at-uc-2018/.
Sellers, K. F., Swift, A. W. and Weems, K. S. (2017). A flexible distribution class
for count data. J. Stat. Distrib. Appl., 4: 22. Available at: https://jsdajournal.
springeropen.com/track/pdf/10.1186/s40488-017-0077-0, doi:10.1186/s40488-017-
0077-0.
Skinner, R. E., Barry, T. F. and Berry, B. J. (1997). Transportation Research Board
Executive Committee 1999 Officers, National Cooperative Highway Research
Program. Technical Report. Available at:
http://onlinepubs.trb.org/onlinepubs/nchrp/nchrp syn 271.pdf.
Urbanowicz, R. J. and Moore, J. H. (2015). ExSTraCS 2.0: Description and evaluation of
a scalable learning classifier system. Evol. Intell., 8: 89–116. Available at: http://
link.springer.com/10.1007/s12065-015-0128-8, doi:10.1007/s12065-015-0128-8.
USDOT (2018a). Direct Download of National Transportation Atlas Database
(NTAD) Geospatial Files — Bureau of Transportation Statistics. Available at:
https://www.bts.gov/geography/geospatial-portal/NTAD-direct-download.
USDOT (2018b). Safety and Health — US Department of Transportation. Available at:
https://www.transportation.gov/policy/transportation-policy/safety.
Wright, R. (2016). Integration of Grade Crossing Data into FRA’s GIS Program 2016 ESRI
Rail Summit. Technical Report. Available at:
https://www.esri.com/events/rail-summit/~/media/B9A885BB7CBA4DD0A7C607E5EA384D71.ashx.
Yan, X., Richards, S. and Su, X. (2010). Using hierarchical tree-based regression model
to predict train-vehicle crashes at passive highway-rail grade crossings. Accid. Anal.
Prev., 42: 64–74. Available at:
https://ac.els-cdn.com/S0001457509001687/1-s2.0-S0001457509001687-main.pdf? tid=6e062379-14c7-492d-883a-bb9e8fc0161f&acdnat=1550079562 6da1f2263e88f713453e4fbf769f3bfb,
doi:10.1016/j.aap.2009.07.003.
Zhang, Z., Liu, X. and Holt, K. (2018). Positive Train Control (PTC) for railway
safety in the United States: Policy developments and critical issues. Util. Policy,
51: 33–40. Available at: https://doi.org/10.1016/j.jup.2018.03.002,
doi:10.1016/j.jup.2018.03.002.
Zhao, S. and Khattak, A. (2015). Motor vehicle drivers’ injuries in train-motor vehi-
cle crashes. Accid. Anal. Prev., 74: 162–168. Available at: http://dx.doi.org/
10.1016/j.aap.2014.10.022, doi:10.1016/j.aap.2014.10.022.