Hybrid Machine Learning and Geographic Information Systems Approach - A Case For Grade Crossing Crash Data Analysis

July 2, 2020 8:18 WSPC/2424-922X 244-ADSAA 2050003
Advances in Data Science and Adaptive Analysis

Vol. 12, No. 1 (2020) 2050003 (30 pages)

© World Scientific Publishing Company
DOI: 10.1142/S2424922X20500035
Hybrid Machine Learning and Geographic Information Systems Approach - A Case for Grade
Crossing Crash Data Analysis

Ahmed Lasisi∗ , Pengyu Li† and Jian Chen‡

Department of Civil and Environmental Engineering
University of Delaware, Newark, DE, USA
∗aolasisi@udel.edu
†lipengyu@udel.edu
‡jianchen@udel.edu
Received 26 February 2020

Revised 29 February 2020
Accepted 2 March 2020
Published 6 July 2020
Highway-rail grade crossing (HRGC) accidents continue to be a major source of
transportation casualties in the United States. This can be attributed to increased road
and rail operations and/or a lack of adequate safety programs based on comprehensive
HRGC accident analysis, among other reasons. The focus of this study is to predict HRGC
accidents in a given rail network based on a machine learning analysis of a similar
network with cognate attributes. This study improves on past studies that either attempt
to predict accidents at a given HRGC or spatially analyze HRGC accidents for a particular
rail line. A case for a hybrid machine learning and geographic information systems (GIS)
approach is presented for a large rail network. The study involves collection and
wrangling of relevant data from various sources, exploratory analysis, and supervised
machine learning (classification and regression) of HRGC data from 2008 to 2017 in
California. The models developed from this analysis were used to make binary predictions
[98.9% accuracy and 0.9838 Receiver Operating Characteristic (ROC) score] and
quantitative estimations of HRGC casualties in a similar network over the next 10 years.
With results presented spatially in GIS, this novel hybrid application of machine
learning and GIS in HRGC accident analysis will help stakeholders to proactively address
casualties by tackling the major accident causes identified in this study. The paper
concludes with a Systems-Action-Management (SAM) approach based on text analysis of HRGC
accident risk reports from the Federal Railroad Administration.
Keywords: Grade crossing accidents; geographic information systems; machine learning; railway/highway engineering.
∗ Corresponding author.
1. Introduction
Highway-rail grade crossings (HRGCs) are critical spatial locations for transportation
safety where highway and railroad tracks meet at the same elevation. They are an
important source of safety concern to railway and highway authorities and the public at
large. Accidents at HRGCs are often catastrophic, with serious consequences such as
fatalities, injuries, environmental disasters and extensive property damage
[Raub (2007)]. From 2008 to 2017, 80.2% of fatal railroad accidents and 43.4% of
railroad injury casualties in the US were attributed to HRGCs [FRA (2019)]. In
California alone, there were over 1,260 HRGC accidents within the same period, with the
number of casualties per accident shown in Fig. 1. For these safety reasons, HRGC
accident analysis has drawn considerable attention for decades. Effective accident
analysis and prediction methodologies are strongly desired to help transportation
agencies and other stakeholders accurately predict and significantly reduce the
casualties due to HRGC accidents.
The study of HRGC accidents has a considerably rich history. The earliest studies date
back to the early 1940s [Peabody and Dimmick (1941)]. Recent works have considered
developing prediction models for HRGC accidents to estimate the expected number of
accidents based on historical data [Lu and Zheng (2017)]. The early Peabody and Dimmick
model used only three factors: average annual daily traffic (AADT), average daily train
traffic and the presence of warning devices, to predict the accident rate in a specific
time interval. Because only a few important predictors were considered in that study,
there is significant room for improvement. The U.S. Department of Transportation (USDOT)
previously developed an improved HRGC accident prediction model by taking crossing
design factors into account, such as types of gate [Ogden (2007)], but it is difficult
to quantify the contribution of each factor to the HRGC accident rate. In recent years,
researchers have developed various models to predict HRGC accident frequencies and
probabilities depending on a number of factors (i.e. explanatory variables).
Regression-based models were widely investigated due to the random, discrete and
nonnegative nature of the accident data [Miaou et al. (2005)].

Fig. 1. Summary of casualties due to HRGC accidents in California from 2008 to 2017
[Caltrans (2019)].

Austin and Carson [2002] preferred the negative binomial regression model to analyze
HRGC accident frequency. The study attempted to quantify the relative magnitudes of
accident attributes as significant factors for HRGC accident frequency. Lu and Tolliver
[2016] compared six different regression models based on data obtained from the North
Dakota Department of Transportation (NDDOT) and concluded that Bernoulli, Hurdle Poisson
and Conway–Maxwell–Poisson regression are more effective than other regression-based
models. However, regression-based models are potentially erroneous because they may
assume incorrect statistical distributions for accident/casualty data [Lord and
Mannering (2010)]; because accident data is not guaranteed to follow the specified
distribution, the developed model may not perform well on an unseen test set.
Furthermore, a decision tree approach was implemented by Lu and Zheng [2017] for HRGC
accidents, and the authors found that train speed, highway traffic volume and railroad
traffic volume are significant variables. The study also revealed that advance train
detection devices and warning systems are potentially useful for reducing HRGC accident
probability. More recently, Khan and Lee [2018] proposed a logit regression model to
predict HRGC accident probabilities. The authors postulated that the population located
within a five-mile radius of an HRGC is a significant factor worth considering in HRGC
accident analysis.
1.1. Highway grade crossing accidents

HRGC accidents have historically been analyzed using different methods. Several factors
contribute to the occurrence of an HRGC incident, while a subset of these factors is
particularly responsible for severity or casualty levels.
In previous studies, most of the research has focused on identifying and quantifying the
relationship between the probability of an accident and explanatory variables. However,
a model to predict whether an HRGC accident happens and the estimated number of
casualties (if it does happen) has not been extensively investigated. Meanwhile, the
performance of some models can be questionable because of the limited number of
explanatory variables used. HRGC accidents are influenced by a wide range of predictor
or explanatory variables. These variables are often collected from different sources,
and it is essential to synchronize the different databases to obtain more explanatory
variables for building a more powerful model to predict the likelihood of HRGC
accidents. Moreover, doing this in GIS is an area worth investigating because it
provides a visual support system for safety decision-making. A GIS is a system designed
to capture, store, manipulate, analyze, manage, and present spatial or geographic data
[Mennecke and Crossland (1996)]. It has been popularly used for visualization of
accident data and analysis of hot spots on highways [Erdogan et al. (2008)]. However,
GIS has not been effectively used for HRGC accident analysis along with data from other
databases such as those of the FRA and FHWA.
Fig. 2. Ten-year outlook of HRGC accidents in California. (a) Distribution of train
accidents by time of day and (b) grade crossing accidents by month from 2008 to 2017.
The combination of GIS spatial data and other databases allows engineers to visually
analyze and consider more explanatory variables, with the expectation of improving the
performance and understanding of HRGC accident models.
In this study, safety data is obtained from the Federal Railroad Administration (FRA)
and the California Department of Transportation (Caltrans). The data includes railway
inventory data, highway Annual Average Daily Traffic (AADT), highway speed limits,
10-year accident records and Geographic Information System (GIS) shape files. A brief
outlook of the collected accident data is provided in Fig. 2, while further details and
descriptions are discussed later in the case study. Figure 2(a) shows that most HRGC
accidents occurred around peak periods in the morning and especially in the evening.
This finding is not necessarily true for all HRGC accidents in the United States
[Hao et al. (2016)]. The chart also shows that the effect of visibility due to daylight
may not be significant, as later verified by the advanced statistical methods employed
in this paper. As far as time of the year is concerned, there is insufficient
statistical evidence to suggest that HRGC accidents are peculiar to a particular month
of the year (Fig. 2(b)).
Because different stakeholders are involved in an analysis that sits at the intersection
of railways, roadways and sometimes highways, the relevant data sourced requires
extensive pre-processing. The data is then merged and adequately synchronized before any
meaningful analysis can proceed. As part of the analysis, a diverse group of machine
learning algorithms is carefully selected. These include support vector regression
(SVR), K-nearest neighbors (KNN), Naïve Bayes regression (NB), Multilayer Perceptron
(MLP), Random Forest (RF), Extra Trees (ET), Bagging and Gradient Boosting (GBM). These
algorithms are trained and tested, and the best of these models is used to predict HRGC
accident casualties over the next decade. The best algorithm is chosen by comparing
prediction accuracy and receiver-operating characteristic (ROC) curve score, and it is
then used to identify the key variables in the data set. The optimized machine learning
model is incorporated into GIS to allow stakeholders to identify high-risk HRGCs and the
number of possible casualties with visual assistance.
Based on the foregoing, the objectives of this paper can be summarized as
follows:
• Developing a framework to enable merging of different databases in order to obtain
adequate significant predictors for HRGC accident prediction;
• Model selection for optimized machine learning algorithms to predict HRGC accidents
and the corresponding number of casualties;
• Implementing the optimized machine learning algorithm in GIS to identify high-risk
HRGCs and the corresponding number of casualties.
This study will help both transportation and other stakeholders to understand
HRGC safety by identifying the key contributing factors, high-risk HRGCs and
expected casualty numbers in GIS. The final result will provide valuable information
and assistance for safety researchers and professionals to make informed decisions
for improving safety at HRGCs.
2. Improvements to the State-of-the-art

Table 1 summarizes the research trend from the early days of HRGC accident analysis and
prediction to date. The earliest methods focused on developing empirical mathematical
formulae built on five years of state accident data. A few studies attempted to build on
existing studies and modify certain parameters without significant difference in scope
[Faghri and Demetsky (1980); Skinner et al. (1997)]. One of the earliest applications of
GIS focused mainly on safety information at HRGCs with little or no deliberate attempt
to make accident or casualty predictions [Panchanathan and Faghri (1995)]. The
state-of-the-art has been consistently advanced by implementing several regression-based
models, statistical significance tests and decision trees to make HRGC accident
predictions. In recent times, multipurpose GIS products have made concerted efforts to
improve railway inventory to enable HRGC safety analytics [Wright (2016)]. While these
advances have been multi-faceted and multi-dimensional, this study attempts to bridge
some of the gaps in HRGC accident prediction by introducing machine learning-based
geospatial data science.

Table 1. A summary of relevant literature on HRGC accidents.

S/No. | Authors (years) | Transportation area | Objective | Tools and techniques | Remarks
--- | --- | --- | --- | --- | ---
1 | Peabody and Dimmick [1941] | Highway | Predict expected HRGC accident number | Empirical statistics | Developed HRGC accident model and identified AADT and presence of warning devices as significant factors
2 | USDOT [1986] and Ogden [2007] | Highway-Railway | Predict expected casualties in HRGC accident | Regression-based formulae | Improved Peabody–Dimmick model by including design information
3 | Austin and Carson [2002] | HRGC | Predict HRGC accident frequency | Negative binomial regression | Built a simplified regression-based model and identified magnitude as significant factor
4 | Oh et al. [2006] | HRGC | Predict HRGC crashes | Poisson, Gamma regression | Overcame under-dispersion problem of data
5 | Zhao and Khattak [2015] | HRGC | Investigate drivers' injury severity in train-motor crashes | Probit model, multinomial logit model, random parameter logit model | Random parameter logit model and multinomial model are more suitable for injury severity analysis
6 | Hao et al. [2016] | Highway driver safety | Effect of time of day on driver's injury in HRGC car crash | Probit model and statistical tests | Identified morning and evening peak periods for severe car crashes
7 | Lu and Tolliver [2016] | HRGC | Compare regression-based models for HRGC accident prediction with under-dispersion data | Various regression-based models: Poisson model, gamma model, negative binomial model, Conway–Maxwell–Poisson model, Bernoulli model, zero-inflated Poisson model | Bernoulli, Conway–Maxwell–Poisson, and Poisson hurdle are proposed for assessing HRGC accident data with under-dispersion
8 | Ghomi et al. [2016] | Highway | Compare injury severity risk factors | Association rules, classification and regression trees | Identified train speed, vulnerable road users and gender as most influential accident factors
9 | Lu and Zheng [2017] | HRGC | Predict HRGC crash likelihood | Decision trees | Overcame the limitation of regression-based models and identified 23 significant factors for HRGC accidents
10 | Khan and Lee [2018] | HRGC | Predict HRGC accident probability | Binary logit regression | Identified population as a significant factor for HRGC accidents

The role of machine learning in most engineering disciplines is becoming progressively
relevant, and transportation safety engineering is no exception. Attoh-Okine [2017]
provides a comprehensive resource on machine learning applications in railway
engineering specifically. The idea of machine learning developed from the evolution of
rational, repeatable techniques that learn from data and improve their performance
through an iterative process that comprises training, validation and testing
[Lasisi and Attoh-Okine (2018)].
In very broad terms, regression modeling can be regarded as the simplest form of
supervised machine learning, especially if the model performance is evaluated on a data
set that is different from that on which the regression model was developed
[Hastie et al. (2009)]. The same can be said of the decision trees and binary logit
models (see Table 1) built for either regression or classification purposes
[James et al. (2013)]. Therefore, it is not hyperbolic to say that there have been hints
of machine learning in HRGC accident analysis, except that the application has been very
limited. In addition, an evaluation of different machine learning techniques on the
subject matter has not been critically investigated. This analysis is important because
it should provide the basis on which a machine learning technique is applied, since
different methods are better suited to different tasks. In this study, the authors have
improved the decision tree approach to HRGC accidents [Yan et al. (2010); Lu and Zheng
(2017); Ghomi et al. (2016)] by aggregating predictions from hundreds of trees or more.
Another improvement is the spatial visualization of predicted casualties using
proportionate symbol maps. Lastly, a parameter importance plot is provided to rank the
importance of predictor variables from the top-performing selected models. Moving
forward, the popular classes of machine learning relevant to this study are introduced
in Sec. 2.1.
2.1. Hybrid machine learning and GIS model

In the broadest possible sense, there are two types of machine learning: supervised
and unsupervised [James et al. (2013)].
Unsupervised learning is indicative of a situation where observations in a sample space
$\omega = \{s_1, \ldots, s_n\}$ take properties $x_1, \ldots, x_p$ (e.g. lane width,
train speed, highway speed limit, visibility, gates, etc.) without any targeted response
$y_i$ (e.g. HRGC accident (binary) or number of casualties). This type of learning is
difficult to monitor even with some expertise in the subject matter. At best, some
cluster-based learning or pattern recognition in similar groups can be identified
without any definitive group assignment or labeling.
Supervised learning in HRGC accident analysis describes a problem whereby each element
in the sample space $\omega = \{s_1, \ldots, s_n\}$ is assigned an associated label or
measurement $y_i$. The learning goal is to create a model that attempts to characterize
the response $y_i$ in terms of the explanatory variables. Mathematically, this is shown
as
\[
y_i = f(X), \tag{1}
\]
where
\[
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \tag{2}
\]
\[
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}. \tag{3}
\]
This study focuses mainly on supervised learning, with both classification and
regression problems discussed in Sec. 2.1.1.
2.1.1. Machine learning and GIS

GIS in transportation engineering needs no detailed introduction, as can be found in
relevant cited studies [Faghri and Demetsky (1980); Mennecke and Crossland (1996);
Erdogan et al. (2008)]. The widespread use of GIS tools has taken center stage in most
spatial analytical studies. Because prediction problems are hard to visualize in spatial
analytics, the need for geospatial data science or machine learning cannot be
over-emphasized. Although recent breakthroughs have been made towards integrating
popular data science tools with GIS [Lauren (2017); Rohit Singh (2018)], this study aims
to expand the implementation to HRGC accident analysis and, by extension, transportation
engineering. The framework for this implementation is provided in Sec. 3.
3. Research Framework
3.1. Formulation
This research aims to simplify the multi-step analytical process required to make sense
of HRGC accident data by employing the framework in Fig. 3. The analysis starts by
collecting grade crossing inventory data from Caltrans, ten-year HRGC accident data from
FRA, and highway traffic and geometry data from FHWA and Caltrans, respectively. It is
not enough to collect these data without identifying a merging field that is common to
all databases; in this case, it is the HRGC identifier (GXID). Using the GXID field,
over 1200 crossings were identified with at least one accident over the past 10 years in
California. The merged data is then spatially presented in GIS to identify hot spots and
possible leads. The next step is geospatial data analytics. This phase has four
sub-analyses: exploratory data analysis, classification and regression problems on the
merged accident data, and prediction or forecasting of casualties on future network
data. The last phase of the analysis examines what the new predictions portend for
relevant stakeholders through visual analytics in GIS.
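The merging step of the framework can be sketched as a key-based join on the crossing identifier. The snippet below is a minimal illustration in plain Python, not the procedure used in the study: apart from GXID, all field names, record layouts and values are hypothetical and are not taken from the FRA, FHWA or Caltrans databases.

```python
from collections import defaultdict

# Hypothetical inventory records keyed by crossing identifier (GXID).
inventory = [
    {"GXID": "A001", "num_tracks": 2, "warning_device": "gates"},
    {"GXID": "A002", "num_tracks": 1, "warning_device": "passive"},
    {"GXID": "A003", "num_tracks": 1, "warning_device": "flashers"},
]

# Hypothetical accident records; a crossing may appear more than once.
accidents = [
    {"GXID": "A001", "year": 2010, "casualties": 1},
    {"GXID": "A001", "year": 2015, "casualties": 0},
    {"GXID": "A003", "year": 2012, "casualties": 2},
    {"GXID": "Z999", "year": 2016, "casualties": 1},  # crossing missing from inventory
]

# Index the inventory by GXID so each accident can look up its crossing.
inventory_by_id = {row["GXID"]: row for row in inventory}

# Inner join: keep only accidents whose GXID exists in the inventory.
merged = [
    {**inventory_by_id[acc["GXID"]], **acc}
    for acc in accidents
    if acc["GXID"] in inventory_by_id
]

# Count accidents per crossing to find crossings with at least one accident.
accidents_per_crossing = defaultdict(int)
for row in merged:
    accidents_per_crossing[row["GXID"]] += 1

print(len(merged), dict(accidents_per_crossing))
```

In this sketch, an accident whose identifier is absent from the inventory is silently dropped by the inner join, so the merged table can be smaller than the raw accident list.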
Next, a detailed description of the classification and regression problems considered in
this study is presented.
3.1.1. Classification
Let $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ be a set of $n$ HRGC accidents from some
network or municipality, and let $y = \{y^{(1)}, y^{(2)}, \ldots, y^{(n)}\}$ be the
related responses. Each HRGC accident $x^{(i)}$ has $m$ features
$x^{(i)}_1, x^{(i)}_2, \ldots, x^{(i)}_m$, e.g. gates, speed limit, train speed,
visibility, etc. Our objective is to predict $y^{(i)}$ for every $x^{(i)}$.

Fig. 3. A step-wise framework for hybrid GIS-machine learning analysis and prediction of
HRGC accidents.

A classification problem arises when the response $y_i$ is discrete rather than
continuous. The binary classification problem considered in this study aims to
categorize grade crossings based on the incidence of casualties in an HRGC accident
within the past 10 years. In other words,
\[
y = \begin{pmatrix}
\text{0-no casualties} \\
\text{1-yes casualties} \\
\vdots \\
\text{0-no casualties}
\end{pmatrix}. \tag{4}
\]
Examples of machine learning tools used for classification in this study include
logistic or binary regression, support vector classifier, gradient boosting classifier,
etc.
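As a small illustration of the binary response in Eq. (4), the sketch below derives a 0/1 label from casualty counts using NumPy. The counts are hypothetical, and the thresholding rule (at least one casualty maps to 1) is our reading of the text rather than code from the study.

```python
import numpy as np

# Hypothetical casualty counts for six HRGC accidents.
casualties = np.array([0, 2, 0, 1, 0, 5])

# Binary response of Eq. (4): 1 if the accident had at least one casualty, else 0.
y_binary = (casualties > 0).astype(int)

print(y_binary.tolist())  # [0, 1, 0, 1, 0, 1]
```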
3.1.2. Regression
A regression problem, on the other hand, is very similar to classification in that both
follow the structure of Eq. (4), except that $y_i$ is a “count-continuous” variable,
i.e.
\[
y = \begin{pmatrix}
\text{0-casualties} \\
\text{1-casualties} \\
\text{2-casualties} \\
\vdots \\
\text{n-casualties}
\end{pmatrix}. \tag{5}
\]
The main idea behind regression is to find an optimal vector of coefficients $w^*$ that
minimizes the sum of squared errors over the independent variables $X$. In other words,
\[
w^* = \operatorname*{argmin}_{w} \sum_{i \in X} \left( y^{(i)} - w^T x^{(i)} \right)^2. \tag{6}
\]

If a constraint $\sum_i w_i = 1$ is specified, then the coefficients are simply
standardized. This gives an idea of feature or parameter importance at its most
elementary level.
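The least-squares problem in Eq. (6) can be illustrated numerically. The following is a minimal sketch using NumPy's `numpy.linalg.lstsq`; the feature columns and casualty values are hypothetical and serve only to show how $w^*$ is obtained and checked.

```python
import numpy as np

# Hypothetical design matrix: columns are train speed (mph) and number of tracks.
X = np.array([
    [30.0, 1.0],
    [45.0, 2.0],
    [60.0, 1.0],
    [50.0, 3.0],
    [20.0, 2.0],
])
# Hypothetical casualty counts for the same five accidents.
y = np.array([0.0, 1.0, 2.0, 2.0, 0.0])

# Least-squares solution of Eq. (6): w* minimizes sum_i (y_i - w^T x_i)^2.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sum of squared errors at the optimum.
sse = float(np.sum((y - X @ w_star) ** 2))

print(w_star, sse)
```

At the optimum the residual vector is orthogonal to the columns of $X$, which gives a quick numerical check that the returned coefficients are indeed the minimizer.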
The term “count-continuous” has been employed because the number of casualties is
treated neither as purely continuous nor as a set of categories; it takes natural-number
values. One way to handle count data is to treat the counts as nominal categories, but
that loses the idea that five casualties may be worse than three. Ordinal categories,
however, can be disguised as regression target responses if passed as qualitative
responses in a multi-class classification problem. This way, methods like the cumulative
logit model, polytomous logistic model and adjacent-category logistic model may be
applicable [Ananth and Kleinbaum (1997)]. Classically, regression has been performed on
HRGC casualty count data using Poisson, binomial, and negative binomial regression
(Table 1). The shortcoming often overlooked in this method is that count data is
restricted to a given distribution type, which does not take cognizance of unseen data.
To eliminate the problem of distribution biases, this study treats HRGC casualty counts
as continuous but with a coding clause that restricts the number of casualties to
integers. Details of this are provided in the case study. The regression techniques
implemented in this study include tree-based RF regression, bagging, SVR,
gradient-boosting regression, etc., without having to assume any distribution for the
accident data. The best of these models is eventually selected for predicting the number
of casualties if any HRGC accident occurs.
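One possible form of the integer-restricting clause is sketched below. This is an assumption for illustration only: the text does not specify the exact rule at this point, so rounding to the nearest integer and clipping at zero are our choices, and the function name `to_casualty_counts` is hypothetical.

```python
import numpy as np

def to_casualty_counts(raw_predictions):
    """Map continuous regression outputs to non-negative integer counts."""
    rounded = np.rint(raw_predictions)             # round to the nearest integer
    return np.clip(rounded, 0, None).astype(int)   # casualties cannot be negative

# Hypothetical continuous outputs from a regression model.
raw = np.array([-0.3, 0.4, 1.6, 2.49, 3.51])
print(to_casualty_counts(raw).tolist())  # [0, 0, 2, 2, 4]
```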
3.1.3. Correlation check

After all the selected models have been trained in classification, a correlation of their
predictions is conducted to examine the differences in the model performances.
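A correlation check of this kind can be sketched with NumPy as follows. The prediction vectors are hypothetical and the three model labels are placeholders; the point is only to show how a pairwise correlation matrix of model outputs exposes classifiers that behave alike or differently.

```python
import numpy as np

# Hypothetical binary predictions from three classifiers on the same eight accidents.
predictions = {
    "RF":  np.array([1, 0, 1, 1, 0, 0, 1, 0]),
    "SVC": np.array([1, 0, 1, 1, 0, 0, 1, 0]),
    "NB":  np.array([1, 1, 0, 1, 0, 1, 1, 0]),
}

names = list(predictions)
# Rows are models, columns are observations; corrcoef correlates the rows.
corr = np.corrcoef(np.vstack([predictions[m] for m in names]))

for name, row in zip(names, corr):
    print(name, np.round(row, 2))
```

Models with identical predictions have a correlation of one, while lower values indicate models that disagree on some observations.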
3.1.4. Cross-validation
Although most of the analyses summarized in Table 1 involved models trained and tested
on the same set, a simple improvement is to split the data set into training and testing
sets. Instead of splitting the data into two, this study employs training with
cross-validation, wherein the data is split into k folds [James et al. (2013)]. During
training, each regression or classification technique is fitted on k − 1 folds and the
held-out fold is predicted. The choice of k is usually 5 or 10; the former was used in
this study. The performance over all folds is then averaged. The performance measure
could be accuracy, ROC score, etc. (for classification) or mean absolute error, root
mean squared error, etc. (for regression).
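The mechanics of k-fold cross-validation can be sketched from first principles. The example below is illustrative only: it uses synthetic data, an ordinary least-squares model as a stand-in for the techniques in this study, and mean absolute error as the metric. The helper names `k_fold_indices` and `cross_validated_mae` are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validated_mae(X, y, k=5):
    """Fit least squares on k-1 folds, score MAE on the held-out fold, then average."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        scores.append(float(np.mean(np.abs(y[test_idx] - X[test_idx] @ w))))
    return float(np.mean(scores)), scores

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

mean_mae, fold_maes = cross_validated_mae(X, y, k=5)
print(len(fold_maes), round(mean_mae, 3))
```

Every observation is held out exactly once, so the averaged score reflects performance on data the model did not see during fitting.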
4. Case Study: Grade Crossing Accidents in California

4.1. Data description and inventory
As mentioned earlier, the railroad/highway inventory and accident data were obtained
from multiple sources including FRA [2018b], Caltrans [2019], and USDOT [2018b]. A
10-year (2008–2017) accident data sheet was exported from FRA as an important input for
machine learning. According to FRA 5.14 section, a total of 1,393 accidents occurred
from 2008 to 2017 [FRA (2018b)]. In this study, 1,263 cases were analyzed because the
HRGC inventory gathered from Caltrans was not up to date, which might have left out some
newly added HRGCs in GIS. In this analysis, only at-grade crossings were considered
because grade separations would not have vehicle-train interference. The input variables
for 6,962 at-grade HRGCs were obtained from Caltrans as well; this inventory was updated
until October 31, 2013 [Caltrans (2019)]. Other variables for the road network were
obtained from USDOT as part of the National Transportation Atlas Database (NTAD)
geospatial files [USDOT (2018a)]. A summary of the variables used in this analysis is
provided in Table 2. The table indicates that some parameters have not been considered
in the machine learning model building. This is because these parameters are difficult
to specify for any future HRGC accident. The HRGC accidents in California from 2008 to
2017 can also be spatially examined in Fig. 4. While California has been selected for
this study, a similar analysis can be done for different states or any specific rail
network of interest.

Table 2. Variables, description and sources.

Variable | Description | Input values | Source | Use in prediction
--- | --- | --- | --- | ---
Casualties | Number of casualties per accident | Numeric | FRA | Yes
Visibility | Depends on daylight, time of day; scale of 1 to 4 | Category | FRA | No
Track type | Mainline or yard, etc. | Category | FRA | No
Train speed | Speed of the train in mph | Numeric | FRA | No
WD | Traffic control type (gates, flashers, or passive) | Category | Caltrans | Yes
Transit YN | Passenger/freight rail | Category/binary | Caltrans | Yes
cracking p | Pavement cracking | Numeric | USDOT | Yes
iri | International roughness index | Numeric | USDOT | Yes
rutting | Pavement rutting | Numeric | USDOT | Yes
speed limi | Speed limit | Numeric | USDOT | Yes
TrackClass | Track class | Category/numeric | Caltrans | Yes
FRT SPEED | Freight train speed | Numeric | Caltrans | Yes
PASS SPEED | Passenger train speed | Numeric | Caltrans | Yes
NUM TRACK | Number of tracks | Numeric | Caltrans | Yes
Back AADT | AADT (west/south of intersection) | Numeric | Caltrans | Yes
Ahead AADT | AADT (east/north of intersection) | Numeric | Caltrans | Yes

4.1.1. Assumptions and limitations

The following assumptions and limitations apply to this study:
• All types of trains (freight/passenger) were assumed to run within the maximum allowed
speed on tracks.
• For crossings without recorded speed limits, a speed limit of 40 mph was assumed, as
most of the HRGCs seem to be on rural roads. Also, the speed limit for each crossing was
taken from the closest road segment that has a recorded speed limit. This assumption was
necessary due to the lack of up-to-date, accurately recorded data.
• Because third-party information sources on train traffic were questionable and not
easily accessible from the FRA database, a conservative uniform traffic was not assumed
across all crossings to allow for randomness.
• The HRGC inventory database is not up to date, so the number of accident records
dropped from 1,394 to 1,263 and the number of at-grade HRGCs was counted as 6,962
[Caltrans (2019)] instead of 9,145, which is listed in FRA 8.05 section as of January
2019 [FRA (2019)].

Fig. 4. California HRGC accidents from 2008 to 2017 mapped based on casualty levels.
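The speed-limit assumption listed above can be expressed as a simple imputation rule. The sketch below is a hedged illustration in plain Python: the records and the helper `fill_speed_limit` are hypothetical, and only the 40 mph fallback value comes from the text.

```python
DEFAULT_SPEED_LIMIT_MPH = 40  # fallback value stated in the assumptions

# Hypothetical crossings; None marks a missing speed limit.
crossings = [
    {"GXID": "A001", "speed_limit": 55},
    {"GXID": "A002", "speed_limit": None},
    {"GXID": "A003", "speed_limit": 35},
]

def fill_speed_limit(records, default=DEFAULT_SPEED_LIMIT_MPH):
    """Return copies of the records with missing speed limits replaced by the default."""
    return [
        {**rec, "speed_limit": default if rec["speed_limit"] is None else rec["speed_limit"]}
        for rec in records
    ]

filled = fill_speed_limit(crossings)
print([rec["speed_limit"] for rec in filled])  # [55, 40, 35]
```

The function returns new records rather than modifying the input, so the original missing values remain available for later auditing.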
4.1.2. Exploratory data analysis

Before proceeding with the analysis, it was important to have a visual understanding of
the merged HRGC accident data. Starting with the influence of track class, Track Class 4
had more accidents, perhaps due to a higher frequency of train travel or more miles
covered on that track class (Fig. 5(a)). Figure 5(b) shows that most of the HRGC
accidents occurred at crossings with physical barriers and gates, with only a few at
crossings with flashers or no gates.

Fig. 5. Bar plots of HRGC accident features. (a) Summary of HRGC accidents by track
class and (b) Overview of HRGC accidents’ control device.

In order to observe any latent relationships between the variables, a correlation
analysis was conducted for all parameters and the matrix is presented in Fig. 6. This is
necessary to identify redundant correlated predictors for model building. It can be
observed that most parameters are uncorrelated except for a few whose correlations are
quite obvious (e.g. AADT, Track Class and Freight/Passenger Speed).
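Screening for redundant predictors with a correlation matrix can be sketched as follows. The data is synthetic and built so that two speed columns are strongly related; the column labels merely echo Table 2, and the 0.8 threshold is an illustrative cut-off, not a value from this study.

```python
import numpy as np

# Synthetic predictors: the second column is built to track the first.
rng = np.random.default_rng(0)
freight_speed = rng.uniform(20, 70, size=200)
passenger_speed = freight_speed + rng.normal(scale=2.0, size=200)  # strongly related
aadt = rng.uniform(500, 20000, size=200)                           # independent

names = ["FRT_SPEED", "PASS_SPEED", "AADT"]
data = np.column_stack([freight_speed, passenger_speed, aadt])

# Pearson correlation between columns (rowvar=False treats columns as variables).
corr = np.corrcoef(data, rowvar=False)

# List predictor pairs whose absolute correlation exceeds the chosen threshold.
threshold = 0.8  # illustrative cut-off
redundant = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(redundant)
```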
4.2. Geospatial data analysis

It is important to combine the variables from the different data sources into a single
data frame so that the covariates and the dependent variable (number of casualties) can
be analyzed using machine learning. ArcMap [2018], a GIS tool, was used to merge the
data spatially because it is convenient for displaying and editing geographic data sets.
To begin with, different layers were created for the railroad network (polyline), road
network (polyline), 10-year accidents (point), HRGCs (point), and AADT (point). The same
coordinate system was used for all layers in order to match and collate the geospatial
information. Based on the locations of the HRGCs and 10-year accidents, variables from
the other layers were spatially joined into a table for each of the HRGCs and 10-year
accidents. To ensure the highest match of geographic information, the joining criterion
was the intersection of the matching field (HRGC or GXID), which implies that the output
layer only joins data from the exact same location. Where the “intersects” criterion
could not be met, the “closest” location was used to maintain reasonable accuracy. As a
result, two tables containing all variables for the 10-year accidents and HRGCs were
transferred from ArcMap into the Python Scikit-learn machine learning module
[Perrot (2011)].

Fig. 6. Correlation matrix of HRGC accident data.
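The "intersects, otherwise closest" joining logic can be sketched outside of a GIS package. The example below is a simplified stand-in in plain Python and does not reproduce ArcMap's spatial join: it assumes projected planar coordinates, and the function names, coordinates and AADT values are hypothetical.

```python
import math

def distance(p, q):
    """Planar Euclidean distance between two (x, y) points in the same projected units."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def join_nearest(crossings, aadt_points):
    """Attach to each crossing the AADT of a coincident point, or of the closest one."""
    joined = []
    for gxid, location in crossings.items():
        nearest = min(aadt_points, key=lambda pt: distance(location, pt["xy"]))
        gap = distance(location, nearest["xy"])
        joined.append({
            "GXID": gxid,
            "aadt": nearest["aadt"],
            "match": "intersects" if gap == 0 else "closest",
            "gap": gap,
        })
    return joined

# Hypothetical projected coordinates (e.g. metres) and traffic counts.
crossings = {"A001": (0.0, 0.0), "A002": (100.0, 50.0)}
aadt_points = [
    {"xy": (0.0, 0.0), "aadt": 12000},
    {"xy": (103.0, 54.0), "aadt": 4500},
]

rows = join_nearest(crossings, aadt_points)
for row in rows:
    print(row)
```

Recording the gap alongside each match makes it possible to review how far the "closest" fallback had to reach and to discard matches that are implausibly distant.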
4.2.1. Machine learning

After appropriate data cleansing and pre-processing, model development for the
first stage of supervised machine learning commenced with the following classifiers:
• RF
• KNN
• Gaussian Naive Bayes Classifier (NB)
• Multi-Layer Perceptron-Neural Network (MLP-NN)
• Support Vector Machine (SVM)
• ET
• Bagging
• Gradient Boosting Machine
• Logistic Regression
Fig. 7. Correlation plot for machine learning classifiers.
Due to the scope of this paper, it is difficult to introduce all of the above
machine learning techniques in sufficient detail. However, Marsland [2015] provides
a thorough introduction to these methods, and the scikit-learn Python implementation
is detailed in Perrot [2011]. Nonetheless, the hyperparameters for the selected
classification method are given in Table 3 alongside the performance results for
reproducibility. All of these techniques were trained on our HRGC accident data
with the aim of predicting whether or not there will be at least one casualty in
an HRGC incident. The performance metrics selected are the mean accuracy score and
the ROC score [Perrot (2011)]. These scores were averaged over all test sets from
the 5-fold cross-validation introduced earlier.
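A minimal sketch of this evaluation loop with scikit-learn is shown here for three
of the listed classifiers. It rests on several assumptions: the data are synthetic
stand-ins generated with make_classification rather than the HRGC table, the folds
are stratified and shuffled (the paper does not say how the folds were drawn), and
the “ROC score” is taken to mean the area under the ROC curve. Only the SVC
settings C = 100, kernel = 'rbf' and gamma = 'auto' come from Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the merged table: 15 explanatory variables and a binary
# target (1 = at least one casualty). This is NOT the study's data.
X, y = make_classification(n_samples=300, n_features=15, n_informative=6,
                           random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "ET": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "SVC": SVC(C=100, kernel="rbf", gamma="auto"),
}

# Assumed fold design: 5 stratified, shuffled folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=("accuracy", "roc_auc"))
    # Average each metric over the five held-out folds.
    results[name] = (scores["test_accuracy"].mean(),
                     scores["test_roc_auc"].mean())
    print(f"{name}: accuracy={results[name][0]:.3f}, ROC AUC={results[name][1]:.3f}")
```

The numbers printed for the synthetic data have no bearing on Table 3; the sketch
only shows how the two scores are averaged over the held-out folds.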
It is evident from Table 3 that the support vector classifier (SVC) had the best
performance on both scores, although ET tied it on mean accuracy. This implies that
we have a model that can almost always tell whether there will be a casualty based
on the features of an HRGC accident. Which features are more crucial than others?
This question is answered by the feature ranking, or parameter importance, plot in
Fig. 8, which is based on a node impurity measure. Feature importance can be
evaluated in most tree methods through the relative depth or rank of the feature
used as a node in the decision trees when predicting the target variable (casualty
or not). Features used near the top of a tree contribute to the final decision for a
larger fraction of the input samples. The expected fraction of the samples a
particular feature contributes to is therefore used as a measure of parameter
importance [Gilles (2014)]. The feature importance was built from the RF model in
Table 3 and identifies train speed as the most influential cause of casualty. Other
important attributes are AADT in both directions, followed by freight train speed
and track class.

Table 3. Model performance for classification problem.
S/No.  Machine learning technique   Mean accuracy   ROC score
1      RF Classifier                0.9720          0.9596
2      Gradient Boosting Machine    0.7332          0.5571
3      Support Vector Classifier    0.9890          0.9838*
4      NB                           0.6747          0.5060
5      KNN                          0.7052          0.5000
6      Logistic Regression          0.7052          0.5000
7      MLP-NN                       0.6857          0.5054
8      ET                           0.9890          0.9814
9      Bagging                      0.7247          0.5330
Notes: *C = 100, cache_size = 200, class_weight = None, coef0 = 0.0,
decision_function_shape = 'ovr', degree = 3, gamma = 'auto', kernel = 'rbf',
max_iter = −1, probability = True, random_state = None, shrinking = True,
tol = 0.001, verbose = False

Fig. 8. Parameter importance for classification problem.
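To make the impurity-based ranking concrete, the sketch below computes the weighted
impurity decrease for two splits of a toy tree and normalizes the totals into
importances. It assumes Gini impurity and a hand-built two-split tree; the labels
and the assignment of splits to "train_speed" and "AADT" are invented for
illustration and are not taken from the study's RF model.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def weighted_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of all samples
    that reach the split node."""
    n = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - children)


# Toy tree on 8 samples (1 = casualty). The root splits on a hypothetical
# "train_speed" feature; its left child then splits on a hypothetical "AADT".
root = [1, 1, 1, 1, 0, 0, 0, 0]
root_left, root_right = [1, 1, 1, 1, 0], [0, 0, 0]
child_left, child_right = [1, 1, 1, 1], [0]

decrease = {
    "train_speed": weighted_decrease(root, root_left, root_right, len(root)),
    "AADT": weighted_decrease(root_left, child_left, child_right, len(root)),
}

# Normalize so that the importances sum to one, as tree libraries commonly do.
total = sum(decrease.values())
importance = {name: value / total for name, value in decrease.items()}
print(importance)
```

The root split sees every sample, so its impurity decrease carries full weight,
while the deeper split is discounted by the share of samples that reach it. In a
forest, such importances are averaged over all trees.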
After identifying SVM as the best classifier, the analysis proceeded to predict
the number of casualties over the next 10 years. The average number of casualties
per HRGC accident is estimated as 0.56 for the past 10 years (i.e. roughly 1
casualty in 2 accidents). The SVC model is then used to predict the HRGCs that are
likely to have accident casualties based on the attributes shown in Fig. 8. This was
carried out in two ways:
• $N_1 = E(N_{\text{casualties}}) = \frac{N_C \times F_1}{T_1} \times P_1$,
• $N_2 = E(N_{\text{casualties}} \mid \text{CasualtyAccidents}) = \frac{N_{CT} \times F_2}{T_2} \times P_2$,
where
$N_1$ is the number of casualties estimated due to all HRGC accidents,
$N_2$ is the number of casualties estimated due to HRGC accidents involving casualties only,
$N_C$ is the total number of casualties in all HRGC accidents,
$N_{CT}$ is the total number of casualties in HRGC accidents involving casualties only,
$F_1$ is the number of times a 0 to n-casualty HRGC accident occurred,
$F_2$ is the number of times a 1 to n-casualty HRGC accident occurred, excluding no-casualty accidents,
$P_1$ is the predicted number of HRGC accidents over the next 10 years; and
$P_2$ is the predicted number of casualties in HRGC accidents over the next 10 years.
It turned out that $P_1$ = 2,756 and $P_2$ = 3,920. This was one way to approach
the problem through classification. However, a regression approach was also
employed, albeit with an assumption: all HRGCs in California will experience at
least one accident over the next 10 years. While this assumption may be too
conservative, it provides an infrastructure manager with the tools to prepare for
the worst possible scenario and plan accordingly. This assumption is also essential
for mathematical convenience because most machine learning models make predictions
with the sole objective of minimizing regression errors, and would otherwise reach
unreasonable conclusions about casualties at all grade crossings (e.g. predicting
0.46 casualties at every crossing just to minimize regression errors).
The top performing models from the classification problem (Table 3) were
selected for the regression task because they fit the merged data better than
the others. These models include:
• Gradient Boosting
• RF
• SVR
• Extra Trees
The following performance metrics were selected for the regression problem, and
the results are presented in Table 4:
• Mean Absolute Error
• Mean Squared Logarithmic Error
• Mean Squared Error
• Variance Score
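The four metrics listed above can be written out directly, which makes their
definitions explicit. The sketch assumes that “Variance Score” refers to the
explained variance score, 1 − Var(y − ŷ)/Var(y), and that the logarithmic error uses
log(1 + y) as in scikit-learn; the casualty counts and predictions are made-up
illustrative values.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSLE, MSE and explained variance for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # Squared difference of log(1 + y); requires non-negative values.
    msle = sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / n

    mean_true = sum(y_true) / n
    mean_err = sum(errors) / n
    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    var_err = sum((e - mean_err) ** 2 for e in errors) / n
    explained_variance = 1.0 - var_err / var_true

    return {"mae": mae, "msle": msle, "mse": mse,
            "explained_variance": explained_variance}


# Hypothetical casualty counts at five crossings and a model's predictions.
y_true = [0, 1, 0, 2, 1]
y_pred = [0.5, 0.5, 0.5, 1.5, 1.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Unlike the three error metrics, the explained variance is a score where higher is
better, and it becomes negative when a model does worse than predicting the mean.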
Ideally, this is a regression problem to estimate the number of casualties over the
next 10 years, and the best models are selected based on the negative performance
metrics in Table 4 according to Urbanowicz and Moore [2015]. The best model
during the 5-fold cross-validation turned out to be SVR; it was then selected
to predict the number of casualties at every HRGC over the next 10 years (SVR
(Prediction) in Table 4). These predictions are presented in Fig. 9.

Table 4. Model performance for K = 5 cross-validation and new network casualty prediction details.

Gradient boosting
Statistic            Mean absolute error   Mean squared logarithm error   Mean squared error   Variance score
Count                5                     5                              5                    5
Mean                 −1.873924             −0.435331                      −1.068924            −1.05956
Standard deviation   0.797066              0.185745                       1.153132             1.140142
Minimum              −2.654431             −0.62665                       −2.832591            −2.784307
25%                  −2.561478             −0.574176                      −1.627836            −1.640404
50%                  −2.08149              −0.499451                      −0.465734            −0.464299
75%                  −1.11481              −0.250066                      −0.373498            −0.392295
Maximum              −0.957412             −0.226311                      −0.044963            −0.016495

RF
Statistic            Mean absolute error   Mean squared logarithm error   Mean squared error   Variance score
Count                5                     5                              5                    5
Mean                 −1.814649             −0.405415                      −0.690739            −0.653648
Standard deviation   0.543695              0.118665                       0.787275             0.740251
Minimum              −2.58639              −0.577813                      −1.664205            −1.549561
25%                  −2.008705             −0.437551                      −1.429086            −1.369506
50%                  −1.85721              −0.41412                       −0.192118            −0.19197
75%                  −1.450058             −0.335951                      −0.114745            −0.112291
Maximum              −1.170882             −0.261642                      −0.053541            −0.04491

SVR
Statistic            Mean absolute error   Mean squared logarithm error   Mean squared error   Variance score
Count                5                     5                              5                    5
Mean                 −1.233071             −0.235713                      −0.022048            −0.000894
Standard deviation   0.570092              0.108938                       0.009974             0.006424
Minimum              −2.064331             −0.36167                       −0.035684            −0.011303
25%                  −1.549041             −0.332196                      −0.026637            −0.000635
50%                  −1.024874             −0.213381                      −0.023034            0.000118
75%                  −0.875248             −0.165449                      −0.014092            0.001041
Maximum              −0.651859             −0.105868                      −0.010793            0.006308

ET
Statistic            Mean absolute error   Mean squared logarithm error   Mean squared error   Variance score
Count                5                     5                              5                    5
Mean                 –                     –                              −1.193771            −1.140999
Standard deviation   –                     –                              1.473711             1.406793
Minimum              −1.989131             −0.492459                      −3.158127            −2.970808
25%                  −1.942624             −0.409286                      −2.392738            −2.342679
50%                  −1.827451             −0.38607                       −0.329957            −0.329593
75%                  −1.599188             −0.363111                      −0.066808            −0.056177
Maximum              −1.311221             −0.314384                      −0.021227            −0.005741

Bagging
Statistic            Mean absolute error   Mean squared logarithm error   Mean squared error   Variance score
Count                5                     5                              5                    5
Mean                 −1.811455             −0.404935                      −0.692795            −0.656378
Standard deviation   0.544439              0.118774                       0.792393             0.746444
Minimum              −2.575482             −0.575287                      −1.675227            −1.56524
25%                  −2.01307              −0.441124                      −1.433314            −1.372525
50%                  −1.864968             −0.414289                      −0.185118            −0.184999
75%                  −1.441645             −0.333777                      −0.115333            −0.112652
Maximum              −1.162111             −0.260199                      −0.054981            −0.046477

*SVR (Prediction)
Statistic            Mean absolute error   Mean squared logarithm error   Mean squared error   Variance score
Count                5                     5                              5                    5
Mean                 −1.324953             −0.289896                      −0.62813             −0.607173
Standard deviation   0.480072              0.083338                       0.814063             0.831534
Minimum              −1.969159             −0.377479                      −1.679185            −1.676488
25%                  −1.579628             −0.34306                       −1.341049            −1.341002
50%                  −1.348536             −0.317237                      −0.050375            −0.013265
75%                  −0.953141             −0.240563                      −0.050259            −0.003564
Maximum              −0.774304             −0.171142                      −0.019784            −0.001545

*SVR (C = 1000.0, cache_size = 200, class_weight = None, coef0 = 0.0,
decision_function_shape = 'ovr', degree = 3, gamma = 0.1, kernel = 'rbf',
max_iter = −1, probability = True, random_state = None, shrinking = True,
tol = 0.001, verbose = False)

Fig. 9. California HRGC casualty prediction for the next 10 years (2017–2026).
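Scikit-learn scorers follow a greater-is-better convention, so error metrics are
reported with their sign flipped (for example neg_mean_absolute_error), which would
explain the negative entries in Table 4. Under that reading, the best model is the
one whose mean score is largest, i.e. closest to zero. The sketch below applies this
rule to the fold means copied from Table 4; the metric keys are illustrative names,
and ET is left out because two of its means are not legible in the table.

```python
# Mean negated error over the five folds, copied from the "Mean" rows of Table 4.
# Keys such as "neg_mae" are labels chosen for this sketch.
mean_scores = {
    "Gradient boosting": {"neg_mae": -1.873924, "neg_msle": -0.435331, "neg_mse": -1.068924},
    "RF":                {"neg_mae": -1.814649, "neg_msle": -0.405415, "neg_mse": -0.690739},
    "SVR":               {"neg_mae": -1.233071, "neg_msle": -0.235713, "neg_mse": -0.022048},
    "Bagging":           {"neg_mae": -1.811455, "neg_msle": -0.404935, "neg_mse": -0.692795},
}

# With negated errors, "greater is better": pick the model with the largest mean.
best = {
    metric: max(mean_scores, key=lambda model: mean_scores[model][metric])
    for metric in ("neg_mae", "neg_msle", "neg_mse")
}
print(best)
```

On these values the same model is selected under all three error metrics, which is
consistent with the choice of SVR reported in the text.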
Based on the regression learning, the total number of casualties for the next 10
years was estimated to be $P_3$ = 1,791, involving a total of 1,372 HRGC accidents
with casualties. Recall the assumption that every HRGC will experience at least one
accident. On that basis, an estimated 20% of HRGC accidents would involve
casualties, and one casualty is anticipated in every four HRGC accidents.
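The two ratios quoted above can be cross-checked with simple arithmetic. The total
number of accidents is not stated in this passage, so the sketch infers it from the
20% share, treating that rounded figure as exact; the result is therefore only an
approximate consistency check.

```python
predicted_casualties = 1791   # total casualties over the next 10 years (from the text)
casualty_accidents = 1372     # HRGC accidents with casualties (from the text)
casualty_share = 0.20         # "an estimated 20%" (from the text, treated as exact)

# Implied total number of accidents if 1,372 is 20% of all accidents (assumption).
implied_accidents = casualty_accidents / casualty_share

# Casualties per accident overall, and per accident that involves a casualty.
casualties_per_accident = predicted_casualties / implied_accidents
casualties_per_casualty_accident = predicted_casualties / casualty_accidents

print(round(implied_accidents), round(casualties_per_accident, 3),
      round(casualties_per_casualty_accident, 3))
```

The overall rate comes out close to one casualty in four accidents, in line with the
figure quoted in the text.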
Figure 10 shows the regression parameter importance discussed earlier, and
there is a consensus on highway traffic (AADT) being the most important factor.
Before any major conclusions and recommendations were made, a systems-
action-management (SAM) approach was conducted with text analytics of high-
casualty HRGC accidents in California within the aforementioned period (2008 to 2017).
4.3. System-action-management analysis of HRGC accidents

The concept of SAM stems from the idea of solving problems from their root
causes before making top-level decisions [Pate-Cornell (1990)]. The concept has
been applied in several fields including, but not limited to, offshore safety,
aerospace and maritime engineering [Aven (2015)]. In order to make decision
recommendations on HRGC accidents, a qualitative SAM study was conducted, and its
results are presented to complement the quantitative analysis above. It is a
three-step process that involves:
• Basic events, such as traffic signals, driver perception and braking.
• Decisions and actions, i.e. identifying the human decisions, behaviors and actions
involved.
• Organizational culture, which is usually the root cause of recurrent problems
like HRGC accidents.

Fig. 10. Regression learning variable importance plots. (a) Random forest variable importance
and (b) Gradient boosting variable importance.
Specific highway-rail crossing accidents in California involving double-digit
casualties over the last decade were considered. Table 5 summarizes the major
attributes of each identified event. A summary of each accident can be obtained
from FRA [2018a], and the following observations can be drawn from the
reports:
• Most severe HRGC incidents involve trucks and trailers. Are these highway
vehicles too slow to maneuver or make decisions at highway crossings? Should long
vehicles be mandated to stop at every grade crossing, just like buses?
• The railroads' reports may be biased against highway users, who are often
blamed for either inattentiveness or deliberately violating traffic signs. Were there
obscured traffic lights or limited sight distances?
• What should be done to address highway-user inattentiveness, and how can a
traffic sign violator be stopped? Should additional physical barriers or grade
separation be created at hot spots?
• Report analysis shows that highway users were often traveling at low speeds
while trains were at high speeds during the examined HRGC accidents. Should HRGCs
therefore be considered in positive train control (PTC) for automatic slowdown?
• Is work ethic (long shifts/hours) a major cause of truck-driver inattentiveness at
HRGCs? What should truck companies and stakeholders be doing?
• How can automated trucks improve this process?

Table 5. Summary of double-digit-casualty HRGC accidents in California from 2008 to 2017.
Incident No.  Date (Month/Yr)  Railroad  Train type          Station/County            Casualty (Death)  Train speed  Vehicle speed/age  Track class
108137        May-08           BNSF      Passenger (AMTRAK)  Hanford/Kings             32(0)             70 mph       30 mph/40          4[60, 80]
116714        Aug-10           BNSF      Passenger (AMTRAK)  Wasco/Kern                30(0)             77 mph       5 mph/49           4[60, 80]
121160        Sep-11           BNSF      Passenger (AMTRAK)  Turlock/Stanislaus        22(0)             44 mph       0 mph/33           4[60, 80]
121308        Sep-11           BNSF      Passenger (AMTRAK)  Antioch/Contra Costa      57(0)             79 mph       – mph/35           5[80, 90]
125592        Oct-12           BNSF      Passenger (AMTRAK)  Hanford/Kings             47(0)             79 mph       30 mph/42          4[60, 80]
1282012       Jan-12           SCRT      Transit (SCRT)      FruitRidge/Sacramento     16(3)             41 mph       15 mph/62          3[40, 60]
40613         Apr-13           SCAX      Commuter (SCRT)     San Fernando/Los Angeles  22(0)             76 mph       5 mph/26           4[60, 80]
90616         Sep-16           SCAX      Commuter (SCRT)     Burbank/Los Angeles       16(0)             15 mph       0 mph/–            4[60, 80]

Fig. 11. SAM risk approach.
While the above provides a qualitative outlook on severe HRGC accidents in this
case study, a holistic synthesis is provided in the following discussion.
4.4. Discussion
The machine learning algorithms considered in this study are able to accurately
predict HRGC accidents and the corresponding number of casualties, if any. The
accuracy of the prediction can be as high as 98.9%, with an ROC score of 0.98.
A total of 15 explanatory variables, which include crossing attributes, highway
attributes, and both train and motor traffic features, were considered. The
analysis identified train speed, AheadAADT and BackAADT as the most
important predictors of HRGC accidents based on the accident data considered. The
total contribution of these three factors is over 55%, as illustrated in Fig. 8. The
accident prediction results are presented on a GIS map (Fig. 9). HRGC accidents
are marked by colored solid circles on the map. The severity of the accidents is
classified into low risk, moderate risk and high risk, represented by green,
yellow and red, respectively. The GIS map with prediction results provides an easy
and visually appealing way for transportation authorities or other stakeholders to
identify HRGC accident locations, hot spots and corresponding severities. The
results from casualty predictions over the next 10 years can assist infrastructure
or welfare managers in preparing insurance plans and safety/capital investment
programs based on well-founded numbers. How much should be spent to reduce
casualties? Should speed limits be reduced at hot spots, or should long vehicles be
mandated to stop at HRGCs? These are follow-up questions that can be addressed
based on the results of this study. Such information allows stakeholders to evaluate
the safety of each HRGC and implement an appropriate plan to reduce future
occurrences.
By comparing the past 10 years of HRGC accident data with the prediction results,
a few points can be concluded. Northern California HRGC casualties are likely to
reduce significantly if train speeds are reduced at crossings (Fig. 9). One way this
can be achieved is through PTC [Zhang et al. (2018)]. However, the situation at a
few HRGCs did not improve, perhaps due to increased traffic. In these locations,
grade separation is an option that can be closely examined. In some parts of
Southern California, casualty predictions are also reduced, especially north of
Santa Ana. The working attributes in these locations can be observed and
implemented in other locations. In Central California, e.g. in the San Joaquin
Valley, accidents and casualties are reduced in general. However, more high-risk
HRGC accidents are predicted for Coalinga and Delano. This can be attributed
to heavy truck activity around the region. These and similar regions present
opportunities for automated or self-driving trucks, whose safety promises are
likely to surpass human capabilities. In the meantime, however, truck-driver
welfare should be critically examined if the recurrent “highway-user inattentiveness”
cause cited in FRA reports is to be addressed, as evident from the qualitative SAM
study.
A closer look at Fig. 9 also shows that some locations without HRGC casualties
from 2008 to 2017 are not exempt from future casualty accidents. For example, the
line from Edwards Air Force Base to Ridgecrest in Southern California is predicted
to have four low-severity HRGC casualties over the next 10 years. This prognosis
negates the frequentist assumption that future accidents are only due to past
occurrences. Unfortunately, the current FRA HRGC accident prediction is based
on this assumption [FRA (2018c)]. With the approach implemented in this study,
the Web Accident Prediction System (WBAPS) could be improved.
Lastly, from visual inspection (Figs. 4 and 9), it is evident that both the 10-year
accident history and the predicted casualties occur more frequently in densely
populated areas where AADT is higher, although high-casualty locations vary between
the two maps because of the stochastic nature of the attributes involved. The
results of this analysis could be improved by adding heavy-vehicle percentages for
each road because the SAM analysis shows that vehicles like buses and heavy trucks
have a higher likelihood of high-severity casualties than others.
5. Concluding Remarks
In this research, the authors present a hybrid approach to analyze and predict HRGC
accidents. This quantitative approach is complemented by a qualitative SAM study
before decision-making recommendations are provided. The study considers California
as a case study because it showcases a large rail infrastructure with a diverse mix
of heavy lines and rural and urban freight/passenger train services, as well as an
ample mix of highway congestion and remote locations.
A major shortcoming of past works is that they assumed a distribution for count
casualty data when predicting HRGC accidents. This assumption often restricts
performance testing to the training set, with very little guarantee of high test
performance [Sellers et al. (2017)]. This study examines the use of different
machine learning predictions without relying on the stated assumption. Moreover,
the embedded implementation of these techniques to predict future casualties in GIS
provides an interactive decision-making tool for stakeholders to further improve the
safety of HRGCs, making it a modest effort to introduce geospatial data science in
transportation safety engineering.
In conclusion, this is a first step in introducing a comprehensive GIS and machine
learning approach to HRGC safety at the intersection of highway and railway
engineering, in contrast to traditional HRGC accident analysis. Future work would
consider learning the sensitivities associated with safety investments and policies,
and their direct implications for HRGC casualty reduction, by examining a history
of safety programs from relevant sources and the corresponding safety data.
Acknowledgments
The authors appreciate the contributions of Professor Rachel Davidson
towards the success of this study.
References
Ananth, C. V. and Kleinbaum, D. G. (1997). Regression models for ordinal responses:
A review of methods and applications. Technical Report 6. Available at:
https://faculty.washington.edu/heagerty/Courses/b571/homework/Ananth-Kleinbaum-1997.pdf.
ArcMap, E. (2018). ArcMap — ArcGIS Desktop. Available at: http://desktop.arcgis.
com/en/arcmap/.
Attoh-Okine (2017). Big Data and Differential Privacy. Wiley Series in Operations
Research and Management Science.
Austin, R. D. and Carson, J. L. (2002). An alternative accident prediction
model for highway-rail interfaces. Technical Report. Available at: www.elsevier.
com/locate/aap.
Aven, T. (2015). Risk assessment and risk management: Review of recent advances on
their foundation, Eur. J. Operat. Res. 253: 1–13. Available at: http://dx.doi.org/
10.1016/j.ejor.2015.12.023.
Caltrans (2019). Caltrans GIS Data Library. Available at: http://www.dot.ca.gov/
hq/tsip/gis/datalibrary/.
Erdogan, S., Yilmaz, I., Baybura, T. and Gullu, M. (2008). Geographical information
systems aided traffic accident analysis system case study: City of Afyonkarahisar,
Accid. Anal. Prev. 40: 174–181. doi:10.1016/j.aap.2007.05.004.
Faghri, A. and Demetsky, M. J. (1980). A comparison of formulae for predicting
rail-highway crossing hazards. Transport. Res. Rec. J. Transp. Res. Board, 1114.
Available at: http://onlinepubs.trb.org/Onlinepubs/trr/1987/1114/1114-016.pdf.
FRA (2018a). 3.11 — Accident detail report. Technical Report. Available at: https://
safetydata.fra.dot.gov/OfficeofSafety/publicsite/Query/incrpt.aspx.
FRA (2018b). 5.14 — Hwy rail accident incident summary by railroad. Technical Report.
Federal Railroad Administration. Available at: https://safetydata.fra.dot.gov/
OfficeofSafety/publicsite/Query/HwyRailAccidentSummaryByRR.aspx.
FRA (2018c). Annual web accident prediction system. Technical Report. Federal Railroad
Administration Highway-Rail Crossing Safety & Trespass Prevention.

FRA, S. (2019). 3.10 — Accident causes, train accidents by cause from FRA
F 6180.54. Available at: https://safetydata.fra.dot.gov/OfficeofSafety/publicsite/
Query/inccaus.aspx.
Ghomi, H., Bagheri, M., Fu, L. and Miranda-Moreno, L. F. (2016). Analyzing injury
severity factors at highway railway grade crossing accidents involving vulnerable
road users: A comparative study. Traffic Inj. Prev., 17: 833–841. Available at:
https://www.tandfonline.com/action/journalInformation?journalCode=gcpi20,
doi:10.1080/15389588.2016.1151011.
Gilles, L. (2014). Understanding random forests from theory to practice. Ph.D.
Thesis, University of Liege. Available at: https://arxiv.org/pdf/1407.7502.pdf,
arXiv:1407.7502v3.
Hao, W., Kamga, C. and Wan, D. (2016). The effect of time of day on driver’s injury sever-
ity at highway-rail grade crossings in the United States. J. Traffic Transp. Eng.
(Engl. Ed.), 3: 37–50. Available at: http://dx.doi.org/10.1016/j.jtte.2015.10.006,
doi:10.1016/j.jtte.2015.10.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning.
Vol. 1. Springer. Available at: http://www.springerlink.com/index/10.1007/b94608,
arXiv:1010.3003.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Sta-
tistical Learning, 1st edn. Springer Texts in Statistics. Available at: http://www-
bcf.usc.edu/{∼}gareth/ISL/ISLR First Printing.pdf, doi:10.1007/978-1-4614-7138-
7, arXiv:1011.1669v3.
Khan, I. U. and Lee, E. (2018). Developing a highway rail grade crossing accident prob-
ability prediction model: A north dakota case study. MDPI Open Access J. Safety,
1–12. doi:10.3390/safety4020022.
Lasisi, A. and Attoh-Okine, N. (2018). Principal components analysis and track qual-
ity index: A machine learning approach. Transport. Res. Part C, Emerg. Tech-
nol., 91: 230–248. Available at: https://www.sciencedirect.com/science/article/
pii/S0968090X18304303, doi:10.1016/J.TRC.2018.04.001.
Lauren, B. (2017). Machine Learning in ArcGIS. Available at: https://www.esri.
com/arcgis-blog/products/arcgis-pro/analytics/machine-learning-in-arcgis/.
Lord, D. and Mannering, F. (2010). The statistical analysis of crash-frequency data: A
review and assessment of methodological alternatives. Transport. Res. Part A. Avail-
able at: https://ac.els-cdn. com/S0965856410000376/1-s2.0-S0965856410000376-
main.pdf? tid=6bb8c0b9-b68c-4a2b-bd8d-e21f9d2cedb8&acdnat=1549919900 2d69
ff7617f08ffaa19bb2c32aac5993, doi:10.1016/j.tra.2010.02.001.
Lu, P. and Tolliver, D. (2016). Accident prediction model for public highway-rail
grade crossings. Accid. Anal. Prev. 90: 73–81. Available at: http://dx.doi.org/
10.1016/j.aap.2016.02.012, doi:10.1016/j.aap.2016.02.012.
Lu, P. and Zheng, Z. (2017). Accident Prediction for Highway-Rail Grade Crossings: A
Model Comparison of Decision Tree and Neural Network.
Marsland, S. (2015). Machine Learning: An Algorithmic Perspective. 2nd edn. Chapman
& Hall/CRC, CRC Press, Boca Raton.
Mennecke, B. E. and Crossland, M. D. (1996). Geographic Information Systems: Applica-
tions and Research Opportunities for Information Systems Researchers. Proc. 29th
Annual Hawaii Int. Conf. System Sciences, Waitea, HI, USA.
Miaou, S.-P., Song, J. J. and Song, J. J. (2005). Bayesian ranking of sites for engineering
safety improvements: Decision parameter, treatability concept, statistical criterion,
and spatial dependence. Accid. Anal. Prev., 37: 699–720. Available at: https://ac.els-
cdn.com/S0001457505000497/1-s2.0-S0001457505000497- main.pdf? tid=0f6b2960-
e04a-434b-af0b-545b81ee2322&acdnat=1549661509 3e64e28a8fcf71266cb8718529b7
7bd6, doi:10.1016/j.aap.2005.03.012.
Ogden, B. D. (2007). Railroad Highway Grade Crossing Handbook. Transportation US,
Department of Federal Highway Administration.
Oh, J., Washington, S. P. and Nam, D. (2006). Accident prediction model for railway-
highway interfaces. Accid. Anal. Prev., 38: 346–356. doi:10.1016/j.aap.2005.10.004.
Panchanathan, S. and Faghri, A. (1995). Artificial Intelligence and Geographical
Information. Transport. Res. Rec. J. Transp. Res. Board, 1114. Available at:
http://onlinepubs.trb.org/Onlinepubs/trr/1995/1497/1497-012.pdf.
Pate-Cornell, M. E. (1990). Organizational Aspects of Engineering Systems Safety: The
case of Offshore Platforms. J. Risk Anal.
Peabody, L. and Dimmick, T. (1941). Accident Hazards at Grade Crossings. Public Roads,
22: 12–130.
Perrot, É. D. F. P. G. V. A. (2011). Scikit-learn: Machine Learning in Python. Available
at: http://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf.
Raub, R. A. (2007). Examination of Highway–Rail Grade Crossing Collisions Nation-
ally from 1998 to 2007. Transport. Res. Rec. J. Transport. Res. Board, 63–71.
doi:10.3141/2122-08.
Rohit Singh (2018). How we did it: Integrating ArcGIS and deep learning at
UC 2018. Available at: https://www.esri.com/arcgis-blog/products/api-python/
analytics/how-we-did-it-integrating-arcgis-and-machine-learning-at-uc-2018/.
Sellers, K. F., Swift, A. W. and Weems, K. S. (2017). A flexible distribution class
for count data. J. Stat. Distrib. Appl., 4: 22. Available at: https://jsdajournal.
springeropen.com/track/pdf/10.1186/s40488-017-0077-0, doi:10.1186/s40488-017-
0077-0.
Skinner, R. E., Barry, T. F. and Berry, B. J. (1997). Transportation Research Board
Executive Committee 1999 Officers, National Cooperative Highway Research
Program. Technical Report. Available at: http://onlinepubs.trb.org/onlinepubs/
nchrp/nchrp syn 271.pdf.
Urbanowicz, R. J. and Moore, J. H. (2015). ExSTraCS 2.0: Description and evaluation of
a scalable learning classifier system. Evol. Intell., 8: 89–116. Available at: http://
link.springer.com/10.1007/s12065-015-0128-8, doi:10.1007/s12065-015-0128-8.
USDOT (2018a). Direct Download of National Transportation Atlas Database
(NTAD) Geospatial Files — Bureau of Transportation Statistics. Available at:
https://www.bts.gov/geography/geospatial-portal/NTAD-direct-download.
USDOT (2018b). Safety and Health — US Department of Transportation. Available at:
https://www.transportation.gov/policy/transportation-policy/safety.
Wright, R. (2016). Integration of Grade Crossing Data into FRA’s GIS Program 2016 ESRI
Rail Summit. Technical Report. Available at: https://www.esri.com/events/rail-
summit/{∼}/media/B9A885BB7CBA4DD0A7C607E5EA384D71.ashx.
Yan, X., Richards, S. and Su, X. (2010). Using hierarchical tree-based regression model
to predict train-vehicle crashes at passive highway-rail grade crossings. Accid. Anal.
Prev., 42: 64–74. Available at: https://ac.els-cdn.com/S0001457509001687/1-s2.0-
S0001457509001687-main.pdf? tid=6e062379-14c7-492d-883a-bb9e8fc0161f&acdnat
= 1550079562 6da1f2263e88f713453e4fbf769f3bfb, doi:10.1016/j.aap.2009.07.003.
Zhang, Z., Liu, X. and Holt, K. (2018). Positive Train Control (PTC) for railway
safety in the United States: Policy developments and critical issues. Util. Pol-
icy, 51: 33–40. Available at: https://doi.org/10.1016/j.jup.2018.03.002, doi:10.1016/
j.jup.2018.03.002.
Zhao, S. and Khattak, A. (2015). Motor vehicle drivers’ injuries in train-motor vehi-
cle crashes. Accid. Anal. Prev., 74: 162–168. Available at: http://dx.doi.org/
10.1016/j.aap.2014.10.022, doi:10.1016/j.aap.2014.10.022.