A Visual-Analytics System For Railway Safety Management
52 September/October 2014 Published by the IEEE Computer Society 0272-1716/14/$31.00 © 2014 IEEE
Authorized licensed use limited to: Indian Institute of Management Mumbai. Downloaded on March 07,2024 at 14:00:19 UTC from IEEE Xplore. Restrictions apply.
railway personnel’s failure to comply with safety regulations.

So, what are the odds of predicting railway accidents? We’re far from being able to answer this question. However, railway companies can increase risk awareness on the basis of sophisticated data analytics. This means they can exploit the data they have to fill gaps in their knowledge. Thus, we can rephrase the previous question as, what are the run-over hot spots along the railway lines?

Our data analytics workflow had five stages. In our description of these stages and throughout the rest of the article, we don’t give the real values or describe the time period we used, for privacy reasons.

The first stage was descriptive analysis. To characterize incidents, we initially employed 57 attributes; each incident type had a related attribute. We employed this attribute as a target variable for further classification or prediction learning. So, this stage gave us an overall picture of the incident data. The data records came from the internal information system related to incidents that occurred on one of Vale’s railways in Brazil within a six-year time frame. We focused on incidents related to run-overs. To prepare the data for classification, we also had to properly label the incidents so that we could distinguish positive cases (run-overs) and negative cases (almost run-overs). We did this under the guidance of Vale’s safety engineers.

The second stage was feature selection, which was based on a correction heuristic strategy using the random-forest method. We carried out the data analytics with the caret (Classification and Regression Training) R package. We selected the 13 covariates that most influenced the outcome; among them, only six were spatial. We confined our study to those six. One of the 57 attributes was the target or class attribute (run-overs). We labeled that attribute “Yes” when the data record was related to a run-over and “No” when it wasn’t.

The third stage was knowledge discovery. Because we were interested in identifying the variables that could discriminate positive and negative cases, we applied machine learning. The choice of the proper machine-learning method was a trade-off between ease of understanding and precision. To build the best classification model, we investigated techniques such as recursive partitioning trees, conditional trees, and generalized linear models. This first evaluation showed that the random-forest method performed best (85 percent accuracy, with a 95 percent confidence interval of 0.81 to 0.87). We had no straightforward way to calculate each covariate’s effects. On the other hand, we could explore the predictors’ weighting factors by using a multiple-linear-regression model (75 percent accuracy, with a 95 percent confidence interval of 0.72 to 0.77). This let us determine each covariate’s corresponding odds ratio values. We rounded them up to their full values, which became each variable’s weight when we compiled the index.

The fourth stage was model building and evaluation. For this stage, we performed a hold-out phase. (A hold-out is a model evaluation scheme that machine-learning techniques sometimes employ; it helps avoid overfitting.) We split the data into two sets: 70 percent for training and 30 percent for posterior validation. We built the model by using cross-validation on the training set.

The final stage was creation and statistical analysis of the risk index. We created the index from the multiple-linear-regression model. Given the previous weighting factors, we needed only to calculate a normalized risk index by taking into account one of the six covariates (let’s say, the city).

The run-over risk index per city provided an interesting measure for situation awareness along the railway. However, business analysts couldn’t use it to explore what-if scenarios for better elaboration of safety policies. In fact, when the safety engineers saw the risk index, they asked us to integrate this information with socioeconomic information, as we mentioned earlier. They wanted to understand how the socioeconomic information correlated with the incident risk index. To support them, we integrated visualization and data analytics.

Visual Mapping of the Data

When almost all the data contain geographic-location information, visualization on a map is the most reasonable choice. We could have represented the railway in several ways. For example, we could have employed a linear representation, showing only the railway track against a simple monochrome background. Instead, we used a representation showing the track from a satellite image, with the surrounding area forming the background. This representation gave a more realistic view of the affected area.
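As a rough illustration of the evaluation scheme in the fourth stage of the workflow (a 70/30 hold-out split, with cross-validation run only on the training portion), the following Python sketch uses the standard library alone. It is not the caret-based procedure used in the study. The record layout, the fixed seed, the choice of five folds, and the helper names `holdout_split` and `k_fold_indices` are all assumptions made for the example.

```python
import random


def holdout_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_items, k=5):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


# Hypothetical incident records: (record id, label), where "Yes" marks a run-over.
records = [(i, "Yes" if i % 3 == 0 else "No") for i in range(100)]

# 70 percent for training, 30 percent held out for later validation.
train, validation = holdout_split(records)

# Cross-validation folds are drawn from the training set only,
# so the held-out records never influence model building.
folds = list(k_fold_indices(len(train), k=5))
```

Keeping the validation records out of the cross-validation loop is what lets the final accuracy estimate act as a check against overfitting.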
Business Intelligence Analytics
Figure 2. The cities and their associated risk index. The railway track is painted in black.
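One way a per-city index of this kind could be assembled is sketched below in Python. The article states only that odds ratios were rounded up to whole numbers to serve as weights and that the index was normalized per city. Everything else here is an assumption for illustration: the covariate names, the odds-ratio values, summing the weights per incident, and scaling by the highest-scoring city.

```python
import math
from collections import defaultdict


def covariate_weights(odds_ratios):
    """Round each odds ratio up to the next whole number to get an integer weight."""
    return {name: math.ceil(value) for name, value in odds_ratios.items()}


def normalized_risk_index(incidents, weights):
    """Sum the weights of the covariates flagged on each incident, per city,
    then scale so that the highest-scoring city has an index of 1.0."""
    raw = defaultdict(float)
    for incident in incidents:
        raw[incident["city"]] += sum(weights[name] for name in incident["active"])
    top = max(raw.values(), default=0.0)
    if top == 0:
        return {city: 0.0 for city in raw}
    return {city: total / top for city, total in raw.items()}


# Hypothetical odds ratios (illustrative values, not the study's results).
odds_ratios = {"covariate_a": 1.2, "covariate_b": 2.6, "covariate_c": 0.4}
weights = covariate_weights(odds_ratios)

# Hypothetical incident records: the city plus the covariates present on each record.
incidents = [
    {"city": "City A", "active": ["covariate_a", "covariate_b"]},
    {"city": "City A", "active": ["covariate_c"]},
    {"city": "City B", "active": ["covariate_b"]},
]
risk_index = normalized_risk_index(incidents, weights)
```

Scaling by the maximum keeps every city between 0 and 1, which suits a map legend; other normalizations (for example, per track kilometer or per inhabitant) would be equally plausible readings of the text.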
Figure 3. Comparing the risk index with the Gini index. This index is a measurement of statistical dispersion
that represents the income distribution of a nation’s residents.
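For readers unfamiliar with the Gini index, the short Python sketch below computes it from a list of incomes using the standard rank-based formula, G = 2 Σ(i · x_i) / (n Σ x_i) − (n + 1) / n, with incomes sorted in ascending order. The income lists are hypothetical; the system described here displays published socioeconomic figures and is not said to compute the index itself.

```python
def gini_coefficient(incomes):
    """Return the Gini coefficient of non-negative incomes (0 means perfect equality)."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0 or values[0] < 0:
        raise ValueError("need non-negative incomes with a positive total")
    # Weight each income by its 1-based rank in the ascending ordering.
    weighted_sum = sum(rank * value for rank, value in enumerate(values, start=1))
    return (2 * weighted_sum) / (n * total) - (n + 1) / n


# Hypothetical populations: one perfectly equal, one where a single person holds everything.
equal_city = gini_coefficient([5, 5, 5, 5])
unequal_city = gini_coefficient([0, 0, 0, 10])
```

With four residents, the fully concentrated case yields 0.75 rather than 1.0, because this sample formula tops out at (n − 1) / n.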
Figure 4. The ratio between the rate of literacy and illiteracy and the risk index. The pie charts maintain the
original proportions and show both classes (for example, urban versus nonurban areas).
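The difference between the two encodings can be made concrete with a small Python sketch. The color bar is described as relative (scaled between 0 and 1 from the lowest to the highest value across cities), whereas a pie chart keeps each city's own two-class proportions. The city names and percentages are hypothetical, and the helper names are not part of plotGoogleMaps.

```python
def min_max_normalize(values_by_city):
    """Rescale values so the lowest city maps to 0.0 and the highest to 1.0."""
    lowest = min(values_by_city.values())
    highest = max(values_by_city.values())
    if highest == lowest:
        # All cities are identical; there is no spread to show.
        return {city: 0.0 for city in values_by_city}
    span = highest - lowest
    return {city: (value - lowest) / span for city, value in values_by_city.items()}


def class_proportions(positive, negative):
    """Return the (positive, negative) shares of one city, as drawn in a pie chart."""
    total = positive + negative
    if total <= 0:
        raise ValueError("the two classes must sum to a positive total")
    return positive / total, negative / total


# Hypothetical literacy rates (percent of residents who are literate).
literacy = {"City A": 60.0, "City B": 80.0, "City C": 100.0}

# Color bar: values are relative to the other cities on the map.
color_bar = min_max_normalize(literacy)

# Pie chart: City B keeps its own literate/illiterate split.
pie_city_b = class_proportions(80.0, 20.0)
```

In this example City A appears at the bottom of the color bar (0.0) even though 60 percent of its residents are literate, which is the kind of distinction the pie chart preserves.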
Originally, we employed pie charts instead of colors in contexts in which the visualization needed to clearly represent only two classes of information: “positive” (literate or urban) versus “negative” (illiterate or nonurban) factors. Pie charts could also be applied to energy, drinking water, and public waste disposal services. The plotGoogleMaps package also supports pie charts, which are suitable for dealing with ratios. Figure 4 shows the literacy/illiteracy ratio as reflected in the risk index.

However, when the two classes have similar proportions, using a color bar instead of a pie chart could aid users’ visualization and understanding. So, users can choose between a pie chart and color bar. In this case, the color bar (as in Figure 3) is always relative (normalized between 0 and 1 from the lower to the greater values, showing the “good” value: the percentage of the urbanized area, literate people, and so on). In contrast, the pie chart (as in Figure 4) maintains the original proportion and shows both classes (for example, urban versus nonurban areas).

Figure 4 gives a good picture of what users can draw on to define safety policies to reduce the number of incidents. For instance, policies that depend on written media to prevent run-overs might be ineffective because many people in the country are illiterate. Furthermore, the specialists could see a trend in which every city with a high incident risk index also had a high illiteracy rate, although exceptions to this pattern existed.

So, what sets the outlier cities apart? The answer might prompt further interesting ideas. However, the specialists might interpret the data in unexpected ways. The results of using a visualization to support the definition of policies depend heavily on the specialists’ knowledge, experience, and intuition.

Furthermore, users can explore detailed data for a city by clicking on its area. These details include the socioeconomic data already mentioned (the Gini index, population size, and so on). Users can examine this extra information without
We believe our results are applicable to other domains involving georeferenced data because such data can be represented with color schemes, polygons, and markers. (However, the polygons must be visible to users.) Our experience building our system leads us to offer three tips to practitioners.

First, match the visualization with the users’ knowledge and domain. As we explained before, we adopted a map representation because it was more intuitive for the intended users. They were already used to this type of representation and therefore wouldn’t need to learn a new visualization scheme.

Second, get early feedback. We first built a simple visualization; then, on the basis of user feedback, we improved the system. Getting early feedback was important to ensure we were meeting the users’ expectations before we committed more resources designing and implementing them.

Finally, add visualizations as needed. Instead of developing a system with several types of visualizations from the beginning, we first plotted one dimension of georeferenced data on maps at a time. When users needed to see the correlation of different data dimensions, we added that information to the visualization by using polygons and markers on Google Maps.

Future studies will focus on integrating other information sources and other incident types besides run-overs. So far, the positive feedback we’ve received suggests that our system can be of great value to safety engineers and specialists in both making and communicating decisions.

Acknowledgments

We thank our Vale collaborators for their thought-provoking insights and help with our visual-analytics system’s design.

References

1. H. Fukuda, “A Study on Incident Analysis Method for Railway Safety Management,” Quarterly Report Railway Technical Research Inst., vol. 43, no. 2, 2002, pp. 83–86.
2. J.J. Thomas and K.A. Cook, Illuminating the Path: The Research and Development Agenda for Visual Analytics, IEEE CS, 2005.
3. D. Borland and R.M. Taylor, “Rainbow Color Map (Still) Considered Harmful,” IEEE Computer Graphics and Applications, vol. 27, no. 2, 2007, pp. 14–17.
4. M. Kilibarda and B. Bajat, “PlotGoogleMaps: The R-Based Web-Mapping Tool for Thematic Spatial Data,” Geomatica, vol. 66, no. 1, 2012, pp. 37–49.
5. D. Borland and A. Huber, “Collaboration-Specific Color-Map Design,” IEEE Computer Graphics and Applications, vol. 31, no. 4, 2011, pp. 7–11.

Wallace P. Lira is a software engineer in the Vale Institute of Technology’s Environmental Informatics Lab. His research interests are visual analytics and software quality. Lira received his MSc in computer science from the Federal University of Pará. Contact him at wallace.lira@gmail.com.

Ronnie Alves is an adjunct researcher in the Vale Institute of Technology’s Environmental Informatics Lab. His research interests are data mining, machine learning, and bioinformatics. Alves received a PhD in computer science from the University of Minho. Contact him at ronnie.alves@itv.org.

Jean M.R. Costa is a PhD student in Cornell University’s Department of Information Science. At Cornell, he has been working on research projects related to mobile sensing and social network analysis. He was with the Vale Institute of Technology when the research in this article was performed. Costa received an MSc in computer science from the Federal University of Pará. Contact him at jmd487@cornell.edu.

Gustavo Pessin is an assistant researcher in the Vale Institute of Technology’s Environmental Informatics Lab. His research interests are mobile robotics, machine learning, and wireless sensor networks. Pessin received his PhD in computer science from the University of São Paulo’s Institute of Mathematics and Computer Science. Contact him at gustavo.pessin@itv.org.

Lilyan Galvão is an MSc student in sustainable use of natural resources for tropical regions at the Vale Institute of Technology. Her research interests are urban morphology and informal settlements. Galvão received a BSc in architecture and urbanism from the Federal University of Pará. Contact her at lilyangalvaos@gmail.com.

Ana C. Cardoso is an associate professor in the Federal University of Pará’s Public Policies Department. She was with the Vale Institute of Technology when the research in this article was performed. Her research interests are urban planning, urban studies, and urban design. Cardoso received a PhD in architecture from Oxford Brookes University. Contact her at aclaudiacardoso@gmail.com.

Cleidson R.B. de Souza is a research manager at the Vale Institute of Technology and an associate professor at the Federal University of Pará. His research interests are computer-supported cooperative work and software engineering. De Souza received a PhD in information and computer sciences from the University of California, Irvine. Contact him at cleidson.desouza@acm.org.