
Business Intelligence Analytics

A Visual-Analytics System for Railway Safety Management
Wallace P. Lira, Ronnie Alves, Jean M.R. Costa, Gustavo Pessin, Lilyan Galvão,
Ana C. Cardoso, and Cleidson R.B. de Souza ■ Vale Institute of Technology

Using a data analytics workflow, this system compiles an incident risk index that processes information about incidents along railway lines. It displays the index on a geographical map, together with socioeconomic information about the associated towns and cities.

The working environment of railways is challenging and complex and often has employees performing high-risk operations. Furthermore, because railway lines cover long distances, they pose risks to the many people living alongside them. So, rail companies proactively adopt strategies to minimize risk to their employees and the public. These strategies involve trained safety personnel, advanced forms of technology (for example, radio and cameras), and special work processes and practices. Nevertheless, incidents still occur.

Hisaji Fukuda suggested that to avoid railway incidents, it’s essential to systematically collect information related to failures and problems, predict the likelihood of accidents, and propose countermeasures [1]. Vale, a multinational mining corporation and one of the largest logistics operators in Brazil, is implementing this strategy. Vale takes steps to ensure that railway incident data are collected, stored, and then analyzed by safety engineers, who subsequently can address the problem of improving operational safety.

We examined the data in Vale’s incident database to find patterns of safety incidents that would help us classify those incidents. On the basis of this knowledge, we established an incident risk index, which we believe could help improve management of safety policies.

When the safety engineers saw our results, they told us that this issue was sociotechnical. That is, both social and technical factors can influence railway operational safety. (Examples of social factors are the behaviors of train passengers and of unauthorized people who trespass on the tracks [1].) So, the engineers asked us to enhance our results with a visualization of socioeconomic factors related to the towns and cities along the railway lines. They argued that this would let them identify patterns, acquire valuable knowledge, and make decisions about different scenarios. In other words, they wanted a visual-analytics (VA) system.

VA is “the science of analytical reasoning supported by interactive visual interfaces” [2]. VA techniques can be used to make sense of large-scale, complex, ambiguous, and often conflicting data. The VA system we created uses machine learning to compute the risk index. This index is accompanied by geographic information (maps) representing the locations of towns and cities along the railways, together with relevant socioeconomic information. We designed the system to effectively formulate ideas about the relationship between the socioeconomic data and incident occurrences. Furthermore, users should be able to employ the system to support policies aimed at reducing the number of incidents at a particular location.

The Classification Model
Railway incidents have a huge social and economic impact on society and railway companies. One of the most serious types of incidents is run-overs. They have many causes, including wear and tear, malfunctioning equipment, pedestrians’ failure to heed warning signals, insufficient information to allow the public to be aware of the risks, and

52 September/October 2014 Published by the IEEE Computer Society 0272-1716/14/$31.00 © 2014 IEEE

Authorized licensed use limited to: Indian Institute of Management Mumbai. Downloaded on March 07,2024 at 14:00:19 UTC from IEEE Xplore. Restrictions apply.
railway personnel’s failure to comply with safety regulations.

So, what are the odds of predicting railway accidents? We’re far from being able to answer this question. However, railway companies can increase risk awareness on the basis of sophisticated data analytics. This means they can exploit the data they have to fill gaps in their knowledge. Thus, we can rephrase the previous question as, what are the run-over hot spots along the railway lines?

Our data analytics workflow had five stages. In our description of these stages and throughout the rest of the article, we don’t give the real values or describe the time period we used, for privacy reasons.

The first stage was descriptive analysis. To characterize incidents, we initially employed 57 attributes; each incident type had a related attribute. We employed this attribute as a target variable for further classification or prediction learning. So, this stage gave us an overall picture of the incident data. The data records came from the internal information system related to incidents that occurred on one of Vale’s railways in Brazil within a six-year time frame. We focused on incidents related to run-overs. To prepare the data for classification, we also had to properly label the incidents so that we could distinguish positive cases (run-overs) and negative cases (almost run-overs). We did this under the guidance of Vale’s safety engineers.

The second stage was feature selection, which was based on a correction heuristic strategy using the random-forest method. We carried out the data analytics with the caret (Classification and Regression Training) R package. We selected the 13 covariates that most influenced the outcome; among them, only six were spatial. We confined our study to those six. One of the 57 attributes was the target or class attribute (run-overs). We labeled that attribute “Yes” when the data record was related to a run-over and “No” when it wasn’t.

The third stage was knowledge discovery. Because we were interested in identifying the variables that could discriminate positive and negative cases, we applied machine learning. The choice of the proper machine-learning method was a trade-off between ease of understanding and precision. To build the best classification model, we investigated techniques such as recursive partitioning trees, conditional trees, and generalized linear models. This first evaluation showed that the random-forest method performed best (85 percent accuracy, with a 95 percent confidence interval of 0.81 to 0.87). We had no straightforward way to calculate each covariate’s effects. On the other hand, we could explore the predictors’ weighting factors by using a multiple-linear-regression model (75 percent accuracy, with a 95 percent confidence interval of 0.72 to 0.77). This let us determine each covariate’s corresponding odds ratio values. We rounded them up to their full values, which became each variable’s weight when we compiled the index.

The fourth stage was model building and evaluation. For this stage, we performed a hold-out phase. (A hold-out is a model evaluation scheme that machine-learning techniques sometimes employ; it helps avoid overfitting.) We split the data into two sets: 70 percent for training and 30 percent for posterior validation. We built the model by using cross-validation on the training set.

The final stage was creation and statistical analysis of the risk index. We created the index from the multiple-linear-regression model. Given the previous weighting factors, we needed only to calculate a normalized risk index by taking into account one of the six covariates (let’s say, the city).

The run-over risk index per city provided an interesting measure for situation awareness along the railway. However, business analysts couldn’t use it to explore what-if scenarios for better elaboration of safety policies. In fact, when the safety engineers saw the risk index, they asked us to integrate this information with socioeconomic information, as we mentioned earlier. They wanted to understand how the socioeconomic information correlated with the incident risk index. To support them, we integrated visualization and data analytics.

Visual Mapping of the Data
When almost all the data contain geographic-location information, visualization on a map is the most reasonable choice. We could have represented the railway in several ways. For example, we could have employed a linear representation, showing only the railway track against a simple monochrome background. Instead, we used a representation showing the track from a satellite image, with the surrounding area forming the background. This representation gave a more realistic view of the affected area.
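
The per-city risk index shown on these maps comes from the final workflow stage. As a rough sketch of that step (not the authors’ implementation, which was written in R and whose covariates, weights, and normalization are not published), the Python below sums hypothetical integer weights over each city’s incident records and rescales the totals to the range 0 to 1. The covariate names, the weight values, and the choice of min-max normalization are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical integer weights standing in for the rounded odds ratios
# of the regression model (the real covariates and values are not published).
WEIGHTS = {"near_crossing": 3, "urban_section": 2, "near_station": 1}


def risk_index_by_city(records, weights=WEIGHTS):
    """Return a min-max normalized risk score per city.

    Each record is a dict with a "city" key and 0/1 flags for the covariates.
    """
    raw = defaultdict(float)
    for rec in records:
        raw[rec["city"]] += sum(w * rec.get(name, 0) for name, w in weights.items())

    if not raw:  # no records, nothing to score
        return {}
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:  # every city scores the same; avoid dividing by zero
        return {city: 0.0 for city in raw}
    return {city: (score - lo) / (hi - lo) for city, score in raw.items()}


# Made-up incident records for three cities.
records = [
    {"city": "A", "near_crossing": 1, "urban_section": 1},
    {"city": "A", "near_crossing": 1, "near_station": 1},
    {"city": "B", "urban_section": 1},
    {"city": "C", "near_station": 1},
]
index = risk_index_by_city(records)
```

With min-max normalization the index is relative: the city with the lowest weighted total always maps to 0 and the highest to 1, so the values rank cities against each other rather than measure absolute risk.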


We chose this design to humanize the problem by letting the visualization include socioeconomic data. Furthermore, by showing a realistic map, we could leverage the users’ prior knowledge about the region and enable the visualization to be a communication tool for the railway employees.

So, the visualization included colored maps to assist knowledge acquisition of the georeferenced data. However, this approach raised the question of how to color the maps to represent the data. Linear scales are common and provide clear visualization. Rainbow patterns are often used in meteorological systems, but the literature suggests they can sometimes be misleading [3]. We employed two distinct colors, which we felt would improve the visualization and allow better visual comparison of the results.

So, red indicated worse-than-average values, and blue indicated better-than-average values. Figure 1 shows a gradient of these colors. We created our gradient using the plotGoogleMaps R package, which provides a standard interface for maps [4]. We only had to specify the color scheme, which the RColorBrewer R package then carried out.

Figure 1. The gradients for (a) red and (b) blue [5]. Red indicates worse-than-average values; blue indicates better-than-average values.

Figure 2 shows the application’s simplest screen, which illustrates the degree of risk associated with run-overs in each city. We mapped the colors to the risk value, as Figure 2’s legend shows. We painted the full area of each town or city to contrast this area with the railway track, which we painted black. This lets the safety engineers, for example, visually compare the length of a railway line that passes through a city with the associated risk index.

In addition, socioeconomic information from the Brazilian Census Bureau about each town or city can be visualized and analyzed in conjunction with the risk index. For example, Figure 3 shows the relationship between the risk index and the Gini index. The Gini index is a measurement of statistical dispersion that represents the income distribution of a nation’s residents. (It can be used to explain a government’s policies and plans regarding a community.) A higher Gini index means that most of the income is concentrated in the hands of a few people. To represent the Gini index data, the visualization uses the same coloring it used to represent the risk index. This lets users determine whether locations with poor income distributions are more exposed to risk.

However, specialists must analyze this factor case by case because these data are highly context sensitive. Other socioeconomic variables might have a greater impact on the analysis results. So, users can also analyze the risk index’s correlation with

■ the population size,
■ the population density,
■ the human-development index,
■ the per capita income,
■ the urban growth rate,
■ access to electricity,
■ access to drinking water,
■ public waste disposal services,
■ the literacy/illiteracy ratio, and
■ the urban/nonurban population ratio.

Figure 2. The cities and their associated risk index. The railway track is painted in black.
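
A two-color scheme of this kind can be sketched as a diverging scale centered on the average value. The Python below is a minimal illustration and not the authors’ code, which relied on the plotGoogleMaps and RColorBrewer R packages; the fade-to-white interpolation and the function names are assumptions.

```python
def diverging_color(value, mean, max_dev):
    """Map a value to a hex color: red above the mean, blue below it.

    Higher values are assumed to be worse (as with the risk index).
    The color fades to white as the value approaches the mean.
    """
    if max_dev <= 0:
        return "#ffffff"
    strength = min(abs(value - mean) / max_dev, 1.0)  # 0 at the mean, 1 at the extreme
    fade = round(255 * (1 - strength))                # level of the two faded channels
    if value >= mean:
        return f"#ff{fade:02x}{fade:02x}"  # white to pure red
    return f"#{fade:02x}{fade:02x}ff"      # white to pure blue


def color_cities(values):
    """Return {city: hex color} for a {city: value} mapping."""
    mean = sum(values.values()) / len(values)
    max_dev = max(abs(v - mean) for v in values.values())
    return {city: diverging_color(v, mean, max_dev) for city, v in values.items()}


# Made-up risk values for three cities.
colors = color_cities({"A": 1.0, "B": 0.5, "C": 0.0})
```

Centering the scale on the mean lets a viewer read each city as above or below average at a glance.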

Figure 3. Comparing the risk index with the Gini index. This index is a measurement of statistical dispersion
that represents the income distribution of a nation’s residents.
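
For readers unfamiliar with the Gini index, the following Python sketch computes it from a list of individual incomes using the standard rank-based formula. It is illustrative only: the study took its Gini values from census data rather than computing them, and the income figures used in the examples are made up.

```python
def gini(incomes):
    """Gini coefficient of non-negative incomes.

    0 means perfect equality; values near 1 mean that income is
    concentrated in very few hands.
    """
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0 or xs[0] < 0:
        raise ValueError("need non-negative incomes with a positive total")
    # Rank-weighted sum over the sorted incomes (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


equal = gini([1, 1, 1, 1])          # everyone earns the same
spread = gini([1, 2, 3, 4])         # moderate inequality
concentrated = gini([0, 0, 0, 10])  # one person holds all the income
```

For a finite sample of n people the maximum is (n - 1) / n, which is why the last example yields 0.75 rather than 1.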

Figure 4. The ratio between the rate of literacy and illiteracy and the risk index. The pie charts maintain the
original proportions and show both classes (for example, urban versus nonurban areas).
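
The difference between the two encodings can be made concrete with a small Python sketch. It assumes that the color bar uses min-max rescaling across cities, which is one plausible reading of “normalized between 0 and 1”; the function names and literacy rates are hypothetical.

```python
def color_bar_values(good_share_by_city):
    """Rescale each city's "good" share to 0..1 relative to the other cities."""
    lo = min(good_share_by_city.values())
    hi = max(good_share_by_city.values())
    if hi == lo:  # all cities identical; nothing to spread out
        return {city: 0.0 for city in good_share_by_city}
    return {city: (v - lo) / (hi - lo) for city, v in good_share_by_city.items()}


def pie_slices(good_share):
    """Return the (good, bad) slices of a pie chart, keeping the original proportion."""
    return good_share, 1.0 - good_share


# Made-up literacy rates for three cities.
literacy = {"A": 0.80, "B": 0.90, "C": 0.85}
bar = color_bar_values(literacy)
pies = {city: pie_slices(share) for city, share in literacy.items()}
```

Here the three pies would look nearly alike (80, 85, and 90 percent literate), whereas the relative color bar spreads the same cities across the full scale at roughly 0, 0.5, and 1.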

Originally, we employed pie charts instead of colors in contexts in which the visualization needed to clearly represent only two classes of information: “positive” (literate or urban) versus “negative” (illiterate or nonurban) factors. Pie charts could also be applied to energy, drinking water, and public waste disposal services. The plotGoogleMaps package also supports pie charts, which are suitable for dealing with ratios. Figure 4 shows the literacy/illiteracy ratio as reflected in the risk index.

However, when the two classes have similar proportions, using a color bar instead of a pie chart could aid users’ visualization and understanding. So, users can choose between a pie chart and color bar. In this case, the color bar (as in Figure 3) is always relative (normalized between 0 and 1 from the lower to the greater values, showing the “good” value, such as the percentage of the urbanized area, literate people, and so on). In contrast, the pie chart (as in Figure 4) maintains the original proportion and shows both classes (for example, urban versus nonurban areas).

Figure 4 gives a good picture of what users can draw on to define safety policies to reduce the number of incidents. For instance, policies that depend on written media to prevent run-overs might be ineffective because many people in the country are illiterate. Furthermore, the specialists could see a trend in which every city with a high incident risk index also had a high illiteracy rate, although exceptions to this pattern existed.

So, what sets the outlier cities apart? The answer might prompt further interesting ideas. However, the specialists might interpret the data in unexpected ways. The results of using a visualization to support the definition of policies depend heavily on the specialists’ knowledge, experience, and intuition.

Furthermore, users can explore detailed data for a city by clicking on its area. These details include the socioeconomic data already mentioned (the Gini index, population size, and so on). Users can examine this extra information without


leaving the view they’re inspecting (see Figure 5). This functionality lets them obtain extra information without disrupting the visualization in which they’re immersed.

Figure 5. Details of the city Governador Valadares, seen in Figure 4. Users can obtain extra information without disrupting the visualization in which they’re immersed.

We designed our VA system to be extensible; that is, users can add other types of information to increase its effectiveness. For example, users could employ meteorological data to determine how weather conditions influence incidents. Furthermore, additional socioeconomic and temporal data can enhance our system by letting safety engineers disclose valuable knowledge and communicate it effectively to other stakeholders.

Users can explore what-if scenarios by simply contrasting the “agreement” between red and blue when two types of information are present (for example, the incident risk index versus the Gini index). We created the VA and data analytics parts of our system using the R programming language. We integrated all the maps in an HTML front-end container so that Vale safety experts could easily access and use them through Web browsers.

Evaluation
We conducted semistructured interviews with three Vale personnel who were potential users. One was an operational analyst, another was a safety engineer, and the last one had a leadership role in the IT department. All three participants had full access to the system and used it for several days before the interviews. We aimed to investigate

■ how they would use the system,
■ whether the system provided information that would help them in their decision making, and
■ what suggestions or comments they had on how to improve the system’s interface and usefulness.

The operational analyst stated he would use the system to more effectively visualize and communicate with local leaders and Vale’s policymakers about which towns or cities were in a serious situation and whether this correlated with any socioeconomic statistics. He found that the visualization’s color scheme was suitable. However, he also stated that “as long as there was a subtitle in the charts, one would not mind what colors were used in the visualization.”

The safety engineer stated he could use the system to explain the choice of a location for investigating socioeconomic factors of the people living alongside the railway line. He planned to use the system to aid his decision making.

All three participants asked us to find a way to import datasets besides the one from the Brazilian Census Bureau, so that they could compare risks. The operational analyst also asked whether we could refine both the indexes and visualization to also take neighborhoods into account. He argued that the cities were heterogeneous enough to justify this additional level of granularity. The Brazilian Census Bureau provides socioeconomic data linked to census-designed places. The census-designed places are usually smaller than the neighborhoods the first participant referred to. We’re working on a scheme to achieve this, which should make our system more accurate.

The IT leader asked us to include socioeconomic and temporal factors in the model. Temporal factors involve investigating whether incidents are more frequent at certain times. So, future versions of our system will likely include them.

The participants said they can already use the system to understand incident patterns and take corrective measures. Such measures include running safety campaigns, imposing new speed limits in specific regions, building traffic barriers, and recommending the building of overpasses.

As the participants’ feedback shows, much work remains to be done. We need to improve both the model and the method of visualization to further aid the specialists’ decision making. However, all the participants were thrilled with the system’s potential benefits and made valuable suggestions about how to improve it.

In our view, validation of our system could involve more decision makers in a formal study of usability, which would enhance the Vale employees’ feedback.

We believe our results are applicable to other domains involving georeferenced data because such data can be represented with color schemes, polygons, and markers. (However, the polygons must be visible to users.) Our experience building our system leads us to offer three tips to practitioners.

First, match the visualization with the users’ knowledge and domain. As we explained before, we adopted a map representation because it was more intuitive for the intended users. They were already used to this type of representation and therefore wouldn’t need to learn a new visualization scheme.

Second, get early feedback. We first built a simple visualization; then, on the basis of user feedback, we improved the system. Getting early feedback was important to ensure we were meeting the users’ expectations before we committed more resources designing and implementing them.

Finally, add visualizations as needed. Instead of developing a system with several types of visualizations from the beginning, we first plotted one dimension of georeferenced data on maps at a time. When users needed to see the correlation of different data dimensions, we added that information to the visualization by using polygons and markers on Google Maps.

Future studies will focus on integrating other information sources and other incident types besides run-overs. So far, the positive feedback we’ve received suggests that our system can be of great value to safety engineers and specialists in both making and communicating decisions.

Acknowledgments
We thank our Vale collaborators for their thought-provoking insights and help with our visual-analytics system’s design.

References
1. H. Fukuda, “A Study on Incident Analysis Method for Railway Safety Management,” Quarterly Report Railway Technical Research Inst., vol. 43, no. 2, 2002, pp. 83–86.
2. J.J. Thomas and K.A. Cook, Illuminating the Path: The Research and Development Agenda for Visual Analytics, IEEE CS, 2005.
3. D. Borland and R.M. Taylor, “Rainbow Color Map (Still) Considered Harmful,” IEEE Computer Graphics and Applications, vol. 27, no. 2, 2007, pp. 14–17.
4. M. Kilibarda and B. Bajat, “PlotGoogleMaps: The R-Based Web-Mapping Tool for Thematic Spatial Data,” Geomatica, vol. 66, no. 1, 2012, pp. 37–49.
5. D. Borland and A. Huber, “Collaboration-Specific Color-Map Design,” IEEE Computer Graphics and Applications, vol. 31, no. 4, 2011, pp. 7–11.

Wallace P. Lira is a software engineer in the Vale Institute of Technology’s Environmental Informatics Lab. His research interests are visual analytics and software quality. Lira received his MSc in computer science from the Federal University of Pará. Contact him at wallace.lira@gmail.com.

Ronnie Alves is an adjunct researcher in the Vale Institute of Technology’s Environmental Informatics Lab. His research interests are data mining, machine learning, and bioinformatics. Alves received a PhD in computer science from the University of Minho. Contact him at ronnie.alves@itv.org.

Jean M.R. Costa is a PhD student in Cornell University’s Department of Information Science. At Cornell, he has been working on research projects related to mobile sensing and social network analysis. He was with the Vale Institute of Technology when the research in this article was performed. Costa received an MSc in computer science from the Federal University of Pará. Contact him at jmd487@cornell.edu.

Gustavo Pessin is an assistant researcher in the Vale Institute of Technology’s Environmental Informatics Lab. His research interests are mobile robotics, machine learning, and wireless sensor networks. Pessin received his PhD in computer science from the University of São Paulo’s Institute of Mathematics and Computer Science. Contact him at gustavo.pessin@itv.org.

Lilyan Galvão is an MSc student in sustainable use of natural resources for tropical regions at the Vale Institute of Technology. Her research interests are urban morphology and informal settlements. Galvão received a BSc in architecture and urbanism from the Federal University of Pará. Contact her at lilyangalvaos@gmail.com.

Ana C. Cardoso is an associate professor in the Federal University of Pará’s Public Policies Department. She was with the Vale Institute of Technology when the research in this article was performed. Her research interests are urban planning, urban studies, and urban design. Cardoso received a PhD in architecture from Oxford Brookes University. Contact her at aclaudiacardoso@gmail.com.

Cleidson R.B. de Souza is a research manager at the Vale Institute of Technology and an associate professor at the Federal University of Pará. His research interests are computer-supported cooperative work and software engineering. De Souza received a PhD in information and computer sciences from the University of California, Irvine. Contact him at cleidson.desouza@acm.org.
