
Abstract

In the railway industry, a significant amount of data is stored in the textual format. The advanced development of natural language processing and text mining techniques enables automatic knowledge extraction and discovery from such documents. This paper presents a systematic review with quantitative and qualitative analyses to understand the current state of text-based research in the context of railway transport. The paper collects 107 relevant publications in the past decade and identifies different channels for researchers to obtain text data in railways and the corresponding text analysis application use-cases. Moreover, a comprehensive analysis is performed on the state-of-the-art machine learning and natural language processing methods. Four key research directions, namely multilingual NLP, digital maintenance, external data integration, and railway-centred solution pipeline, are identified from Siemens Mobility’s perspective to highlight the most prominent challenges faced in the railway industry.

1. Introduction

Due to increasing interconnectivity and smart automation, the concept of ‘The Fourth Industrial Revolution’ (or ‘Industry 4.0’) has been introduced to mark the phase of significant industrial changes driven by breakthroughs in emerging technologies in fields such as robotics, artificial intelligence, fully autonomous vehicles, the internet of things (IoT), and fifth-generation wireless technologies (Schwab, 2017). It has since become a global trend and discussions around topics like ‘digitalisation’, ‘big data’, and ‘machine learning (ML)’ never seem to cease.

The impact of Industry 4.0 across the railway sector is already transforming its operations. In recent years, owing to rapid digitalisation, a significant amount of data can now be collected, analysed, and interpreted in real-time to discover meaningful insights in a way that would have been unthinkable a decade ago. Railway companies have considerably widened the range of services they can offer: from smart ticketing, rail analytics, and dynamic route scheduling to predictive and condition-based maintenance. These IoT applications have enabled operators to reduce costs, improve service quality and efficiency, optimise physical asset usage, and provide enhanced customer experience. Despite the recent developments, the railway industry remains one of the least technologically transformed in numerous economies. However, there has been growing financial and political backing for the more comprehensive digitalisation of rail systems since rail transport is a vital part of smart, reliable, and green mobility solutions. For example, the European Union aims to increase the transport budget for 2021 to 2027 by at least €10 billion, in a bid to provide more robust support for rail digitalisation and research and innovation programs (Scordamaglia, 2019). Shift2Rail is one of the flagship European technology and research programs to develop and validate sustainable, cost-efficient, and competitive railway solutions through research and collaboration (Furio et al., 2020).

The key to railway digital transformation is the seamless and continuous sharing and transferring of data across all sensors, devices, subsystems, and applications. All that data can be sorted into one of two categories: structured and unstructured data. Structured data, typically classified as quantitative data, is highly specific and is stored in a pre-defined format that can be interpreted by machines. In contrast, unstructured data is typically categorised as qualitative data and is a conglomeration of many varied types of data that are stored in their native formats (Ambika, 2020). Text and multi-media are two common types of unstructured data. Critical and valuable business information is often buried in unstructured data, and a survey by Forbes has found that more than 95% of businesses cite the need to manage and capitalise on unstructured
data to remain competitive in the market (Kulkarni, 2019). In the railway industry, a significant amount of information is stored and accumulated in text format, including maintenance records, work logs, performance reports, diagnostic messages, passenger reviews, contracts, work orders, close call hazard reports, and accident reports. It is associated with almost every aspect of railways and can find applications in sub-domains such as digital maintenance, inventory management, vehicle health inspection, and transport planning (Bešinović et al., 2021a). The ability to employ automated solutions to extract, process, and analyse useful knowledge from free-format text data is key to enhancing efficiency and cost-saving for rail operators and increasing the reliability and performance of railways. The concept at the centre of linguistics-based text analytics is known as natural language processing (NLP). As a critical component of artificial intelligence (AI), NLP is an interdisciplinary field of study of linguistics and computer science that aims to enable computer programs to understand human language as it is spoken and written (Hirschberg and Manning, 2015). It has been studied for over half a century; however, due to its ambiguous and fuzzy nature, text-based research remains an intriguing challenge for many. As Yoav Goldberg put it in his book, “Human language is highly ambiguous … It is also ever-changing and evolving. People are great at producing language and understanding language, and are capable of expressing, perceiving, and interpreting very elaborate and nuanced meanings. At the same time, while we humans are great users of language, we are also very poor at formally understanding and describing the rules that govern language” (Goldberg, 2017).

Despite the difficulties, numerous breakthroughs have been made in NLP over the past two decades. Thanks to the vast increase of computational power and data connectivity, NLP-based solutions have been integrated into a wide range of software applications that benefit businesses and our daily lives. Fundamental tasks such as language modelling, text classification, information retrieval, and question answering have been studied extensively and utilised regularly across different NLP projects (Vajjala et al., 2020). Fig. 1 displays a list of common NLP tasks ranked by their relative difficulty in terms of solution development. These tasks enable machines to process and understand textual information in substantial volumes within a much shorter period of time than the manual process. Such applications have accelerated the replacement of tedious and error-prone human work to reduce labour costs and improve efficiency. Since the railway industry is information-rich and text-heavy, NLP could serve as a powerful tool to distil this source of information and unlock the digital potential to slash expenses, enhance reliability, and remain strongly competitive in the face of the rising digital race. On the other hand, the complexity of free-text information in the railway industry provides NLP researchers with a unique opportunity to develop state-of-the-art domain-specialised solutions. For example, text classification is the process of categorising texts into defined groups. Typical text classification tasks in the railway industry include identifying the root cause of system failure from fault diagnosis messages and analysing customers’ opinions in textual data (e.g., tweets and public reviews) and classifying them into binary or multi-class labels. Similarly, information extraction and text summarisation can help railway operators retrieve useful and essential information from lengthy safety reports, discover hidden safety loopholes or bottlenecks, and enhance the efficiency and reliability of the service.

As machine learning evolves, deep learning and reinforcement learning have become the trend for certain NLP tasks. To be more specific, both deep learning and reinforcement learning models have shown superior performance in dialogue generation, machine translation, and question answering tasks. These tasks are key to applications such as chatbots and enterprise user answering systems, which will greatly improve current railway services and customer experiences. Moreover, the technical language used in railways will add an extra layer of complexity and difficulty for NLP researchers. New adaptations to the machine learning models and frameworks are needed to capture the underlying semantic relationships in the railway environment and return reliable performance with strong generalisation ability. It is believed that such NLP applications can
assist the traditional railway industry to reap the benefits of automation and digitalisation, as well as provide a platform for researchers to process, explore, and interpret technical language-related data for better comprehension.

Text-based research and applications in railways are projected to increase as NLP and text mining (TM) technologies advance. Furthermore, due to the complexity and diversity of its project nature, the railway industry, as a domain-specific and labour-intensive industry, lags behind other sectors in text-based solution adoption. Therefore, it is time to review text-based research and studies that have been done across the whole railway sector and discuss the research gaps and future directions.

As a result of the rapid growth of ML and NLP applications in railways, several reviews about this theme have been conducted and published. Nonetheless, most of them only feature NLP as a sub-area of broad ML methods used to solve rail challenges. For example, Ghofrani et al. (2018) reviewed recent big data analytics (BDA) applications in operations, maintenance, and safety aspects of railway transportation, among which text-based methods were introduced to tackle maintenance and safety problems. Subsequently, Bešinović et al. (2021a) presented a structured taxonomy to guide researchers to understand how AI techniques are linked with specific railway applications such as maintenance, security, autonomous driving, and traffic management. As another key application of AI/ML in railways, Rad et al. (2021) and Hadj-Mabrouk (2019b) summarised innovative ML-based methods for analysing railway accidents and identifying accident causation. Most recently, Pappaterra et al. (2021) gave a review of AI studies and research conducted on the publicly available datasets in different domains of the railway sector. These articles highlighted various ML and deep learning (DL) algorithms applied to a wide range of data types such as numeric-, text-, voice- and image-based data, but lacked emphasis on techniques related to text analysis. There is no comprehensive and holistic literature review that solely concentrates on text-based research and applications in the railway industry. The authors aim to produce an extensive and systematic review of academic publications in the last 10 years that have applied NLP and other text-based methods in the domain of railways. Four exploratory research questions are determined to guide this study.

Fig. 1. Relative difficulty levels for common NLP tasks (Vajjala et al., 2020).

RQ1. What types of data in railways were used to develop text-based algorithms?
RQ2. What types of applications have been developed in the previous studies?
RQ3. What text-based analysis methods were utilised in the previous studies?
RQ4. What signposts can we identify for future research from the railway industry’s perspective?

This paper is organised as follows. Section 2 explains the research methodology for collecting text-based research publications in railways and extracting keyphrases from these papers. Section 3 analyses the literature database and the research trend from a data perspective. Section 4 explains the railway data sources that have been adopted in the previous research and the corresponding research objectives. Section 5 lists the main text-based methods and algorithms explored in these papers. Future research directions are discussed in Section 6 and a conclusion is drawn for this literature review in Section 7.

2. Research methodology

This section presents the method used to collect relevant literature from publicly accessible research databases. Detailed selection criteria and explanations are outlined in this section.
Furthermore, comprehensive data insight is gained through both qualitative and quantitative analysis.

2.1. Data acquisition and selection criteria

From a methodology perspective, defining boundaries is one of the most critical steps for conducting a literature review (Sadeghi and Askarinejad, 2012). Therefore, the following four criteria were adopted in the three-stage literature retrieval process, as illustrated in Fig. 2, to define the search space of peer-reviewed papers.

1. Web of Science (WoS), Scopus, and American Society of Civil Engineers (ASCE) are selected as the main academic databases for targeted publications. Additionally, ML and NLP researchers are inclined to publish research works at major conferences over the years. Hence, IEEE Xplore and ACM Digital Library are also included in the database selection to minimise the search oversight. As suggested by the previous reviews (Bešinović et al., 2021a, Ghofrani et al., 2018), the development of text-based analysis in the railway sector emerged in the 2010s and the volume of applications has risen sharply ever since. As a result, this analysis covers only publications from the last 10 years, from 2013 to 2022, to highlight the most recent developments in the field. Moreover, two more parameters are used to determine the scope of publication research, including document type (‘Journal article’, ‘Conference’) and language (‘English’).

2. To make the search more effective, various keywords and conditions are used to automatically identify the literature relevant to the text-based research in railways. At the same time, the terms ‘railways’ and ‘text-based research’ represent broad spheres of study and have specific sub-domains and sub-groups in scientific publications. As shown in Fig. 1, the field of NLP covers a diverse collection of tasks and research topics. Similarly, ‘railway’ is a relatively ‘loose’ concept in rail terminology and it often incorporates distinct sub-groups of railways such as underground, high-speed train, and tram. What is more, it lacks the standardised use of terminology due to the parallel development of rail transport systems in different parts of the world, leading to varied forms of terminology and potential contextual confusion (Anon, 2021a). Also, rail terminology can have different interpretations outside of the field. For instance, the word ‘train’ has double meanings in research; it can refer to railway carriages in transportation, as well as the training of machine learning models in data science topics. Hence, extra caution needs to be taken to exclude irrelevant papers from the search query choices. Consequently, this study uses variations such as ‘high-speed train’, ‘freight train’, ‘railway’, ‘railroad’, ‘underground’, ‘tram’, and ‘metro’ as railways-related domain keywords. Likewise, the NLP-related search queries use ‘natural language processing’ and its sub-tasks described in Fig. 1 to ensure all relevant literature is captured. This review adopts logical operators, as shown in Fig. 2, to identify candidate papers that contain at least one match from each of the railway domain keywords and the NLP-related technical keywords.

3. The subsequent literature selection process removes duplicated papers collected from the academic databases and digital libraries by comparing their paper titles and unique digital object identifiers (DOIs). Furthermore, with the rapid increase in the number of conferences, conference papers are typically much less rigorously reviewed, and the screening process is often fast (Al-Fedaghi, 2007). There are also occasions where similar papers are submitted to multiple conferences, resulting in the similarity of publications (Laplante et al., 2009). As a result, an online file compare tool, CopyLeaks, is employed to cross-check the selected conference publications, and any literature with a similarity score of over 50% will be singled out for additional manual screening.
4. To ensure the relevance and quality of the reviewed literature in this analysis, a manual screening process is introduced at the end to read through the abstract and body of text. This step attempts to identify unrelated articles queried from the databases due to query ambiguity and misinterpretation. Most examples come from image-based research, in which texts are extracted from images for recognition and analysis. Although text analysis is included, such literature does not focus on NLP or text-based algorithms, so it is considered out of the scope of this review and is removed from the selection. Moreover, the ‘similar’ conference proceedings from Step 3 are manually examined to determine whether they are deemed duplicated. Similarities in dataset, methodology, and conclusion are the main deciding factors in this process.
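
The scope filters of Step 1 and the keyword logic of Step 2 can be illustrated with a short Python sketch. It is not the authors' actual search script: the query string uses a generic boolean form rather than the syntax of any particular database, the `Record` schema and function names are hypothetical, and the NLP keyword list is only an illustrative subset, since the full list of sub-tasks comes from Fig. 1.

```python
from dataclasses import dataclass

# Railway domain keywords as listed in Step 2 of the text.
RAILWAY_TERMS = ["high-speed train", "freight train", "railway", "railroad",
                 "underground", "tram", "metro"]

# Illustrative subset only (assumption): the paper takes the NLP sub-tasks
# from its Fig. 1, which is not reproduced here.
NLP_TERMS = ["natural language processing", "text classification",
             "information extraction", "text summarisation"]


def build_query(domain_terms, technical_terms):
    """Build a generic boolean query: (any domain term) AND (any technical term)."""
    def any_of(terms):
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return any_of(domain_terms) + " AND " + any_of(technical_terms)


@dataclass
class Record:
    """Hypothetical minimal metadata for one search result."""
    title: str
    year: int
    doc_type: str
    language: str


def in_scope(rec):
    """Apply the Step 1 filters: publication years, document type, language."""
    return (2013 <= rec.year <= 2022
            and rec.doc_type in {"Journal article", "Conference"}
            and rec.language == "English")


def matches_keywords(rec, domain_terms=RAILWAY_TERMS, technical_terms=NLP_TERMS):
    """True if the title contains at least one term from each keyword group."""
    text = rec.title.lower()
    return (any(t in text for t in domain_terms)
            and any(t in text for t in technical_terms))
```

Because the bare word ‘train’ is not a domain keyword, a title such as "Training neural networks for text classification" is not matched, which mirrors the ambiguity discussed in Step 2. Plain substring matching is crude, however, and the manual screening of Step 4 would still be needed.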

Taking the described criteria and conditions into account, a total of 107 papers related to the study area were identified and stored in our database for further analysis; of these, 61 were journal articles and 46 were conference proceedings.
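
The duplicate removal and similarity screening of Steps 3 and 4, which lead to this final set, could be sketched as follows. This is a minimal illustration under stated assumptions: papers are plain dictionaries with hypothetical `title`, `doi`, and `text` keys, and Python's standard `difflib.SequenceMatcher` stands in for the CopyLeaks similarity score, whose scoring method is not described in the text. Only the 50% threshold is taken from the paper.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(papers):
    """Keep the first record for each DOI or normalised title.

    Each paper is a dict with a 'title' key and an optional 'doi' key
    (hypothetical schema, not taken from the paper).
    """
    seen_dois, seen_titles, unique = set(), set(), []
    for paper in papers:
        doi = (paper.get("doi") or "").lower()
        title = normalise_title(paper["title"])
        if (doi and doi in seen_dois) or title in seen_titles:
            continue  # already collected from another database
        if doi:
            seen_dois.add(doi)
        seen_titles.add(title)
        unique.append(paper)
    return unique


def flag_similar(papers, threshold=0.5):
    """Return (i, j, score) for pairs whose similarity exceeds the threshold.

    SequenceMatcher is a stand-in (assumption) for the CopyLeaks score;
    flagged pairs are only candidates for manual screening, not removals.
    """
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(papers), 2):
        score = SequenceMatcher(None, a["text"], b["text"]).ratio()
        if score > threshold:
            flagged.append((i, j, round(score, 2)))
    return flagged
```

Flagging rather than deleting keeps the final decision with the reviewer, in line with Step 4, where similarities in dataset, methodology, and conclusion are judged manually.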
