In the railway industry, a significant amount of data is stored in textual format. Advances in natural language processing and text mining techniques enable automatic knowledge extraction and discovery from such documents. This paper presents a systematic review with quantitative and qualitative analyses to understand the current state of text-based research in the context of railway transport. The paper collects 107 relevant publications from the past decade and identifies the different channels through which researchers obtain text data in railways, together with the corresponding text analysis use-cases. Moreover, a comprehensive analysis is performed on the state-of-the-art machine learning and natural language processing methods. Four key research directions, namely multilingual NLP, digital maintenance, external data integration, and a railway-centred solution pipeline, are identified from Siemens Mobility’s perspective to highlight the most prominent challenges faced in the railway industry.

1. Introduction

Due to increasing interconnectivity and smart automation, the concept of ‘The Fourth Industrial Revolution’ (or ‘Industry 4.0’) has been introduced to mark the phase of significant industrial changes driven by breakthroughs in emerging technologies in fields such as robotics, artificial intelligence, fully autonomous vehicles, the internet of things (IoT), and fifth-generation wireless technologies (Schwab, 2017). It has since become a global trend, and discussions around topics like ‘digitalisation’, ‘big data’, and ‘machine learning (ML)’ never seem to cease. The railway sector is already transforming its operations. In recent years, owing to rapid digitalisation, a significant amount of data can now be collected, analysed, and interpreted in real time to discover meaningful insights in a way that would have been unthinkable a decade ago. Railway companies have considerably widened the range of services they can offer, from smart ticketing, rail analytics, and dynamic route scheduling to predictive and condition-based maintenance. These IoT applications have enabled operators to reduce costs, improve service quality and efficiency, optimise physical asset usage, and provide an enhanced customer experience. Despite the recent developments, the railway industry remains one of the least technologically transformed in numerous economies. However, there has been growing financial and political backing for the more comprehensive digitalisation of rail systems, since rail transport is a vital part of smart, reliable, and green mobility solutions. For example, the European Union aims to increase the transport budget for 2021 to 2027 by at least €10 billion, in a bid to provide more robust support for rail digitalisation and research and innovation programs (Scordamaglia, 2019). Shift2Rail is one of the flagship European technology and research programs to develop and validate sustainable, cost-efficient, and competitive railway solutions through research and collaboration (Furio et al., 2020).

The key to railway digital transformation is the seamless and continuous sharing and transferring of data across all sensors, devices, subsystems, and applications. All that data can be sorted into one of two categories: structured and unstructured data. Structured data, typically classified as quantitative data, is highly specific and is stored in a pre-defined format that can be interpreted by machines. In contrast, unstructured data is typically categorised as qualitative data and is a conglomeration of many varied types of data that are stored in their native formats (Ambika, 2020). Text and multimedia are two common types of unstructured data. Critical and valuable business information is often buried in unstructured data, and a survey by Forbes has found that more than 95% of businesses cite the need to manage and capitalise on unstructured data to remain competitive in the market (Kulkarni, 2019). In the railway industry, a significant amount of information is stored and accumulated in text format, including maintenance records, work logs, performance reports, diagnostic messages, passenger reviews, contracts, work orders, close call hazard reports, and accident reports. Such text is associated with almost every aspect of railways and can find applications in sub-domains such as digital maintenance, inventory management, vehicle health inspection, and transport planning (Bešinović et al., 2021a). The ability to employ automated solutions to extract, process, and analyse useful knowledge from free-format text data is key to enhancing efficiency and cost-saving for rail operators and increasing the reliability and performance of railways.

The concept at the centre of linguistics-based text analytics is known as natural language processing (NLP). As a critical component of artificial intelligence (AI), NLP is an interdisciplinary field of study of linguistics and computer science that aims to enable computer programs to understand human language as it is spoken and written (Hirschberg and Manning, 2015). It has been studied for over half a century; however, owing to the ambiguous and fuzzy nature of human language, text-based research remains an intriguing challenge for many. As Yoav Goldberg put it in his book, “Human language is highly ambiguous … It is also ever-changing and evolving. People are great at producing language and understanding language, and are capable of expressing, perceiving, and interpreting very elaborate and nuanced meanings. At the same time, while we humans are great users of language, we are also very poor at formally understanding and describing the rules that govern language” (Goldberg, 2017).

Despite the difficulties, numerous breakthroughs have been made in NLP over the past two decades. Thanks to the vast increase in computational power and data connectivity, NLP-based solutions have been integrated into a wide range of software applications that benefit businesses and our daily lives. Fundamental tasks such as language modelling, text classification, information retrieval, and question answering have been studied extensively and utilised regularly across different NLP projects (Vajjala et al., 2020). Fig. 1 displays a list of common NLP tasks ranked by their relative difficulty in terms of solution development. These tasks enable machines to process and understand textual information in substantial volumes within a much shorter period of time than the manual process. Such applications have accelerated the replacement of tedious and error-prone human work to reduce labour costs and improve efficiency.

Fig. 1. Relative difficulty levels for common NLP tasks (Vajjala et al., 2020).

Since the railway industry is information-rich and text-heavy, NLP could serve as a powerful tool to distil this source of information and unlock the digital potential to slash expenses, enhance reliability, and remain strongly competitive in the face of the rising digital race. On the other hand, the complexity of free-text information in the railway industry provides NLP researchers with a unique opportunity to develop state-of-the-art domain-specialised solutions. For example, text classification is the process of categorising texts into defined groups. Typical text classification tasks in the railway industry include identifying the root cause of system failure from fault diagnosis messages, and analysing customers’ opinions in textual data (e.g., tweets and public reviews) and classifying them into binary or multi-class labels. Similarly, information extraction and text summarisation can help railway operators retrieve useful and essential information from lengthy safety reports, discover hidden safety loopholes or bottlenecks, and enhance the efficiency and reliability of the service.

As machine learning evolves, deep learning and reinforcement learning have become the trend for certain NLP tasks. More specifically, both deep learning and reinforcement learning models have shown superior performance in dialogue generation, machine translation, and question answering tasks. These tasks are key to applications such as chatbots and enterprise user answering systems, which will greatly improve current railway services and customer experiences. Moreover, the technical language used in railways adds an extra layer of complexity and difficulty for NLP researchers. New adaptations to machine learning models and frameworks are needed to capture the underlying semantic relationships in the railway environment and return reliable performance with strong generalisation ability.

It is believed that such NLP applications can assist the traditional railway industry to reap the benefits of automation and digitalisation, as well as provide a platform for researchers to process, explore, and interpret technical language-related data for better comprehension. Text-based research and applications in railways are projected to increase as NLP and text mining (TM) technologies advance. Furthermore, due to the complexity and diversity of its projects, the railway industry, as a domain-specific and labour-intensive industry, lags behind other sectors in text-based solution adoption. Therefore, it is time to review the text-based research and studies that have been done across the whole railway sector and discuss the research gaps and future directions.

As a result of the rapid growth of ML and NLP applications in railways, several reviews on this theme have been conducted and published. Nonetheless, most of them only feature NLP as a sub-area of broad ML methods used to solve rail challenges. For example, Ghofrani et al. (2018) reviewed recent big data analytics (BDA) applications in operations, maintenance, and safety aspects of railway transportation, among which text-based methods were introduced to tackle maintenance and safety problems. Subsequently, Bešinović et al. (2021a) presented a structured taxonomy to guide researchers to understand how AI techniques are linked with specific railway applications such as maintenance, security, autonomous driving, and traffic management. As another key application of AI/ML in railways, Rad et al. (2021) and Hadj-Mabrouk (2019b) summarised innovative ML-based methods for analysing railway accidents and identifying accident causation. Most recently, Pappaterra et al. (2021) gave a review of AI studies and research conducted on the publicly available datasets in different domains of the railway sector. These articles highlighted various ML and deep learning (DL) algorithms applied to a wide range of data types such as numeric-, text-, voice- and image-based data, but lacked emphasis on techniques related to text analysis. There is no comprehensive and holistic literature review that solely concentrates on text-based research and applications in the railway industry. The authors aim to produce an extensive and systematic review of academic publications in the last 10 years that have applied NLP and other text-based methods in the domain of railways. Four exploratory research questions are determined to guide this study.

RQ1. What types of data in railways were used to develop text-based algorithms?
RQ2. What types of applications have been developed in the previous studies?
RQ3. What text-based analysis methods were utilised in the previous studies?
RQ4. What signposts can we identify for future research from the railway industry’s perspective?

This paper is organised as follows. Section 2 explains the research methodology for collecting text-based research publications in railways and extracting keyphrases from these papers. Section 3 analyses the literature database and the research trend from a data perspective. Section 4 explains the railway data sources that have been adopted in the previous research and the corresponding research objectives. Section 5 lists the main text-based methods and algorithms explored in these papers. Future research directions are discussed in Section 6 and a conclusion is drawn for this literature review in Section 7.

2. Research methodology

This section presents the method used to collect relevant literature from publicly accessible research databases. Detailed selection criteria and explanations are outlined in this section. Furthermore, comprehensive data insight is gained through both qualitative and quantitative analysis.

2.1. Data acquisition and selection criteria

From a methodology perspective, defining boundaries is one of the most critical steps for conducting a literature review (Sadeghi and Askarinejad, 2012). Therefore, the following four criteria were adopted in the three-stage literature retrieval process, as illustrated in Fig. 2, to define the search space of peer-reviewed papers.

1. Web of Science (WoS), Scopus, and American Society of Civil Engineers (ASCE) are selected as the main academic databases for targeted publications. Additionally, ML and NLP researchers tend to publish their work at major conferences. Hence, IEEE Xplore and ACM Digital Library are also included in the database selection to minimise search oversight. As suggested by the previous reviews (Bešinović et al., 2021a, Ghofrani et al., 2018), text-based analysis in the railway sector emerged in the 2010s and the volume of applications has risen sharply ever since. As a result, this analysis covers only publications from the last 10 years, from 2013 to 2022, to highlight the most recent developments in the field. Moreover, two more parameters are used to determine the scope of the publication search: document type (‘Journal article’, ‘Conference’) and language (‘English’).

2. To make the search more effective, various keywords and conditions are used to automatically identify the literature relevant to text-based research in railways. At the same time, the terms ‘railways’ and ‘text-based research’ represent broad spheres of study and have specific sub-domains and sub-groups in scientific publications. As shown in Fig. 1, the field of NLP covers a diverse collection of tasks and research topics. Similarly, ‘railway’ is a relatively ‘loose’ concept in rail terminology and it often incorporates distinct sub-groups of railways such as underground, high-speed train, and tram. What is more, it lacks the standardised use of terminology due to the parallel development of rail transport systems in different parts of the world, leading to varied forms of terminology and potential contextual confusion (Anon, 2021a). Also, rail terminology can have different interpretations outside of the field. For instance, the word ‘train’ has double meanings in research; it can refer to railway carriages in transportation, as well as the training of machine learning models in data science topics. Hence, extra caution needs to be taken to exclude irrelevant papers from the search query choices. Consequently, this study uses variations such as ‘high-speed train’, ‘freight train’, ‘railway’, ‘railroad’, ‘underground’, ‘tram’, and ‘metro’ as railways-related domain keywords. Likewise, the NLP-related search queries use ‘natural language processing’ and its sub-tasks described in Fig. 1 to ensure all relevant literature is captured. This review adopts logical operators, as shown in Fig. 2, to identify candidate papers that contain at least one match from each of the railway domain keywords and the NLP-related technical keywords.

3. The following literature selection process removes duplicated papers collected from the academic databases and digital libraries by comparing their paper titles and unique digital object identifiers (DOIs). Furthermore, with the rapid increase in the number of conferences, conference papers are typically much less rigorously reviewed, and the screening process is often fast (Al-Fedaghi, 2007). There are also occasions where similar papers are submitted to multiple conferences, resulting in the similarity of publications (Laplante et al., 2009). As a result, an online file compare tool, CopyLeaks, is employed to cross-check the selected conference publications, and any literature with a similarity score of over 50% is singled out for additional manual screening.

4. To ensure the relevance and quality of the reviewed literature in this analysis, a manual screening process is introduced at the end to read through the abstract and body of text. This step attempts to identify unrelated articles queried from the databases due to query ambiguity and misinterpretation. Most examples come from image-based research, in which texts are extracted from images for recognition and analysis. Although text analysis is included, such literature does not focus on NLP or text-based algorithms; thus it is considered out of the scope of this review and is removed from the selection. Moreover, the ‘similar’ conference proceedings from Step 3 are manually examined to determine whether they should be deemed duplicates. Similarities in dataset, methodology, and conclusion are the main deciding factors in this process.
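The scope parameters of criterion 1 and the keyword logic of criterion 2 can be illustrated with a minimal Python sketch. It is not the query actually submitted to the databases: the record fields (`title`, `abstract`, `year`, `doc_type`, `language`), the choice to search title and abstract only, and the particular NLP sub-task keywords listed are assumptions made for illustration, since the exact query in Fig. 2 is not reproduced here. Matching on whole phrases such as ‘freight train’ rather than the bare word ‘train’ reflects the ambiguity discussed above.

```python
import re

# Railway keywords quoted in the text; bare "train" is deliberately absent
# because it also matches "training" of ML models.
RAILWAY_KEYWORDS = ["high-speed train", "freight train", "railway",
                    "railroad", "underground", "tram", "metro"]

# Illustrative NLP keywords only (assumption): the full list of sub-tasks
# comes from Fig. 1, which is not reproduced in this section.
NLP_KEYWORDS = ["natural language processing", "text classification",
                "information extraction", "text summarisation",
                "question answering", "machine translation"]


def _compile(keywords):
    """Build a case-insensitive, whole-word pattern (optional plural 's')."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternatives})s?\b", re.IGNORECASE)


RAILWAY_RE = _compile(RAILWAY_KEYWORDS)
NLP_RE = _compile(NLP_KEYWORDS)


def is_candidate(record):
    """Return True if a bibliographic record passes criteria 1 and 2."""
    text = f"{record['title']} {record['abstract']}"
    return (
        2013 <= record["year"] <= 2022
        and record["doc_type"] in {"Journal article", "Conference"}
        and record["language"] == "English"
        # Logical AND: at least one match from each keyword group.
        and RAILWAY_RE.search(text) is not None
        and NLP_RE.search(text) is not None
    )
```

Whole-word matching keeps ‘metro’ from matching ‘metropolitan’, but a keyword such as ‘underground’ can still retrieve unrelated papers, which is one reason the manual screening in criterion 4 remains necessary.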
Taking the described criteria and conditions into account, a total of 107 papers related to the study area were identified and stored in our database for further analysis; of these, 61 were journal articles and 46 were conference proceedings.
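The de-duplication and similarity check of criterion 3 can be sketched in the same way. The review itself uses CopyLeaks for the similarity scores; the sketch below substitutes Python's standard `difflib.SequenceMatcher` over word sequences as a stand-in, so its scores are not comparable with CopyLeaks output. The record fields (`title`, `doi`, `id`, `text`) are hypothetical.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def deduplicate(records):
    """Keep the first record seen for each DOI or normalised title."""
    seen, unique = set(), []
    for rec in records:
        keys = {("title", normalise_title(rec["title"]))}
        if rec.get("doi"):
            keys.add(("doi", rec["doi"].lower()))
        if keys & seen:  # same DOI or same title as an earlier record
            continue
        seen |= keys
        unique.append(rec)
    return unique


def flag_similar(papers, threshold=0.5):
    """Return (id_a, id_b, score) for pairs scoring above the threshold.

    SequenceMatcher is a stand-in for CopyLeaks (assumption); flagged
    pairs are meant for manual screening, not automatic removal.
    """
    flagged = []
    for a, b in combinations(papers, 2):
        score = SequenceMatcher(None, a["text"].split(), b["text"].split()).ratio()
        if score > threshold:
            flagged.append((a["id"], b["id"], score))
    return flagged
```

The pairwise comparison grows quadratically with the number of papers, which is acceptable for a conference subset of this size (46 papers) but would need a cheaper pre-filter for much larger collections.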