Root Cause Analysis of Incidents Using Text Clustering and Classification Algorithms
1 Introduction
An occupational accident is an unexpected event in which an employee is injured
due to external factors. In India, about 48,000 workers die annually
due to occupational accidents, and 24.2 percent of these fatal accidents take
place at construction sites [1]. Workers' lifestyle, demographic, and workplace
factors also contribute to occupational accidents. These factors include
smoking and alcohol consumption [2], age [3], and shift work [4]. To enrich the
safety culture and well-being of employees, recent research aims to identify
the root causes of accidents that lead to a hazardous occurrence, and stop it
2 Sobhan Sarkar et al.
before it happens [5,6]. The steel industry collects incident data in
categorical, numerical, or text form. Many studies have already been conducted on
categorical or numerical attributes [7–11]. Text data are difficult to analyse
and hence remain under-utilized [12]. The text description acts as a lagging
indicator and describes the situation during the incident. Lagging indicators
are statistics extracted from past accident records, whereas leading
indicators are measures that signal a future incident [13–15]. Both kinds of
indicators are analysed to take preventive actions and control injuries. Lagging
indicators provide information about the number of injured people and the
severity of injuries, but evaluating safety performance with lagging indicators
alone has the drawback of providing no information on how well the company is
responding to prevent incidents. Clustering of text documents can be used to
extract potentially dangerous causal factors from a huge amount of accident data,
which is difficult to do using conventional methods. The keywords generated
by incident clustering may reveal the root causes and frequent places
of occurrence of incidents, which can then act as a leading indicator. The
information gained will help develop a safety action plan and reduce the risk
factors. The text description can also be used to predict the primary cause
behind the incidents.
Text classification is the process of assigning a class label to a document using
supervised machine learning (ML) approaches, which require a collection of
documents with predefined class labels [16]. Natural Language Processing (NLP)
is used to perform document classification [12]. The task is to improve the
consistency, efficiency, and performance of document classification algorithms
by experimenting with random forest (RF) and support vector machine (SVM)
classifiers under different tokenization methods such as unigram and bigram. In
document classification, the classes are known in advance and the documents are
assigned to these classes, whereas in document clustering the classes are not
known. Thus, document classification and document clustering are different:
classification uses supervised ML approaches, whereas clustering requires
unsupervised ML approaches.
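The comparison described above can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes scikit-learn, uses a linear-kernel SVM (`LinearSVC`) because the kernel is not stated here, and the narratives, labels, and the names `macro_f1`, `TOKENIZATIONS`, and `CLASSIFIERS` are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy narratives with 'Primary Cause' labels (not the study's data).
train_texts = [
    "worker slipped on oily floor near the furnace",
    "employee fell from ladder while climbing",
    "operator tripped over cable and fell on stairs",
    "bike skidded on wet road near the plant gate",
    "truck hit a bike on the road to the plant",
    "employee riding bike was dashed by a truck on the road",
    "welding spatter caused fire in the cable tray",
    "leakage of flammable gas caused fire near the furnace",
    "hot metal splashed and the electric cable caught fire",
]
train_labels = ["Slip/Trip/Fall"] * 3 + ["Road Accident"] * 3 + ["Fire/Explosion"] * 3
test_texts = [
    "worker fell from ladder on slippery steps",
    "bike skidded on the road while going home",
    "heated cable caught fire during welding",
]
test_labels = ["Slip/Trip/Fall", "Road Accident", "Fire/Explosion"]

# Tokenization schemes and classifiers to compare (assumed settings).
TOKENIZATIONS = {"unigram": (1, 1), "bigram": (2, 2)}
CLASSIFIERS = {
    "SVM": lambda: LinearSVC(),
    "RF": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
}


def macro_f1(make_classifier, ngram_range):
    """Fit a tf-idf + classifier pipeline and return macro-averaged F1 on the test set."""
    model = make_pipeline(TfidfVectorizer(ngram_range=ngram_range), make_classifier())
    model.fit(train_texts, train_labels)
    return f1_score(test_labels, model.predict(test_texts), average="macro")


results = {
    (clf_name, tok_name): macro_f1(make_clf, ngram_range)
    for clf_name, make_clf in CLASSIFIERS.items()
    for tok_name, ngram_range in TOKENIZATIONS.items()
}

for (clf_name, tok_name), score in sorted(results.items()):
    print(f"{clf_name:3s} {tok_name:7s} macro-F1 = {score:.2f}")
```

On a toy corpus this small the scores carry no meaning; the sketch only shows how the four classifier and tokenization combinations can be evaluated under the same protocol.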
Some previous work has applied text mining and ML approaches to occupational
accident analysis. Chokor et al. evaluated the strength of unsupervised
ML and NLP algorithms in supporting safety investigation by analyzing
occupational accidents in Arizona [17]. Fragiadakis et al. used Multivariate
Linear Regression (MVLR) and Genetic Algorithm (GA) to analyse the impact
of current conditions on shipbuilding industry accidents [18]. Sarkar et al. used
Bayesian Network (BN) and fault tree analysis (FTA) to develop a prediction
model based on text mining and predict occupational accidents in the steel
industry [16]. Taylor et al. used Bayesian models to analyse injury and near-miss
incidents in the fire and emergency services industry [19]. Brooks et al. used
text mining techniques to analyse text descriptions of occupational accidents
and act accordingly upon compensation claims made by workers [20]. Vallmuur
et al. used Bayesian network (BN) methods to predict injury categories using
textual injury surveillance data [21]. ML algorithms have been notably successful
in classifying occupational injury narratives, but there is limited
utilization of ML algorithms in grouping documents of occupational hazard
Root cause analysis of steel plant incidents... 3
2 Methodology
This study aims to uncover the hidden causal factors behind each accident
separately, using unsupervised ML to assist safety inspections. It also aims to
predict the ‘Primary Cause’ labels of the incidents by applying classification
algorithms (namely SVM and RF) to the narrative text. The flowchart of the
proposed methodology is shown in Fig. 1.
This stage removes meaningless information from the narratives and retains the
important information. In the proposed approach, preprocessing
comprises three stages: (i) tokenization, (ii) lemmatization, and (iii) stopword
removal [22]. After preprocessing, the text is represented in term
frequency-inverse document frequency (tf-idf) vector form. In this
representation, idf normalizes the frequency of each term, which reduces the
importance of terms that occur commonly across the collection. For example, in a
collection of documents on accidents, the term “accident” is likely to
occur in practically every document, so it receives a low weight. This ensures
that document matching is influenced more by terms whose frequencies are
relatively low in the entire collection.
value is computed for each of the top five ‘Primary Cause’ clusters separately.
The plots of the average SI value for ‘Slip/Trip/Fall’, ‘Road Accident’,
‘Material Handling’, ‘Fire/Explosion’, and ‘Process Incident’ for different
numbers of clusters are shown in Figs. 3a-3e.
Silhouette analysis is used to measure clustering performance and to determine
the optimal number of clusters. A high score
implies that documents inside a cluster are similar, whereas documents in two
different clusters are not. According to Fig. 3, 10 clusters
are optimal for documents whose ‘Primary Cause’ label is ‘Slip/Trip/Fall’.
Similarly, 6 clusters are optimal for ‘Road Accident’ documents, 4 for
‘Material Handling’, 5 for ‘Fire/Explosion’, and 5 for ‘Process Incident’. Each
cluster of each ‘Primary Cause’ is analyzed and its top key terms are extracted.
These terms differentiate the clusters and point to possible root causes behind
each ‘Primary Cause’. The key terms for each ‘Primary Cause’ are extracted
from the preprocessed narrative text of each cluster, as shown in Table 1.
The top key terms of each cluster for each ‘Primary Cause’ are analyzed to
understand how they contributed to the incident. From this analysis, the
root causes behind each ‘Primary Cause’ are identified. Using these root causes, a
CE diagram is drawn, as shown in Fig. 4, so that proper corrective measures can
be taken. A few recommendations and findings obtained from these analyses are:
(i) ‘Slip/Trip/Fall’ incidents occurred while climbing ladders. This indicates
ladders with slippery steps, poor footwear of the worker, or a staircase
position that is not ideal for climbing. Slipping is also caused by occasional
spills and wet or oily surfaces. Proper housekeeping is necessary
for preventive measures such as sophisticated footwear, advanced flooring, and
instruction in walking techniques to be effective.
(ii) In the case of ‘Slip/Trip/Fall’, the term “fall” generally denotes a worker
falling from a height. This includes falls from roofs or ladders, or falling
down the stairs. However, some incidents of falling stones are incorrectly
labeled as ‘Slip/Trip/Fall’, which may be because the person logging the
incident lacks knowledge of the category definitions. Measures
must be taken to educate the operator.
(iii) ‘Material Handling’ incidents generally take place when employees are
using vehicles such as trucks and cranes to lift and load heavy materials.
Workers should be trained to operate these machines properly.
(iv) ‘Road Accident’ incidents mainly involved employees riding bikes on
their way to the plant or back home. Heavy vehicles such as cranes and
trucks, and skidding of vehicles, are also responsible for road accidents.
(v) ‘Fire/Explosion’ incidents happened due to various factors such as splatter
of welding particles, heated electric cables, and hot metals. In some incidents,
it was found that leakage of flammable material caused a fire. Measures must be
taken to reduce welding splatter, heating of cables, and leakage of flammable
material.
Fig. 3 panels: (a) Silhouette score for slip/trip/fall; (b) Silhouette score for road accident; (c) Silhouette score for material handling; (d) Silhouette score for fire explosion
Fig. 5: F1 score of SVM for different tokenisation.
Fig. 6: Precision and recall of SVM with unigram tokenization.
Table 2 shows some of the misclassified labels and the reasons. The most
significant reason behind misclassification is that the tf-idf representation
does not capture the contextual information of the terms in an incident
narrative. Bi-gram tokenization may retain some contextual information, but the
results of the experiments show that uni-gram tokenization performs best.
Without context, the classifier can misclassify a document by focusing on terms
that are not linked to the cause of the incident. Another important reason for
misclassification is the similarity between some labels, which makes it difficult
even for operators to classify incidents manually. This problem occurs with
similar label pairs such as “Rail” and “Derailment”, “Road Accident” and
“Dashing/collision”, and “Skidding” and “Slip/Trip/Fall”.
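To make the difference between the two tokenization schemes concrete, the following small sketch lists the tokens each scheme produces for one narrative. It assumes scikit-learn's `CountVectorizer` analyzer; the narrative and the helper name `ngram_tokens` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical incident narrative (not taken from the study's data).
narrative = "truck skidded on wet road"


def ngram_tokens(text, ngram_range):
    """Return the tokens that a vectorizer with this n-gram range would count."""
    analyzer = CountVectorizer(ngram_range=ngram_range).build_analyzer()
    return analyzer(text)


unigrams = ngram_tokens(narrative, (1, 1))
bigrams = ngram_tokens(narrative, (2, 2))

print(unigrams)  # single words, word order discarded by the vector model
print(bigrams)   # adjacent word pairs, which keep some local context
```

A uni-gram model sees “skidded” and “road” as unrelated terms, whereas the bi-gram “wet road” keeps a fragment of context. Bi-grams also produce a much larger and sparser vocabulary, which may be one reason they did not outperform uni-grams here, although the text does not state this.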
4 Conclusions
References
1. Patel, D.A., Jha, K.N.: An estimate of fatal accidents in Indian construction. In:
Proceedings of the 32nd Annual ARCOM Conference. pp. 5–7 (2016)
2. Wells, S., Macdonald, S.: The relationship between alcohol consumption patterns
and car, work, sports and home accidents for different age groups. Accident
Analysis & Prevention 31(6), 663–665 (1999)
3. Laflamme, L., Menckel, E., Lundholm, L.: The age-related risk of occupational
accidents: the case of Swedish iron-ore miners. Accident Analysis & Prevention
28(3), 349–357 (1996)
4. Nag, P., Patel, V.: Work accidents among shiftworkers in industry. International
Journal of Industrial Ergonomics 21(3-4), 275–281 (1998)
5. Singh, K., Raj, N., Sahu, S., Behera, R., Sarkar, S., Maiti, J.: Modelling safety
of gantry crane operations using Petri nets. International Journal of Injury Control
and Safety Promotion (Taylor & Francis) pp. 1–12 (2015)
6. Gautam, S., Maiti, J., Syamsundar, A., Sarkar, S.: Segmented point process models
for work system safety analysis. Safety Science (Elsevier) 95, 15–27 (2017)
7. Sarkar, S., Baidya, S., Maiti, J.: Application of rough set theory in accident analysis
at work: A case study. In: ICRCICN 2017, IEEE. pp. 245–250 (2017)
8. Sarkar, S., Vinay, S., Pateshwari, V., Maiti, J.: Study of optimized SVM for incident
prediction of a steel plant in India. In: INDICON 2017 (IEEE). pp. 1–6. IEEE
(2017)
9. Sarkar, S., Patel, A., Madaan, S., Maiti, J.: Prediction of occupational accidents
using decision tree approach. In: INDICON 2017 (IEEE). pp. 1–6. IEEE (2017)
10. Sarkar, S., Ejaz, N., Maiti, J.: Application of hybrid clustering technique for pattern
extraction of accident at work: A case study of a steel industry. In: 2018 4th
International Conference on Recent Advances in Information Technology (RAIT).
pp. 1–6. IEEE (2018)
11. Sarkar, S., Raj, R., Vinay, S., Maiti, J., Pratihar, D.K.: An optimization-based
decision tree approach for predicting slip-trip-fall accidents at work. Safety Science
118, 57–69 (2019)
12. Sarkar, S., Pateshwari, V., Maiti, J.: Predictive model for incident occurrences in
steel plant in India. In: ICCCNT 2017, IEEE. pp. 1–5 (2017)
13. Sarkar, S., Verma, A., Maiti, J.: Prediction of occupational incidents using
proactive and reactive data: A data mining approach. In: Industrial Safety
Management- 21st Century Perspective of Asia (Springer), pp. 65–79. Springer
Singapore (2018)
14. Verma, A., Chatterjee, S., Sarkar, S., Maiti, J.: Data-driven mapping between
proactive and reactive measures of occupational safety performance. In:
Industrial Safety Management- 21st Century Perspective of Asia (Springer), pp. 53–63.
Springer Singapore (2018)
15. Sarkar, S., Vinay, S., Raj, R., Maiti, J., Mitra, P.: Application of optimized
machine learning techniques for prediction of occupational accidents. Computers &
Operations Research (Elsevier) (2019)
16. Sarkar, S., Vinay, S., Maiti, J.: Text mining based safety risk assessment and
prediction of occupational accidents in a steel plant. In: 2016 International Conference
on Computational Techniques in Information and Communication Technologies
(ICCTICT). pp. 439–444. IEEE (2016)
17. Chokor, A., Naganathan, H., Chong, W.K., El Asmar, M.: Analyzing Arizona OSHA
injury reports using unsupervised machine learning. Procedia Engineering 145,
1588–1593 (2016)
18. Fragiadakis, N., Tsoukalas, V., Papazoglou, V.: An adaptive neuro-fuzzy inference
system (ANFIS) model for assessing occupational risk in the shipbuilding industry.
Safety Science 63, 226–235 (2014)
19. Taylor, J.A., Lacovara, A.V., Smith, G.S., Pandian, R., Lehto, M.: Near-miss
narratives from the fire service: a Bayesian analysis. Accident Analysis & Prevention
62, 119–129 (2014)
20. Brooks, B.: Shifting the focus of strategic occupational injury prevention: Mining
free-text, workers compensation claims data. Safety Science 46(1), 1–21 (2008)
21. Vallmuur, K.: Machine learning approaches to analysing textual injury surveillance
data: a systematic review. Accident Analysis & Prevention 79, 41–49 (2015)
22. Sarkar, S., Lohani, A., Maiti, J.: Genetic algorithm-based association rule mining
approach towards rule generation of occupational accidents. In: Communications
in Computer and Information Science (Springer), vol. 776, pp. 517–530. Springer,
Singapore (2017)
23. Huang, A.: Similarity measures for text document clustering. In: Proceedings of the
Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008),
Christchurch, New Zealand. vol. 4, pp. 9–56 (2008)
24. Rokach, L., Maimon, O.: Clustering methods. In: Data mining and knowledge
discovery handbook, pp. 321–352. Springer (2005)
25. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation
of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65
(1987)
26. Powers, D.M.: Evaluation: from precision, recall and F-measure to ROC,
informedness, markedness and correlation (2011)
27. Sarkar, S., Lakha, V., Ansari, I., Maiti, J.: Supplier selection in uncertain
environment: a fuzzy MCDM approach. In: Proceedings of the First International
Conference on Intelligent Computing and Communication. pp. 257–266. Springer
(2017)
28. Sarkar, S., Chain, M., Nayak, S., Maiti, J.: Decision support system for prediction
of occupational accident: A case study from a steel plant. In: Emerging
Technologies in Data Mining and Information Security, vol. 813, pp. 787–796. Springer,
Singapore (2019)
29. Sarkar, S., Kumar, A., Mohanpuria, S.K., Maiti, J.: Application of Bayesian
network model in explaining occupational accidents in a steel industry. In: 2017 Third
International Conference on Research in Computational Intelligence and
Communication Networks (ICRCICN). pp. 337–392. IEEE (2017)