Root Cause Analysis of Incidents Using Text Clustering and Classification Algorithms

Sobhan Sarkar1, Numan Ejaz2, Mehul Kumar3, and J. Maiti4

1 Department of Industrial & Systems Engineering, IIT Kharagpur, India, sobhan.sarkar@gmail.com
2 numan.ejaz1897@gmail.com
3 Department of Mechanical Engineering, IIT Kharagpur, India, mehulpranay.19@gmail.com
4 jhareswar.maiti@gmail.com

Abstract. The purpose of this study is to cluster injury narratives in order
to extract the root causes behind accidents. The analysis is carried out on
incident data collected from the database of an integrated steel plant. Key
terms generated by clustering the incident scenarios help in finding the root
causes of each type of incident. The study also proposes specific measures to
the management that would improve safety performance. Text document
clustering is used to discover the hidden factors and causes behind the
incidents. Understanding previous accidents is necessary to avoid future
accidents. However, for companies, managing large accident databases and
accurately classifying accident narratives are very challenging. Therefore, a
further aim of this study is to accurately classify accident reports using
text classification approaches and to evaluate their usefulness. The study
used two machine learning (ML) algorithms, namely random forest (RF) and
support vector machine (SVM), and found that SVM performed best in
classifying the accident narratives. Further, SVM was tested with different
tokenizations of the preprocessed narratives to obtain better results.
Keywords: Root cause analysis · Incident · Text clustering · Classification · Steel industry.
1 Introduction
An occupational accident is an unexpected event in which an employee is
injured due to external factors. In India, about 48,000 workers die annually
in occupational accidents, and 24.2 percent of these fatal accidents take
place at construction sites [1]. Workers' lifestyle, demographic, and
workplace factors also contribute to occupational accidents. These factors
include smoking and alcohol consumption [2], age [3], and shift work [4]. To
enrich the safety culture and well-being of employees, recent research has
focused on identifying the root causes of accidents that lead to hazardous
occurrences, so that they can be stopped before they happen [5,6]. The steel
industry collects incident data in categorical, numerical, or text form. Many
studies have already been conducted on categorical or numerical attributes
[7–11]. Text data are difficult to analyse and hence remain under-utilized
[12]. The text description acts as a lagging indicator and describes the
situation during the incident. Lagging indicators are statistics extracted
from past accident records, whereas leading indicators are measures that
signal a future incident [13–15]. Both kinds of indicators are analysed to
take preventive actions and control injuries. Lagging indicators provide
information about the number of injured people and the severity of injuries,
but relying on them alone to evaluate safety performance has a drawback: they
provide no information on how well the company is acting to prevent
incidents. Clustering of text documents can be used to extract potentially
dangerous causal factors from a large amount of accident data, which is
difficult with conventional methods. The keywords generated by incident
clustering may reveal the root causes and frequent places of occurrence of
incidents, which can then act as a leading indicator. The information gained
will help develop a safety action plan and reduce the risk factors. The text
description can also be used to predict the primary cause behind the
incidents. Text classification is the process of assigning a class label to a
document using supervised ML approaches, which require a collection of
documents with predefined class labels [16]. Natural Language Processing
(NLP) is used to perform document classification [12]. The task is to improve
the consistency, efficiency and performance of document classification by
experimenting with RF and SVM under different tokenization methods such as
unigram and bigram. In document classification, the classes are known and the
documents are assigned to these classes, whereas in document clustering the
classes are not known in advance. Thus, document classification and document
clustering are different: classification uses supervised ML approaches,
whereas clustering requires unsupervised ML approaches.
Some previous work has applied text mining and ML approaches to occupational
accident analysis. Chokor et al. evaluated the strength of unsupervised ML
and NLP algorithms in supporting safety investigation by analyzing
occupational accidents in Arizona [17]. Fragiadakis et al. used Multivariate
Linear Regression (MVLR) and Genetic Algorithm (GA) to analyse the impact of
current conditions on shipbuilding industry accidents [18]. Sarkar et al.
used Bayesian Network (BN) and fault tree analysis (FTA) to develop a
prediction model based on text mining and to predict occupational accidents
in a steel industry [16]. Taylor et al. used Bayesian models to analyse
injury and near-miss incidents in the fire and emergency services industry
[19]. Brooks et al. used text mining techniques to analyse text descriptions
of occupational accidents and to act accordingly upon compensation claims
made by workers [20]. Vallmuur et al. used Bayesian network (BN) methods to
predict injury categories from textual injury surveillance data [21]. ML
algorithms have been notably successful in classifying occupational injury
narratives, but their use in grouping occupational hazard reports has been
limited. Therefore, this study experiments with an assortment of text mining
techniques and assesses their potential for classifying documents. The
contribution of this study is the root cause analysis of accidents based on
text clustering. Such text-based analysis has rarely been reported in
previous studies in the occupational accident domain.
The remainder of the paper is organized as follows: Section 2 briefly
discusses the methods used in this study. In Section 3, the results are
presented and discussed. Finally, conclusions and the scope for future work
are presented in Section 4.
2 Methodology
This study aims to uncover the hidden causal factors behind each type of
accident separately using unsupervised ML, in order to assist safety
inspections. It also aims to predict the ‘Primary Cause’ labels of the
incidents by applying classification algorithms (namely SVM and RF) to the
narrative text. The flowchart of the proposed methodology is shown in Fig. 1.
Fig. 1: Proposed methodological flowchart.

2.1 Data preprocessing
This stage removes meaningless information from the narratives and retains
the important information. In the proposed approach, preprocessing comprises
three stages: (i) tokenization, (ii) lemmatization, and (iii) stopword
removal [22]. After preprocessing, the text is represented in term
frequency-inverse document frequency (tf-idf) vector form. In this
representation, idf normalizes the frequency of each term, which reduces the
importance of terms that occur commonly across the collection. For example,
in a collection of documents on accidents, the term “accident” is likely to
occur in practically every document, so it receives a low weight. This
ensures that document matching is influenced more by terms whose frequencies
are relatively low in the entire collection.
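
The preprocessing and weighting steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the stopword set, the lookup-table lemmatizer, the sample narratives and the particular tf-idf variant (relative term frequency times log(N/df)) are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical resources for illustration only; the paper does not list them.
STOPWORDS = {"the", "a", "an", "of", "on", "in", "from", "while", "was", "and"}
LEMMAS = {"fell": "fall", "slipped": "slip", "ladders": "ladder"}


def preprocess(text):
    """Tokenize, lemmatize via a toy lookup table, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    return [tok for tok in lemmas if tok not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [preprocess(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up narratives, not taken from the plant database.
narratives = [
    "Worker slipped on the ladder, accident while climbing",
    "Accident: bike skidding on road",
    "Hoist wire broken, accident while unloading billets",
]
vectors = tfidf(narratives)
print(vectors[0])
```

Because “accident” appears in every sample narrative, its idf is log(3/3) = 0 and it drops out of the matching, while a rarer term such as “ladder” keeps a positive weight.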
2.2 Root cause analysis by text-document clustering

Root cause analysis is a structured process that helps to discover the hidden
factors behind incidents. Fig. 2 shows a Pareto chart of the number of
documents belonging to each ‘Primary Cause’ category, in decreasing order. In
Fig. 2, ‘Cat 1’ is ‘Slip/Trip/Fall’, ‘Cat 2’ is ‘Road Accident’, ‘Cat 3’ is
‘Material Handling’, ‘Cat 4’ is ‘Fire/Explosion’, ‘Cat 5’ is ‘Process
Incidents’, ‘Cat 6’ is ‘Dashing/Collision’, ‘Cat 7’ is ‘Derailment’, ‘Cat 8’
is ‘Equipment Damage’, ‘Cat 9’ is ‘Electrical Flash’, ‘Cat 10’ is ‘Lifting
Tools Tackles’, ‘Cat 11’ is ‘Skidding’, ‘Cat 12’ is ‘Structural Integrity’,
‘Cat 13’ is ‘Gas Leakage’, ‘Cat 14’ is ‘Crane Dashing’, ‘Cat 15’ is ‘Hot
Metals’, ‘Cat 16’ is ‘Rail’ and ‘Cat 17’ is ‘Energy Isolation’. Of these, the
top 5 categories are selected to prepare a Cause and Effect (CE) diagram. A
CE diagram, often called a ‘fishbone’ diagram, assists in visualizing causes
and their effects. It can give insight into the hidden root causes behind a
problem so that proper preventive measures can be taken. The head of the
fish, facing right, represents the effect. Ribs branch off the backbone for
the primary causes, with sub-branches for the root causes of each primary
cause.
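
The Pareto ranking used to select the top categories can be sketched as follows. The function name `pareto_top` and the incident counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the plant's figures.

```python
def pareto_top(counts, k):
    """Rank categories by count (descending) and return the top k names
    together with the share of all documents that they cover."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    top = ranked[:k]
    share = sum(count for _, count in top) / total
    return [name for name, _ in top], share


# Hypothetical counts for illustration only.
incident_counts = {
    "Fire/Explosion": 6,
    "Slip/Trip/Fall": 40,
    "Process Incidents": 4,
    "Road Accident": 30,
    "Material Handling": 20,
}
top_categories, share = pareto_top(incident_counts, 3)
print(top_categories, share)
```

With these made-up counts, the three largest categories cover 90 of 100 documents, which is the kind of cumulative share a Pareto chart makes visible.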
To obtain the root causes behind each primary cause, the preprocessed
narrative texts corresponding to each primary cause are clustered separately.
For example, the preprocessed narratives of documents whose primary cause is
‘Slip/Trip/Fall’ are clustered into k clusters, and each of the k clusters is
analysed to find the root causes behind ‘Slip/Trip/Fall’. Documents with high
cosine similarity [23] are placed in the same cluster. The agglomerative
hierarchical clustering (HC) approach [24] is used to perform the clustering.
HC starts with each document in a separate cluster; the closest clusters are
then merged repeatedly according to a linkage criterion. The Silhouette index
(SI) [25] is used to find the optimal number of clusters for the narratives
of each primary cause separately. SI measures the similarity of an object to
its own cluster and its dissimilarity to other clusters. It ranges over
[−1, 1], and a higher SI is better.
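
The clustering procedure described above can be sketched in plain Python as cosine distance, agglomerative merging, and selection of k by the average Silhouette index. The paper does not state which linkage criterion was used, so average linkage is an assumption here, and the four toy tf-idf vectors are invented for the example.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity for two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative(dist, k):
    """Average-linkage agglomerative clustering on a distance matrix.
    Starts with singletons and merges the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def silhouette(dist, clusters):
    """Average silhouette index; needs at least two clusters.
    Points in singleton clusters are scored 0."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for i in cluster:
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to own cluster, b: mean distance to nearest other cluster
            a = sum(dist[i][j] for j in cluster if j != i) / (len(cluster) - 1)
            b = min(
                sum(dist[i][j] for j in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy vectors (invented): two ladder-related and two bike-related narratives.
docs = [
    {"ladder": 1.0, "slip": 1.0},
    {"ladder": 1.0, "fall": 1.0},
    {"bike": 1.0, "skid": 1.0},
    {"bike": 1.0, "road": 1.0},
]
dist = [[cosine_distance(u, v) for v in docs] for u in docs]
best_k = max(range(2, 4), key=lambda k: silhouette(dist, agglomerative(dist, k)))
clusters = agglomerative(dist, best_k)
print(best_k, clusters)
```

On this toy input the ladder narratives and the bike narratives end up in separate clusters, and k = 2 gives a higher average SI than k = 3.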
Fig. 2: Pareto chart of primary cause of incident.
2.3 Accident narrative classification
Text classification assigns a label to a text document using supervised ML
techniques trained on a collection of labeled documents. The aim is to
predict the ‘Primary Cause’ labels of documents by applying classification
algorithms to the preprocessed narratives. To improve the consistency and
efficiency of the classification, different classification algorithms are
tested with different tokenization methods, such as uni-gram and bi-gram. The
performance of the classification algorithms is evaluated using precision,
recall and F1-score [26].
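
As a small illustration of the two ingredients named above, the sketch below builds uni-grams and bi-grams from a token list and computes precision, recall and F1-score for one class. The helper names, tokens and label lists are made up for the example.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1-score for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


tokens = ["bike", "skid", "on", "road"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Invented labels: one 'Fall' document is wrongly predicted as 'Road'.
y_true = ["Fall", "Fall", "Road", "Road", "Fire"]
y_pred = ["Fall", "Road", "Road", "Road", "Fire"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "Road")
print(bigrams, precision, recall, f1)
```

For the class ‘Road’ there are two true positives and one false positive, so precision is 2/3, recall is 1 and the F1-score is 0.8.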
3 Results and Discussions
3.1 Root cause analysis by text-document clustering

The ‘Brief description’ field of an incident is used as its narrative text.
The preprocessed narrative texts of each ‘Primary Cause’ category were
clustered separately to obtain the root causes behind each ‘Primary Cause’,
which helps to understand the patterns. Pareto analysis indicates that
‘Slip/Trip/Fall’, ‘Road Accident’, ‘Material Handling’, ‘Fire/Explosion’ and
‘Process Incidents’ are the top five ‘Primary Cause’ categories, which
together represent nearly 65% of the reported incidents. Only these top five
categories are clustered separately to find their root causes. From the
clustering output, the average SI value is computed for each of the top five
‘Primary Cause’ categories separately. The plots of the average SI value for
‘Slip/Trip/Fall’, ‘Road Accident’, ‘Material Handling’, ‘Fire/Explosion’, and
‘Process Incidents’ for different numbers of clusters are shown in
Figs. 3a-3e.
Silhouette analysis is used to measure clustering performance and to
determine the optimal number of clusters. A high score implies that documents
within a cluster are similar, whereas documents in different clusters are
dissimilar. According to Fig. 3, 10 clusters are optimal for documents whose
‘Primary Cause’ label is ‘Slip/Trip/Fall’. Similarly, 6 clusters are optimal
for ‘Road Accident’ documents, 4 for ‘Material Handling’, 5 for
‘Fire/Explosion’ and 5 for ‘Process Incidents’. Each cluster of each ‘Primary
Cause’ is analyzed and its top key terms are extracted from the preprocessed
narrative text, as shown in Table 1. These terms differentiate the clusters
and point to the possible root causes behind each ‘Primary Cause’.
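
The paper does not say how the top key terms of a cluster were ranked. One plausible approach, shown here purely as an assumption, is to sum the tf-idf weights of each term over the documents in a cluster and keep the highest-scoring terms. The vectors and the helper name `top_terms` are invented for the example.

```python
from collections import defaultdict


def top_terms(vectors, cluster, n_terms=3):
    """Rank terms by total tf-idf weight over one cluster's documents.
    Within a cluster this gives the same order as ranking by mean weight."""
    totals = defaultdict(float)
    for idx in cluster:
        for term, weight in vectors[idx].items():
            totals[term] += weight
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n_terms]]


# Invented tf-idf vectors for three documents.
vectors = [
    {"ladder": 0.6, "slip": 0.3},
    {"ladder": 0.5, "fall": 0.4},
    {"bike": 0.7, "skid": 0.2},
]
key_terms = top_terms(vectors, cluster=[0, 1], n_terms=2)
print(key_terms)
```

A safety analyst would still need to read the ranked terms and assign a description to each cluster by hand, as in the right-hand column of Table 1.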
The top key terms of each cluster for each ‘Primary Cause’ are analyzed to
understand how they contributed to the incident, and from them the root
causes behind each ‘Primary Cause’ are identified. Using these root causes, a
CE diagram is drawn, as shown in Fig. 4, so that proper corrective measures
can be taken. Some of the findings and recommendations obtained from these
analyses are:
(i) ‘Slip/Trip/Fall’ incidents occurred while climbing ladders. This points
    to ladders with slippery steps, poor footwear, or staircases positioned
    awkwardly for climbing. Slipping is also caused by occasional spills and
    by wet and oily surfaces. Proper housekeeping is necessary for
    preventive measures such as better footwear, improved flooring, and
    instruction in safe walking techniques to be effective.
(ii) In ‘Slip/Trip/Fall’ incidents, the term “fall” generally refers to a
    worker falling from a height. This includes falls from roofs and
    ladders, and falls down stairs. However, some incidents of falling
    stones are incorrectly labeled as ‘Slip/Trip/Fall’, which may be because
    the person logging the incident does not know the definitions of the
    categories. Measures must be taken to educate the operator.
(iii) ‘Material Handling’ incidents generally take place when employees are
    using vehicles such as trucks and cranes to lift and load heavy
    materials. Workers should be trained to operate these machines properly.
(iv) ‘Road Accident’ incidents mainly involved employees riding bikes on
    their way to the plant or back home. Heavy vehicles such as cranes and
    trucks, and skidding of vehicles, are also responsible for road
    accidents.
(v) ‘Fire/Explosion’ incidents happened due to various factors such as
    welding splatter, overheated electric cables and hot metals. In some
    incidents, leakage of flammable material was found to have caused a
    fire. Measures must be taken to reduce welding splatter, heating of
    cables and leakage of flammable material.
Fig. 3: Silhouette scores for top five Primary Causes: (a) slip/trip/fall, (b) road accident, (c) material handling, (d) fire explosion, (e) process incident.

Table 1: Top key terms for each cluster.

Cluster | Key terms | Description

Material handling
1 | hoist wire rope broken, coil slip from saddle while hoist | hoist wire broken
2 | billets fall, unloading billets | billets fall down while unloading
3 | unload material, dumper topple, breaking piston | toppling of dumper
4 | scrap collection truck, damaged cable | truck incident

Road Incident
1 | brake skidding, bike fell down, road injury | bike incident
2 | shift duty, accident while coming to duty, going way work | accident on the way to duty
3 | collision with bus | bus incident
4 | trailer loaded, wheel driver | heavy vehicle incident
5 | dumper+ operator reversing, injury damage | dumper accident while reversing
6 | duty bike motor fell, hit duty time | bike incident

Fire Explosion
1 | injury burn, cable overheating, caught fire | fire due to cable overheating
2 | Smoke, oven flame, fire brigade | burning oven incident
3 | red hot slabs, burnt injury, high temperature | hot material
4 | slag pit, removal of slag | while slag removal
5 | Welder, flying welding particle | welding splatter

Process Incident
1 | hand injured, sharp edged drill | cut by sharp object
2 | tail piece fall, coal piece | pieces falling during operation
3 | strap of coil broken, cut coil open | broken coil
4 | tension chain belt, sling rope broke | rope broke due to tensioning
5 | slag spillage | slag spillage

Slip/Trip/Fall
1 | contractor, section of ladder missing, slip | staircase incident
2 | fall, coal pieces, roof, piece from crane | pieces falling
3 | mine, loose balance, loading coal | incident at mine
4 | bike skidding, slipping, shift duty | bike incident
5 | near site office, dumper, helmet strip | trip at office
6 | cable jointing, removing checker plates, bolts | incident while material handling
7 | miner, unloading, toppled | slip while unloading
8 | operating rail point, operator grinding | incident while material handling
9 | perlin slips roof top to ground | perlin falls
10 | floor slippery equipment | process incidents

Fig. 4: Cause and effect diagram for extracting root causes.

3.2 Classification of accident narratives
Classification algorithms are used to predict the ‘Primary Cause’ category
from the preprocessed narratives of the documents. Based on the Pareto chart,
documents belonging to the top 14 ‘Primary Cause’ categories are used for
classification; these constitute 90% of the documents in the collection, and
the number of classes is therefore 14. Using sampling, about 25% of the
labeled documents are held out as a test set for comparing the performance of
the different approaches. The performance of each classifier is evaluated on
the test set, and the best-performing classifier is carried forward to the
subsequent experiments aimed at enhancing its performance. In terms of
F1-score, RF achieves an average of 0.63 and linear SVM an average of 0.71.
Since linear SVM performs better than RF, experiments with different
tokenization schemes, namely uni-gram and bi-gram, are performed with linear
SVM to evaluate their effect on its performance. Fig. 5 shows the F1-score of
SVM with unigram and bigram tokenization. From Fig. 5, it is clear that
linear SVM with unigram tokenization performs best in this study. Fig. 6
shows the precision and recall of the best-performing linear SVM model.
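
A comparison of this kind could be set up with scikit-learn roughly as follows. This is a sketch under stated assumptions, not the authors' code: the toy narratives and labels are invented, `ngram_range=(2, 2)` is one possible reading of “bi-gram”, the macro average is one possible reading of “average F1 score”, and all hyperparameters are library defaults.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented narratives and labels, for illustration only.
texts = [
    "worker slipped on wet floor near staircase",
    "employee fell from ladder while climbing",
    "slipped on oily surface and fell down",
    "fell down the stairs after losing balance",
    "bike skidded on road while coming to duty",
    "collision with bus on the way to plant",
    "truck hit bike on road near gate",
    "bike fell on road while going home from duty",
]
labels = ["Slip/Trip/Fall"] * 4 + ["Road Accident"] * 4

# Hold out about 25% of the labeled documents as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

models = {
    "RF unigram": make_pipeline(
        TfidfVectorizer(ngram_range=(1, 1)),
        RandomForestClassifier(n_estimators=100, random_state=0),
    ),
    "SVM unigram": make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC()),
    "SVM bigram": make_pipeline(TfidfVectorizer(ngram_range=(2, 2)), LinearSVC()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = f1_score(y_test, predictions, average="macro", zero_division=0)

for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.2f}")
```

With only eight toy documents the printed scores carry no meaning; the point is the structure of the experiment, in which the same train/test split is reused for every classifier and tokenization so that the F1-scores are comparable.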
Fig. 5: F1 Score of SVM for different tokenisation.
Fig. 6: Precision and recall of SVM with unigram tokenization.
Table 2 shows some of the misclassified labels and the likely reasons. The
most significant reason for misclassification is that the tf-idf
representation does not capture the contextual information of the terms in an
incident narrative. Bi-gram tokenization may retain some contextual
information, but the experiments show that uni-gram tokenization performs
best. As a result, the classifier misclassifies a document when it focuses on
terms that are not linked to the cause of the incident. Another important
reason for misclassification is the similarity between some labels, which
makes it difficult even for operators to classify incidents manually. This
problem occurs for label pairs such as “Rail” and “Derailment”, “Road
Accident” and “Dashing/Collision”, and “Skidding” and “Slip/Trip/Fall”, which
are quite similar.
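
Confusable label pairs of this kind can be surfaced by tallying the (actual, predicted) pairs of the misclassified test documents. The sketch below assumes the true and predicted labels are available as two parallel lists; the sample labels are invented.

```python
from collections import Counter


def confusion_pairs(y_true, y_pred):
    """Count (actual, predicted) pairs for misclassified documents,
    most frequent first."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common()


# Invented labels for illustration only.
y_true = ["Rail", "Rail", "Skidding", "Road Accident"]
y_pred = ["Derailment", "Derailment", "Slip/Trip/Fall", "Road Accident"]
confused = confusion_pairs(y_true, y_pred)
print(confused)
```

Reading a few narratives from the most frequent pairs is then a natural starting point for the kind of qualitative evaluation reported in Table 2.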
Table 2: Qualitative evaluation of misclassification.

Actual class | Predicted class | Qualitative evaluation of sample
Fire/Explosion | Electric Flash | Electrical equipment is mentioned in the narrative for ‘Fire/Explosion’, but the equipment is not responsible for the accident
Road accident | Dashing/Collision | The two types of cluster are similar, and it is difficult to differentiate between the two incident scenarios
Slip/Trip/Fall | Structural integrity | Injury took place in surroundings that were marked as unsafe due to regular structural disintegration

4 Conclusions
This study demonstrates the use of clustering and classification algorithms
in describing the safety issues that an organization faces. The proposed
clustering approach helps to find hidden causal factors and patterns that the
management of the steel plant can use to take proper precautionary measures
to avoid injuries. The study shows that clustering and analyzing accident
narratives provides much deeper insight into the root causes behind
incidents. According to the results in Table 1, ‘Slip/Trip/Fall’ incidents
usually take place on staircases. This information can be used by design and
safety experts to prevent these incidents. Table 1 also shows that bikes are
mainly responsible for ‘Road incidents’, followed by heavy vehicles. Some
‘Primary Cause’ categories, such as ‘Skidding’ and ‘Slip/Trip/Fall’, are
alike and can be combined. Some categories are difficult to tell apart, so
measures must be taken to define the categories clearly and uniquely, which
will help the logger make a quick decision at the time of an incident. This
study also produced results for predicting the ‘Primary Cause’ using
classification algorithms on the preprocessed narrative text. Based on the
F1-scores obtained in the experiments, linear SVM with unigram tokenization
is recommended for classifying the preprocessed narrative text.
One limitation is that the tf-idf representation with unigram tokenization
does not retain contextual information. In future, more specific vocabularies
for the occupational accident domain can be created for proper identification
of terms. The narrative texts can be preprocessed using more advanced methods
to remove unrelated terms. Further studies can be done to control the
variations across different datasets. In addition, an attempt can be made to
determine the optimal number of base learners using the fuzzy concept [27].
For better visualization and quick decisions, a decision support system (DSS)
can also be developed [28]. Text-based Bayesian modeling [29] is also a
promising research area.
References
1. Patel, D.A., Jha, K.N.: An estimate of fatal accidents in Indian construction. In: Proceedings of the 32nd Annual ARCOM Conference. pp. 5–7 (2016)
2. Wells, S., Macdonald, S.: The relationship between alcohol consumption patterns and car, work, sports and home accidents for different age groups. Accident Analysis & Prevention 31(6), 663–665 (1999)
3. Laflamme, L., Menckel, E., Lundholm, L.: The age-related risk of occupational accidents: the case of Swedish iron-ore miners. Accident Analysis & Prevention 28(3), 349–357 (1996)
4. Nag, P., Patel, V.: Work accidents among shiftworkers in industry. International Journal of Industrial Ergonomics 21(3-4), 275–281 (1998)
5. Singh, K., Raj, N., Sahu, S., Behera, R., Sarkar, S., Maiti, J.: Modelling safety of gantry crane operations using Petri nets. International Journal of Injury Control and Safety Promotion (Taylor & Francis) pp. 1–12 (2015)
6. Gautam, S., Maiti, J., Syamsundar, A., Sarkar, S.: Segmented point process models for work system safety analysis. Safety Science (Elsevier) 95, 15–27 (2017)
7. Sarkar, S., Baidya, S., Maiti, J.: Application of rough set theory in accident analysis at work: A case study. In: ICRCICN 2017, IEEE. pp. 245–250 (2017)
8. Sarkar, S., Vinay, S., Pateshwari, V., Maiti, J.: Study of optimized SVM for incident prediction of a steel plant in India. In: INDICON 2017 (IEEE). pp. 1–6. IEEE (2017)
9. Sarkar, S., Patel, A., Madaan, S., Maiti, J.: Prediction of occupational accidents using decision tree approach. In: INDICON 2017 (IEEE). pp. 1–6. IEEE (2017)
10. Sarkar, S., Ejaz, N., Maiti, J.: Application of hybrid clustering technique for pattern extraction of accident at work: A case study of a steel industry. In: 2018 4th International Conference on Recent Advances in Information Technology (RAIT). pp. 1–6. IEEE (2018)
11. Sarkar, S., Raj, R., Vinay, S., Maiti, J., Pratihar, D.K.: An optimization-based decision tree approach for predicting slip-trip-fall accidents at work. Safety Science 118, 57–69 (2019)
12. Sarkar, S., Pateshwari, V., Maiti, J.: Predictive model for incident occurrences in steel plant in India. In: ICCCNT 2017, IEEE. pp. 1–5 (2017)
13. Sarkar, S., Verma, A., Maiti, J.: Prediction of occupational incidents using proactive and reactive data: A data mining approach. In: Industrial Safety Management- 21st Century Perspective of Asia (Springer), pp. 65–79. Springer Singapore (2018)
14. Verma, A., Chatterjee, S., Sarkar, S., Maiti, J.: Data-driven mapping between proactive and reactive measures of occupational safety performance. In: Industrial Safety Management- 21st Century Perspective of Asia (Springer), pp. 53–63. Springer Singapore (2018)
15. Sarkar, S., Vinay, S., Raj, R., Maiti, J., Mitra, P.: Application of optimized machine learning techniques for prediction of occupational accidents. Computers & Operations Research (Elsevier) (2019)
16. Sarkar, S., Vinay, S., Maiti, J.: Text mining based safety risk assessment and prediction of occupational accidents in a steel plant. In: 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT). pp. 439–444. IEEE (2016)
17. Chokor, A., Naganathan, H., Chong, W.K., El Asmar, M.: Analyzing Arizona OSHA injury reports using unsupervised machine learning. Procedia Engineering 145, 1588–1593 (2016)
18. Fragiadakis, N., Tsoukalas, V., Papazoglou, V.: An adaptive neuro-fuzzy inference system (ANFIS) model for assessing occupational risk in the shipbuilding industry. Safety Science 63, 226–235 (2014)
19. Taylor, J.A., Lacovara, A.V., Smith, G.S., Pandian, R., Lehto, M.: Near-miss narratives from the fire service: a Bayesian analysis. Accident Analysis & Prevention 62, 119–129 (2014)
20. Brooks, B.: Shifting the focus of strategic occupational injury prevention: Mining free-text, workers compensation claims data. Safety Science 46(1), 1–21 (2008)
21. Vallmuur, K.: Machine learning approaches to analysing textual injury surveillance data: a systematic review. Accident Analysis & Prevention 79, 41–49 (2015)
22. Sarkar, S., Lohani, A., Maiti, J.: Genetic algorithm-based association rule mining approach towards rule generation of occupational accidents. In: Communications in Computer and Information Science (Springer), vol. 776, pp. 517–530. Springer, Singapore (2017)
23. Huang, A.: Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand. vol. 4, pp. 9–56 (2008)
24. Rokach, L., Maimon, O.: Clustering methods. In: Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer (2005)
25. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987)
26. Powers, D.M.: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation (2011)
27. Sarkar, S., Lakha, V., Ansari, I., Maiti, J.: Supplier selection in uncertain environment: a fuzzy MCDM approach. In: Proceedings of the First International Conference on Intelligent Computing and Communication. pp. 257–266. Springer (2017)
28. Sarkar, S., Chain, M., Nayak, S., Maiti, J.: Decision support system for prediction of occupational accident: A case study from a steel plant. In: Emerging Technologies in Data Mining and Information Security, vol. 813, pp. 787–796. Springer, Singapore (2019)
29. Sarkar, S., Kumar, A., Mohanpuria, S.K., Maiti, J.: Application of Bayesian network model in explaining occupational accidents in a steel industry. In: 2017 Third International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN). pp. 337–392. IEEE (2017)
