
Journal of Safety Research 57 (2016) 71–82

Contents lists available at ScienceDirect

Journal of Safety Research

journal homepage: www.elsevier.com/locate/jsr

Bayesian decision support for coding occupational injury data


Gaurav Nanda a, Kathleen M. Grattan b, MyDzung T. Chu b, Letitia K. Davis b, Mark R. Lehto a,⁎,1

a School of Industrial Engineering, Purdue University, 315 N. Grant Street, West Lafayette, IN 47907-2023, USA
b Massachusetts Department of Public Health, 250 Washington Street, 4th Floor, Boston, MA 02108, USA

⁎ Corresponding author at: Industrial Engineering Discovery-to-Delivery Center, School of Industrial Engineering, Purdue University, West Lafayette, IN, USA. E-mail addresses: gnanda@purdue.edu (G. Nanda), kathleen.grattan@state.ma.us (K.M. Grattan), lehto@purdue.edu (M.R. Lehto).
1 Present/Permanent Address: School of Industrial Engineering, Purdue University, 315 N. Grant Street, West Lafayette, IN 47907-2023, USA.

Article history: Received 1 July 2015; Received in revised form 10 December 2015; Accepted 2 March 2016; Available online 15 March 2016

Keywords: Bayesian models; Narrative analysis; Occupational injury; Text classification; Decision support system

Abstract

Introduction: Studies on autocoding injury data have found that machine learning algorithms perform well for categories that occur frequently but often struggle with rare categories. Therefore, manual coding, although resource-intensive, cannot be eliminated. We propose a Bayesian decision support system to autocode a large portion of the data, filter cases for manual review, and assist human coders by presenting them the top k prediction choices and a confusion matrix of predictions from Bayesian models. Method: We studied the prediction performance of Single-Word (SW) and Two-Word-Sequence (TW) Naïve Bayes models on a sample of data from the 2011 Survey of Occupational Injury and Illness (SOII). We used the agreement in the prediction results of the SW and TW models, and various prediction strength thresholds, for autocoding and for filtering cases for manual review. We also studied the sensitivity of the top k predictions of the SW model, the TW model, and the SW–TW combination, and then compared the accuracy of the manually assigned codes in the SOII data with that of the proposed system. Results: The accuracy of the proposed system, assuming well-trained coders review the subset of only 26% of cases flagged for review, was estimated to be comparable (86.5%) to the accuracy of the original coding of the data set (range: 73%–86.8%). Overall, the TW model had higher sensitivity than the SW model, and the accuracy of the prediction results increased when the two models agreed and for higher prediction strength thresholds. The sensitivity of the top five predictions was 93%. Conclusions: The proposed system seems promising for coding injury data as it offers comparable accuracy with less manual coding. Practical Applications: Accurate and timely coded occupational injury data are useful for surveillance as well as for prevention activities that aim to make workplaces safer.

© 2016 National Safety Council and Elsevier Ltd. All rights reserved.

1. Introduction

Occupational safety and health research and surveillance are essential for the prevention and control of injuries, illnesses, and hazards that occur in the workplace. A systematic analysis of workplace injuries can provide insights about where hazards exist and what interventions might be effective at preventing workplace injuries, illnesses, and fatalities. The Survey of Occupational Injury and Illness (SOII), conducted by the Bureau of Labor Statistics (BLS), is the largest occupational injury survey in the United States; it covers nearly all private-sector industries as well as state and local governments, and provides detailed information on workplace injuries and illnesses (Occupational Safety and Health Statistics Program, 2014; U.S. Department of Labor, 2005). SOII data are coded using the Occupational Injuries and Illnesses Classification System (OIICS) coding scheme, which offers a simple, yet detailed, hierarchical structure for coding injury data (Bondy, Lipscomb, Guarini, & Glazner, 2005; Northwood, Sygnatur, & Windau, 2012; U.S. Department of Labor, 2012). SOII 2011 was the first year in which the updated OIICS version 2.01 was used for coding the data.

The process of coding occupational injury data is currently performed manually by human coders, a time- and resource-consuming task. A promising alternative to manual coding is offered by Bayesian machine learning algorithms such as Naïve Bayes and Fuzzy Bayes, which can learn from a training set of narratives previously coded by experts and predict the cause-of-injury code for an injury narrative with a probability that reflects the confidence of the prediction (Lehto, Marucci-Wellman, & Corns, 2009; Noorinaeini & Lehto, 2006; Taylor, Lacovara, Smith, Pandian, & Lehto, 2014; Wellman, Lehto, Sorock, & Smith, 2004). Other machine learning algorithms such as support vector machines (SVM) and logistic regression have also been found to yield good classification performance for injury classification (Bertke et al., 2016; Chen, Vallmuur, & Nayak, 2015; Measure, 2014). Most machine learning algorithms perform well for binary text classification tasks, but performance declines dramatically as the number of predicted classes increases, partly because there are very few cases for rare categories (Rizzo, Montesi, Fabbri, & Marchesini, 2015; Vallmuur, 2015).

As shown by Rizzo et al., the macro-averaged F1 score (a commonly used measure of accuracy for classification tasks) of an SVM model declines from about 0.88 to 0.75 as the number of classes increases from 10 to 45. Classification of SOII data is a challenging task for any machine learning method because there are about 45 two-digit OIICS event or exposure codes (which are the class labels for this study, and are also referred to as "categories" in this paper), and the distribution of data is heavily skewed, with most injury cases falling under a small number of categories. The small number of training cases for rare categories makes it difficult for a machine learning model to learn and make predictions with good accuracy. Moreover, other factors such as misspellings, abbreviations, and synonyms in the narratives adversely affect prediction performance. Hence, although using machine learning algorithms to predict event codes is an efficient method for coding injury data with good accuracy, human review cannot be eliminated (Wellman et al., 2004).

Semi-automated methods have been suggested as an alternative strategy to: (a) reduce the amount of manual coding required without sacrificing accuracy, and (b) allow expert coders to focus on complicated narratives, thus resulting in a more efficient utilization of their time and resources (Corns, Marucci, & Lehto, 2007). Agreement in prediction results from Fuzzy Bayes and Naïve Bayes models has been used as a strategy for autocoding the agreement cases and filtering the disagreement cases for manual review (Marucci-Wellman, Lehto, & Corns, 2011). In addition to using agreement between different models, prediction strength thresholds have also been explored in a recent study to filter cases that require manual review (Marucci-Wellman, Lehto, & Corns, 2015).

The present study proposes a Bayesian decision support system that autocodes a large portion of cases with high accuracy, filters cases for manual review, and assists human coders by providing them the top five choices for possible event codes. Such a semi-automated top k approach has been found to be helpful for human coders in similar text classification tasks with a large number of categories, such as assigning International Classification of Diseases (ICD) codes to short medical texts (Rizzo et al., 2015). Similar decision support systems developed using Bayesian models have been found to be very effective for predicting the print defect category based on a customer's narrative of the issue, with the actual defect being present in the top five predictions 95% of the time (Leman & Lehto, 2003). Therefore, it seems reasonable to use Bayesian models for developing a decision support system that can help human coders in coding occupational injury data. The detailed methodology used in the study is described in the next section.

2. Methods

As mentioned earlier, many machine learning methods have been proposed. Among these methods, the simple Naïve Bayes model is well-suited for classification of short textual data, especially when the classes have few training cases (Marucci-Wellman et al., 2015; Wang & Manning, 2012). SOII data have a large number of categories, and many categories have very few training cases; hence the use of the Naïve Bayes model seems appropriate. Previous studies on autocoding injury narratives have used Naïve and Fuzzy Bayes models with single words as well as multiple-word sequences and combinations (Wellman et al., 2004; Zhu & Lehto, 1999). Multiple-word combinations and sequences as predictors are likely to have better prediction accuracy than any single word for certain categories. For example, the two-word sequence 'fell-off' is a more accurate predictor of the category 'fall from elevation' than the separate words 'fell' and 'off.' Use of multiple-word combinations and sequences in Naïve Bayes and Fuzzy Bayes classifiers has been found to yield better results than single words for autocoding injury narratives (Corns et al., 2007; Noorinaeini & Lehto, 2006). The Fuzzy Bayes model in particular works better when two or more word combinations are used (Noorinaeini & Lehto, 2006). However, using multiple-word combinations and sequences becomes computationally intensive for large datasets such as SOII because the number of two-or-more-word combinations and sequences becomes very large. Hence, we used two Naïve Bayes models for building the proposed decision support system: (a) Single-Word (SW), and (b) Two-Word-Sequence (TW), which are discussed below. We used the Textminer (TM) software for implementing these models, which has also been used in previous studies related to autocoding injury data (Corns et al., 2007).

2.1. Naïve Bayes model

Each injury narrative can be considered as a vector of j words present in that narrative, that is, n = {n1, n2, ..., nj}. Assuming there are i possible event codes that can be assigned, the set of event codes can be represented as the vector E = {E1, E2, ..., Ei}. The Naïve Bayes model assumes conditional independence of the words in a narrative given the event code, which implies that the probability of a word being present in the narrative depends only on the event code considered and is independent of the remaining terms in the narrative. Using the conditional independence assumption, the probability of assigning a particular event code Ei to the narrative n can be calculated as

P(Ei | n) = ∏j [P(nj | Ei) × P(Ei) / P(nj)]

where P(Ei | n) is the probability of event code category Ei given the set of words in the narrative n, P(nj | Ei) is the probability of word nj given category Ei, P(Ei) is the prior probability of category Ei, and P(nj) is the probability of word nj in the entire keyword list.

In application, P(nj | Ei), P(Ei), and P(nj) are all normally estimated from their frequencies of occurrence in the training set. The probability of an event category is calculated by multiplying the likelihood ratios for each word in the narrative and the prior probability. Even though the conditional independence assumption of the Naïve Bayes model is violated in practice, it yields high accuracy for text classification tasks (Wellman et al., 2004). P(nj | Ei) in this study was estimated using word counts and category frequencies in the training set as

P(nj | Ei) = [count(nj | Ei) + α × count(nj)] / [count(Ei) + α × N]

where count(nj | Ei) = number of times word nj occurs in category Ei, count(nj) = number of times word nj occurs overall, count(Ei) = number of training narratives in category Ei, N = number of training narratives, and α = smoothing constant.

The smoothing constant reduces the weight given to the evidence provided by each term and also keeps P(nj | Ei) from being improperly set to zero when a word is not present in the training examples of a particular category (Lehto et al., 2009; Taylor et al., 2014; Wellman et al., 2004). We chose α = 0.05, which corresponds to a small level of smoothing.

2.2. Agreement of two models and prediction thresholds

Considering the agreement of prediction results from more than one Bayesian model has been shown to be a good strategy to improve the classification accuracy of occupational injury data (Marucci-Wellman et al., 2011, 2015). For example, the classification accuracy for assigning ICD codes based on inpatient discharge summaries improves when the prediction output from three different text classification models, namely K-nearest neighbors, relevance feedback, and Bayesian independence, is combined (Larkey & Croft, 1996). The present work considers the agreement in prediction results from the SW and TW models and analyzes the prediction performance when the models agree. Furthermore, we examined the effect on prediction performance of applying a minimum threshold to the prediction probability output by the Naïve Bayes model.
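To make the scoring in Section 2.1 concrete, the following is a minimal Python sketch of a Single-Word Naïve Bayes scorer that uses the article's product-of-likelihood-ratio expression and the smoothed estimate of P(nj | Ei) with α = 0.05. The whitespace tokenization, data structures, and function names are illustrative assumptions; this is not the Textminer implementation.

```python
from collections import Counter, defaultdict

ALPHA = 0.05  # smoothing constant, as chosen in Section 2.1

def train(narratives, codes):
    """Collect word and category frequencies from previously coded narratives."""
    word_count = Counter()               # count(n_j)
    cat_count = Counter()                # count(E_i), narratives per category
    word_in_cat = defaultdict(Counter)   # count(n_j | E_i)
    for text, code in zip(narratives, codes):
        cat_count[code] += 1
        for w in text.lower().split():   # naive whitespace tokenization (assumption)
            word_count[w] += 1
            word_in_cat[code][w] += 1
    return word_count, cat_count, word_in_cat

def score(narrative, word_count, cat_count, word_in_cat):
    """Return {event_code: score} using the expression P(Ei|n) = prod_j P(nj|Ei)*P(Ei)/P(nj)."""
    n_train = sum(cat_count.values())          # N, number of training narratives
    total_words = sum(word_count.values())
    scores = {}
    for code, n_cases in cat_count.items():
        prior = n_cases / n_train              # P(E_i)
        s = 1.0
        for w in narrative.lower().split():
            if word_count[w] == 0:
                continue                       # ignore words never seen in training
            p_w_given_c = (word_in_cat[code][w] + ALPHA * word_count[w]) / (
                n_cases + ALPHA * n_train)     # smoothed P(n_j | E_i)
            p_w = word_count[w] / total_words  # P(n_j)
            s *= p_w_given_c * prior / p_w
        scores[code] = s
    return scores
```

Ranking the resulting scores, for example with `sorted(scores, key=scores.get, reverse=True)[:5]`, gives the kind of top five candidate list the proposed decision support system would show to human coders.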

2.3. Data preprocessing

For this study, we used a random sample of 50,000 lost-workday cases from the 2011 SOII data for the U.S., provided by BLS and coded according to OIICS version 2.01. The injury narrative field was minimally cleaned, and no effort was made to correct grammatical errors and misspellings. Out of the 50,000 cases, 40,000 cases were randomly selected to be used as the training set for the Naïve Bayes models, of which three cases with missing event codes were excluded. The remaining 10,000 cases constituted the prediction set. The distributions of categories in the training and prediction sets were similar.

2.4. Performance measures

We used sensitivity, positive predictive value (PPV), and F1 score as the measures to evaluate the prediction performance of the different approaches used in this study. These measures have been widely used in previous studies to examine the performance of autocoding injury data (Lehto et al., 2009; Marucci-Wellman et al., 2011). Sensitivity, also known as "recall," is defined as

Sensitivity = TP / (TP + FN)

PPV, also known as "precision," is defined as

PPV = TP / (TP + FP)

where TP = number of true positives, FN = number of false negatives, and FP = number of false positives.

The F1 score considers both sensitivity and PPV, and is defined as

F1 = 2 × (Sensitivity × PPV) / (Sensitivity + PPV)

The BLS-assigned codes were considered as the gold standard (true class), and the predicted codes were compared with the BLS-assigned codes to calculate the sensitivity and PPV values for each category.

2.5. Confusion matrix

In addition to sensitivity and PPV, classification results are evaluated using a two-dimensional confusion matrix, which is often used for displaying multiclass prediction results (Witten & Frank, 2005). A confusion matrix consists of a row and a column for each class. Each cell in the matrix represents the number of cases in the prediction set for which the true class is the row label and the predicted class is the column label. In this representation, good prediction results appear as large values on the diagonal and few, small values off the diagonal. Confusion matrices can reveal trends and patterns in classification results and have also been used to create visualization tools for machine learning (Talbot, Lee, Kapoor, & Tan, 2009). Given the structure of the confusion matrix, we propose that it could be used to help human coders identify the categories that often get misclassified by the machine learning algorithm and might require further attention.
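The per-category measures of Section 2.4 and the confusion matrix of Section 2.5 follow directly from the true and predicted codes. The following is a small Python sketch under the assumption of simple list inputs; the function and variable names are illustrative, not the study's implementation.

```python
from collections import Counter

def evaluate(true_codes, pred_codes):
    """Per-category sensitivity (recall), PPV (precision), F1, and a confusion matrix."""
    confusion = Counter()                       # (true, predicted) -> number of cases
    for t, p in zip(true_codes, pred_codes):
        confusion[(t, p)] += 1

    categories = set(true_codes) | set(pred_codes)
    metrics = {}
    for c in categories:
        tp = confusion[(c, c)]
        fn = sum(v for (t, p), v in confusion.items() if t == c and p != c)
        fp = sum(v for (t, p), v in confusion.items() if t != c and p == c)
        sens = tp / (tp + fn) if tp + fn else 0.0          # Sensitivity = TP / (TP + FN)
        ppv = tp / (tp + fp) if tp + fp else 0.0           # PPV = TP / (TP + FP)
        f1 = 2 * sens * ppv / (sens + ppv) if sens + ppv else 0.0
        metrics[c] = {"sensitivity": sens, "ppv": ppv, "f1": f1}
    return metrics, confusion
```

Scanning the off-diagonal counts of the returned confusion matrix is exactly the kind of check proposed above for spotting category pairs that are frequently confused.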
2.6. Top k predictions

The Naïve Bayes model outputs a prediction probability for each class for each case in the prediction dataset. We ranked the classes based on these prediction probabilities to obtain the list of top k prediction choices output by the SW model, the TW model, and the SW–TW combination. For the SW–TW combination, the prediction results from the SW and TW models were pooled and then ranked based on prediction probabilities to obtain the top k prediction choices. The sensitivities for different values of k (1, 2, 3, 4, and 5) were calculated by examining whether the BLS-assigned event code was present in the list of top k prediction choices. We propose that providing such a list of the top k prediction choices could help human coders in manually assigning event codes.

2.7. Evaluating BLS-assigned and Textminer-predicted codes

During model development, we regarded the BLS-assigned event codes as the gold standard. However, it is well known that the codes assigned by human coders to accident narratives vary between coders; inter-rater agreement varies and is sometimes relatively low for some categories (Marucci-Wellman et al., 2015). In order to study the consistency of manual coding, a sample of 871 cases was taken from the prediction dataset and coded independently by two coders at the Occupational Health Surveillance Program (OHSP) at the Massachusetts Department of Public Health. This sample consisted of two groups: (a) the discordant group (754 cases), where the predictions from the SW and TW models agreed but did not match the BLS-assigned codes, and (b) the concordant group (117 cases), where the event codes predicted by the SW and TW models agreed with the BLS-assigned codes. For both the concordant and discordant groups, the event codes predicted by the SW and TW models agreed, and these are referred to as TM-predicted (Textminer-predicted) codes henceforth.

The two independent OHSP coders were blinded to the BLS-assigned codes, the TM-predicted codes, and the code assigned by the other OHSP coder. The coders were provided all the information from the SOII data needed to assign the event code, and they independently assigned 4-digit event codes to each case in the sample. Once the coding process was complete, the cases where the two coders disagreed at the 4-digit level were identified, and both coders discussed these cases to resolve the codes. Cases for which no agreement could be reached were referred to the Boston Regional Office of BLS, where three experienced staffers reviewed the cases and assigned event codes. The codes assigned by the OHSP coders were considered the gold standard for this sample of 871 cases. The BLS-assigned codes and TM-predicted codes were compared with the codes assigned by the OHSP coders to calculate the sensitivity and PPV for the discordant and concordant groups separately.

3. Results and discussion

In this section, the individual prediction performance of the SW and TW models is discussed first. We then consider prediction performance when the TW and SW models agree, before addressing the use of prediction probability thresholds to improve accuracy. The sensitivity of the top k prediction choices from the SW model, the TW model, and the SW–TW combined model is then presented for k = (1, 2, 3, 4, 5). Finally, we estimate the accuracy of the BLS-assigned codes. Further analysis indicated that using the proposed Bayesian decision support system is likely to result in accuracy comparable to that of the BLS-assigned codes while requiring less than half of the cases to be manually coded.

3.1. SW and TW model predictions

The overall sensitivity on the prediction set was 66% for the SW model and 69% for the TW model. The sensitivity, PPV, and F1 score of the SW and TW models on the prediction set for each event code category are presented in Table 1. As shown in Table 1, each measure of accuracy varies between categories; there is large variation in the measures of accuracy for smaller categories (with fewer cases), while there is some tendency for larger categories (with more cases) to show better accuracy. This is also illustrated in Fig. 1(a) and (b), where the F1 score of each category is plotted against the frequency (number of cases) of the category in Fig. 1(a) for the SW model and in Fig. 1(b) for the TW model.

Table 1
Category-wise sensitivity, PPV, and F1 score for SW and TW models (Column ‘N’ represents the number of cases in the prediction set for that event code. Columns ‘SW Sen’ and ‘TW Sen’
represent SW and TW model sensitivities, respectively. Columns ‘SW PPV’ and ‘TW PPV’ represent SW and TW model PPVs, respectively. Columns ‘SW F1’ and ‘TW F1’ represent SW and
TW model F1 scores, respectively).

Event code Event code description N SW Sen SW PPV SW F1 TW Sen TW PPV TW F1

10 Violence and other injuries by persons or animals, unspecified 2 0.00 — 0.00 0.00 — 0.00
11 Intentional injury by person 214 0.74 0.54 0.62 0.64 0.55 0.59
12 Injury by person—unintentional or intent unknown 237 0.46 0.37 0.41 0.38 0.36 0.37
13 Animal- and insect-related incidents 64 0.86 0.82 0.84 0.70 0.83 0.76
20 Transportation incident, unspecified 2 0.00 — 0.00 0.00 — 0.00
21 Aircraft incidents 1 0.00 — 0.00 0.00 — 0.00
22 Rail vehicle incidents 0 0.00 — 0.00 0.00 — 0.00
23 Animal and other non-motorized vehicle transportation incidents 8 0.13 1.00 0.23 0.00 — 0.00
24 Pedestrian vehicle incident 54 0.28 0.48 0.35 0.50 0.42 0.46
25 Water vehicle incidents 3 0.00 0.00 0.00 0.00 — 0.00
26 Roadway incidents involving motorized land vehicle 287 0.96 0.70 0.81 0.94 0.79 0.86
27 Non-roadway incidents involving motorized land vehicles 68 0.28 0.17 0.21 0.26 0.22 0.24
29 Transportation incident, n.e.c. 0 0.00 — 0.00 0.00 — 0.00
31 Fires 7 0.00 — 0.00 0.00 0.00 0.00
32 Explosions 4 0.25 1.00 0.40 0.25 1.00 0.40
40 Fall, slip, trip, unspecified 41 0.00 0.00 0.00 0.00 0.00 0.00
41 Slip or trip without fall 404 0.18 0.43 0.25 0.31 0.45 0.37
42 Falls on same level 1701 0.78 0.71 0.74 0.82 0.73 0.77
43 Falls to lower level 392 0.73 0.39 0.51 0.74 0.45 0.56
44 Jumps to lower level 16 0.06 0.17 0.09 0.25 0.29 0.27
45 Fall or jump curtailed by personal fall arrest system 2 0.00 — 0.00 0.00 — 0.00
49 Fall, slip, trip, n.e.c. 1 0.00 — 0.00 0.00 — 0.00
50 Exposure to harmful substances or environments, unspecified 8 0.00 — 0.00 0.00 — 0.00
51 Exposure to electricity 17 0.82 1.00 0.90 0.59 0.83 0.69
52 Exposure to radiation and noise 9 0.22 0.50 0.31 0.44 0.80 0.57
53 Exposure to temperature extremes 158 0.95 0.76 0.84 0.84 0.78 0.81
54 Exposure to air and water pressure change 0 0.00 — 0.00 0.00 — 0.00
55 Exposure to other harmful substances 182 0.91 0.70 0.79 0.71 0.77 0.74
56 Exposure to oxygen deficiency, n.e.c. 1 0.00 — 0.00 0.00 — 0.00
57 Exposure to traumatic or stressful event, n.e.c. 21 0.57 0.48 0.52 0.38 0.73 0.50
59 Exposure to harmful substances or environments, n.e.c. 4 0.00 — 0.00 0.00 0.00 0.00
60 Contact with objects and equipment, unspecified 36 0.03 1.00 0.06 0.00 — 0.00
61 Needlestick without exposure to harmful substance 6 0.33 0.67 0.44 0.33 1.00 0.50
62 Struck by object or equipment, unspecified 1171 0.65 0.67 0.66 0.68 0.74 0.71
63 Struck against object or equipment 494 0.28 0.53 0.37 0.39 0.60 0.47
64 Caught in or compressed by equipment or objects 369 0.78 0.45 0.57 0.72 0.54 0.62
65 Struck, caught, or crushed in collapsing structure, equipment, or material 1 0.00 — 0.00 0.00 — 0.00
66 Rubbed or abraded by friction or pressure 59 0.51 0.51 0.51 0.68 0.44 0.53
67 Rubbed, abraded, or jarred by vibration 8 0.00 — 0.00 0.00 — 0.00
69 Contact with objects and equipment, n.e.c. 8 0.00 — 0.00 0.00 — 0.00
70 Overexertion and bodily reaction 88 0.01 0.06 0.02 0.01 0.05 0.02
71 Overexertion involving outside sources 2620 0.80 0.87 0.83 0.84 0.84 0.84
72 Repetitive motions involving microtasks 365 0.83 0.67 0.74 0.82 0.69 0.75
73 Other exertions or bodily reactions 745 0.45 0.63 0.53 0.52 0.66 0.58
74 Bodily conditions, n.e.c. 22 0.09 0.40 0.15 0.23 0.33 0.27
78 Multiple types of overexertion and bodily reactions 10 0.00 0.00 0.00 0.00 0.00 0.00
79 Overexertion and bodily reaction and exertion, n.e.c. 18 0.00 0.00 0.00 0.00 0.00 0.00
99 Non-classifiable 72 0.08 0.20 0.11 0.19 0.28 0.23

Please note that, as shown in Table 1, the categories vary greatly in frequency (number of cases), ranging from 1 to 2620. Fig. 1(a) and (b) show that the high-frequency categories have relatively higher F1 scores and most of the low-frequency categories have inferior F1 scores, except for a few categories such as 13 (animal related), 51 (electricity), and 53 (temperature extremes), which have high F1 scores. These categories (13, 51, and 53) are very distinct in their definitions and may have very specific words in the narrative that are strong predictors, thereby resulting in high F1 scores even with few cases.

The sensitivity and PPV, along with the percentage of cases in the prediction dataset, for the 10 highest-frequency categories are presented graphically in Fig. 2 for the SW model and in Fig. 3 for the TW model. As shown in Figs. 2 and 3, both the SW and TW models performed particularly well for the following categories: 26 (roadway MVA), 42 (fall same level), 71 (overexertion—outside sources), and 72 (repetitive motion), with relatively good sensitivity and PPV values. The TW model performed better than the SW model for categories 41, 42, 43, 62, 63, and 73, with both sensitivity and PPV being higher for the TW model. For categories 26 and 64, the TW model had lower sensitivity but higher PPV as compared to the SW model.

For some of the categories, such as 43 (fall lower level) and 64 (caught in object), the sensitivity is relatively high but the PPV is low for both SW and TW models, which indicates a higher number of false positive cases. This might happen because these categories are closely related to other, more frequent categories. For example, categories 43 (fall lower level) and 41 (slip trip without fall) each have about 4% of cases, and both are closely related to category 42 (fall same level), which is a relatively large category with 12% of cases. Because of the considerably higher number of cases in category 42 as compared to categories 41 and 43, the latter often get misclassified as 42 because they have almost the same words in the narrative. Of the 404 cases of category 41 in the prediction set, misclassification as category 42 occurred in 210 instances for the SW model and in 192 instances for the TW model. This effect is also illustrated in Fig. 4, where the percentages of predictions of the true category, related categories, and non-related categories are presented for the SW and TW models for all high-frequency categories except 26 (since 26 is not closely related to any other high-frequency category). Related categories are those where the cause of injury is similar, such as falls (41, 42, and 43), struck (61, 62, and 63), and overexertion (71, 72, and 73).

Fig. 1. (a) Plot of category F1 score vs number of cases in category for predictions of SW model. (b) Plot of category F1 score vs number of cases in category for predictions of TW model.

Categories 64 (caught in object), 62 (struck by), and 63 (struck against) are closely related and often get misclassified together, as shown in Fig. 4. The TW model yielded higher PPV and sensitivity for categories 62 and 63 as compared to the SW model. Although these improvements were modest, the differentiating two-word sequences may explain to some extent why the TW model performed better. For example, the TW model's treatment of word sequences such as 'struck against' or 'bumped against' as one term in the training narratives precludes them from being considered as the separate single words 'struck' and 'against.' This is important because the word sequences 'struck against' and 'bumped against' are very strong predictors of category 63, thus helping the TW model keep category 63 (with 5% of cases) from being misclassified as the larger related category 62 (with 12% of cases). Of the 494 cases of category 63 in the prediction set, it was misclassified as category 62 in 139 cases by the SW model and in 91 cases by the TW model.

The high percentage of predictions of the true category and closely related categories by the Naïve Bayes models demonstrates that the models are not making unreasonable mistakes. A challenging aspect is distinguishing two closely related categories, which the model can handle if there are enough training cases available for all closely related categories so that it can learn the fine differences between them. However, there are many closely related categories that do not have enough training cases for the model to learn, which reinforces the point that human review and manual coding are particularly important for smaller categories.

3.2. Agreement in prediction results of SW and TW models: semi-automated combined strategies

The prediction results from both the SW and TW models were examined to identify the cases where the two models predicted the same event code (also referred to as the 'agreement approach'). Out of the 10,000 prediction cases, the predictions agreed for 7366 cases, of which 5859 were predicted correctly. The PPV for the 7366 cases where the two models agreed (80%) was much higher than the PPV of the SW (66%) and TW (69%) models. Further analysis showed that the PPV and sensitivity for most individual categories were higher when the predictions from the SW and TW models agreed, as compared to the individual SW and TW models.

The cases where the predictions from the SW model and TW model did not agree can be filtered for manual review by expert coders (this strategy is referred to as 'semi-automated combined strategy 1' henceforth). These disagreement cases are probably the more challenging to code, since the two models predicted different categories for the same narrative. This might be explained by various reasons, such as the narrative being complex or ambiguous, or the true category being closely related to another category.

Fig. 2. Percentage of cases, PPV, and sensitivity of SW model for the ten highest frequency categories.

Fig. 3. Percentage of cases, PPV, and sensitivity of TW model for the ten highest frequency categories.

We estimated the overall sensitivity of 'semi-automated combined strategy 1' as follows. Assuming the manual coding by experts to be perfect, the disagreement cases were assigned the original event codes as the predicted codes, and the overall sensitivity was calculated to be 85%. The overall sensitivity of predictions using this strategy was considerably higher than that of the SW (66%) or TW (69%) models, and required only a small portion of cases (26%) to be manually coded. The number of prediction cases, sensitivity, and PPV of individual categories for 'semi-automated combined strategy 1' are shown in the form of a confusion matrix in Fig. 5, where (a) each row label represents the true category, which is the BLS-assigned code in our case; (b) each column label represents the predicted category; (c) diagonal entries depict the correct predictions; (d) the off-diagonal entries show the misclassifications; (e) the elements in the last column represent the sensitivity of individual categories; and (f) the elements in the last row represent the PPV of individual categories. The confusion matrix can help identify which categories frequently get misclassified together and can be a useful tool for human coders. Typically, these categories are closely related to each other and have similar injury narratives. For example, category 31 (fire) had 7 cases in the prediction dataset and was misclassified 6 times as category 53 (temperature extremes). It is worth mentioning that sensitivity and PPV for most categories (particularly rare categories) are significantly better for 'semi-automated combined strategy 1' than for the SW or TW models.

In addition to requiring agreement between the prediction results from the SW and TW models, we also evaluated the effect of applying a minimum prediction strength threshold to the TW model. The approach was to autocode only those cases in the prediction dataset where (a) the two models agreed, and (b) the minimum threshold condition was satisfied. It is intuitive that there should be a tradeoff between sensitivity and the portion of cases being autocoded based on the prediction strength threshold selected. We observed this tradeoff, as illustrated in Fig. 6, where the Y-axis represents (in percentage terms) the sensitivity of autocoded cases and the portion of cases autocoded, using the 'agreement approach' plus a minimum prediction threshold applied to the TW model. When a prediction strength threshold was not applied (i.e., just the 'agreement approach'), 73.7% of cases in the prediction set were autocoded with a PPV of 79.5%, and 26.3% of cases were filtered for manual coding. At a medium prediction strength threshold of 0.5, 71.8% of cases were autocoded with a PPV of 80.2%. At a very high prediction strength threshold of 0.95, only 55.3% of cases were autocoded, but with a considerably higher PPV of 86.1%.

The cases that did not meet the above conditions of agreement between models and minimum prediction strength threshold were filtered to be coded manually by expert coders. This strategy is referred to as 'semi-automated combined strategy 2' henceforth. Assuming that the manual coding will be perfect, the cases filtered for manual review were assigned the original BLS codes as the predicted codes and the sensitivity was measured. For a high prediction strength threshold of 0.95, the overall sensitivity of 'semi-automated combined strategy 2' was 92%, which is considerably higher than the overall sensitivity of 'semi-automated combined strategy 1' (84.9%) but comes at the cost of manually coding about 18% more cases.
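As a sketch of how the two combined strategies route cases, the following Python fragment autocodes a case only when the SW and TW predictions agree and, for strategy 2, only when the TW prediction strength also meets the threshold; all other cases are flagged for manual review together with the top five choices. The field names and the helper itself are hypothetical; only the 0.95 threshold value comes from the analysis above.

```python
def route_case(sw_pred, tw_pred, tw_strength, top5, threshold=None):
    """Return ('autocode', code) or ('manual_review', top5).

    sw_pred, tw_pred : event codes predicted by the SW and TW models
    tw_strength      : prediction probability of the TW model's top choice
    top5             : ranked list of top 5 candidate codes shown to coders
    threshold        : None for strategy 1; e.g. 0.95 for strategy 2
    """
    models_agree = sw_pred == tw_pred
    strong_enough = threshold is None or tw_strength >= threshold
    if models_agree and strong_enough:
        return ("autocode", tw_pred)
    return ("manual_review", top5)

# Strategy 1: agreement only -> autocoded
print(route_case("42", "42", 0.61, ["42", "41", "43", "73", "99"]))
# Strategy 2: agreement plus a 0.95 prediction strength threshold -> manual review
print(route_case("42", "42", 0.61, ["42", "41", "43", "73", "99"], threshold=0.95))
```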

Fig. 4. Percentage of predictions of true category, related categories, and non-related categories.

Fig. 5. Confusion matrix for 'semi-automated combined strategy 1'.

The sensitivity and PPV of individual categories, in the form of a confusion matrix for 'semi-automated combined strategy 2', are available in Appendix 1.

3.3. Top k predictions

The overall sensitivities of the top k predictions for the SW model, the TW model, and the SW–TW combined model for different values of k (= 1, 2, 3, 4, 5) are presented in Table 2 and Fig. 7. The sensitivity values were calculated by counting as 'correct predictions' those cases in the prediction set for which the BLS-assigned event code was present in the top k predictions.

As shown in Table 2, the overall sensitivity increases with k for all the models. This effect is larger for smaller values of k; for example, for the SW–TW combined model, there is a steep increase in sensitivity of 14% between the top 1 and top 2 predictions, but the increase between the top 4 and top 5 predictions is only about 2%. This pattern is more clearly represented in Fig. 7. Comparing the performance of the individual models, the SW–TW combined model performs consistently better than the SW and TW models.

Fig. 6. Sensitivity and percentage of autocoded cases for different prediction strength thresholds.

Table 2
Overall sensitivity of top k predictions for SW, TW, and SW–TW combined models.

Top k SW (%) TW (%) SW–TW combined (%)
Top 1 66.14 69.05 70.24
Top 2 81.70 82.82 84.16
Top 3 87.46 87.60 89.14
Top 4 90.61 90.19 91.48
Top 5 92.37 91.41 93.30

The results show similar levels of performance for the SW and TW models. The TW model performs slightly better than the SW model for smaller values of k, but the SW model performs better for higher values of k.

Given the relatively high sensitivity of the top 5 choices, it is reasonable to assume that manual coding performance would improve if the human coders were presented the list of top 5 options. As mentioned in the Introduction, this approach has been found to be useful for similar tasks in other domains, such as manually assigning ICD codes based on medical notes. The high sensitivity (92.4%) of the top 5 predictions for the SW model indicates that even a simple and computationally efficient SW Naïve Bayes model can be used to present informative top 5 choices to human coders to improve accuracy.
3.4. Coding accuracy

In order to examine the consistency of the BLS-assigned codes and the computer-predicted codes, two independent OHSP coders manually coded a sample of 871 cases from the prediction set, which had 752 cases from the discordant group (where the SW and TW models agreed with each other but not with the BLS-assigned codes) and 117 cases from the concordant group (where the SW and TW model predictions both agreed with the BLS-assigned codes).

For the concordant group, the BLS-assigned codes agreed with the OHSP manually assigned 'gold standard' codes for 111 out of 117 cases (i.e., the overall agreement was 94.8%). The agreement percentage for individual categories is presented in Table 3. Such high levels of agreement were expected, as the codes predicted by the SW and TW models agreed with the BLS-assigned codes, indicating that the cases in the concordant group were not very challenging to code.

Results for the discordant group are presented in Table 4, revealing very similar levels of consistency for the BLS-assigned codes and the TM-predicted codes (when SW and TW model predictions agree). The overall sensitivity was 39.6% for the TM-predicted codes and 42.3% for the BLS-assigned codes. The discordant group cases are probably more challenging to code, since different approaches predict different categories for the same narrative. Hence, the overall low sensitivity of the BLS-assigned and TM-predicted codes was not surprising.

In the discordant group, nine categories had 25 or more cases (12, 41, 42, 43, 62, 63, 64, 71, and 73) and accounted for 77% of the cases. As shown in Table 4, sensitivity and PPV for most of these categories were less than 0.5 for both BLS-assigned and TM-predicted codes.

Several findings for the TM-predicted codes suggested potential systematic coding errors or issues for some categories, such as the fall-related categories (41, 42, 43), struck-by (62), struck-against (63), and caught-in (64). TM-predicted codes did well in capturing category 43 (falls to a lower level), with a sensitivity of 73.7%, but there were many false positives (PPV 28.9%). In turn, TM-predicted codes did poorly in capturing category 41 (slips and trips without falls), with a sensitivity of only 14.5%, but a high percentage of the cases predicted as this code were confirmed as true cases (PPV 72.7%). These juxtapositions of performance measures for closely related codes suggest some miscoding within the fall-related categories. The same point applies to the struck-related categories 62, 63, and 64.

BLS-assigned codes tended to be inconsistent (both sensitivity and PPV below 30%) for category 12 (injury by person, unintentional or intent unknown) and also for category 64 (caught in). Consistency for category 43 (falls to a lower level) was also low, with a sensitivity of 21.1% and PPV of 42.1%. For most of the rare categories (≤25 cases), BLS-assigned codes had better consistency than TM-predicted codes, but these findings are based on small numbers.

3.4.1. Accuracy of BLS-assigned codes

Based on the above results, we estimated the accuracy of the BLS-assigned codes for this dataset as follows. We divided the prediction set into three parts: concordant (where SW = TW = BLS), discordant (where SW = TW ≠ BLS), and neither-concordant-nor-discordant (where SW ≠ TW). In our prediction dataset of 10,000 cases, the SW and TW models agree for 7366 cases, among which 5859 cases belong to the concordant group and 1507 cases to the discordant group. The remaining 2634 cases belong to the neither-concordant-nor-discordant group.

As shown in Table 3 for the concordant group, P(BLS = True | Concordant) = 94.8%. For the discordant group, P(BLS = True | Discordant) = 42.3% and P(TM = True | Discordant) = 39.6%, as shown in Table 4. Using these results, P(BLS = True | Concordant or Discordant) can be estimated by taking a weighted average over the 5859 concordant and 1507 discordant cases. That is,

P(BLS = True | Concordant or Discordant) = (5859 × 94.8% + 1507 × 42.3%) / 7366 = 84.06%

Then, making the optimistic assumption P(BLS = True | Neither Concordant nor Discordant) = P(BLS = True | Concordant) = 94.8% for the 2634 cases where the SW and TW models did not agree, and taking a weighted average, the overall P(BLS = True) can be estimated as

P(BLS = True) = (7366 × 84.06% + 2634 × 94.8%) / 10,000 = 86.88%

This optimistic estimate can be treated as the upper bound of the accuracy of the BLS-assigned codes.

Fig. 7. Overall sensitivity of top k predictions of SW, TW, and SW–TW combined models for different values of k.

Table 3
Agreement between concordant case event codes and gold standard OHSP manually assigned event codes.

Event code Event description Total number of cases Number of agreements % Agree

11 Intentional injury by person 1 0 0%


12 Injury by person—unintentional or intent unknown 2 2 100%
13 Animal- and insect-related incidents 2 2 100%
26 Roadway incidents involving motorized land vehicle 3 3 100%
41 Slip or trip without fall 2 1 50%
42 Falls on same level 27 27 100%
43 Falls to lower level 2 2 100%
51 Exposure to electricity 1 1 100%
53 Exposure to temperature extremes 3 3 100%
55 Exposure to other harmful substances 2 2 100%
62 Struck by object or equipment, unspecified 13 12 92.3%
63 Struck against object or equipment 2 2 100%
64 Caught in or compressed by equipment/objects 4 3 75%
66 Rubbed or abraded by friction or pressure 1 1 100%
70 Overexertion and bodily reaction, unspecified 1 0 0%
71 Overexertion involving outside sources 41 40 97.5%
72 Repetitive motions involving micro-tasks 6 6 100%
73 Other exertions or bodily reactions 4 4 100%
Total 117 111 94.8%

The cases where the prediction results from the SW and TW models do not agree will most likely not be as straightforward as the concordant group. Making the pessimistic assumption that they are as challenging as the discordant group, P(BLS = True | Neither Concordant nor Discordant) = P(BLS = True | Discordant) = 42.3%, and taking a weighted average, the overall P(BLS = True) can be estimated as

P(BLS = True) = (7366 × 84.06% + 2634 × 42.3%) / 10,000 = 73.06%

This pessimistic estimate can be treated as the lower bound of the accuracy of the BLS-assigned codes. From the above calculations, the estimated accuracy of the BLS-assigned codes lies in the range [73%–86.88%].

3.4.2. Accuracy of semi-automated coding

If 'semi-automated combined strategy 1' is used, the proposed Bayesian decision support system will autocode the cases where the predictions from the SW and TW models agree, and the disagreement cases will be coded manually by expert human coders.

Table 4
Sensitivity and PPV for TM-predicted and BLS-assigned 2-digit OIICS event codes using OHSP manually assigned codes as gold standard. Column ‘N’ represents the number of cases in the
prediction set for that event code.

Event code Description N TM Sensitivity TM PPV BLS Sensitivity BLS PPV

11 Intentional injury by person 18 77.8 58.3 22.2 33.3


12 Injury by person—unintentional or intent unknown 25 64 42.1 28 20
13 Animal- and insect-related incidents 5 40 100 60 100
24 Pedestrian vehicle incident 10 30 60 70 100
26 Roadway incidents involving motorized land vehicle 10 90 40.9 10 100
27 Non-roadway incidents involving motorized land vehicles 7 28.6 25 57.1 30.8
31 Fires 1 0 100 100
40 Fall, slip, trip, unspecified 11 0 9.1 6.3
41 Slip or trip without fall 110 14.5 72.7 61.8 59.6
42 Falls on same level 99 54.5 33.8 33.3 51.6
43 Falls to lower level 38 73.7 28.9 21.1 42.1
44 Jumps to lower level 5 20 100 80 57.1
45 Fall or jump curtailed by personal fall arrest system 1 0 100 100
51 Exposure to electricity 2 0 100 100
52 Exposure to radiation and noise 2 0 100 100
53 Exposure to temperature extremes 2 50 11.1 50 33.3
55 Exposure to other harmful substances 11 63.6 63.6 27.3 50
57 Exposure to traumatic or stressful event, n.e.c. 1 100 100 0 0
60 Contact with objects and equipment, unspecified 7 0 28.6 16.7
62 Struck by object or equipment, unspecified 57 35.1 28.2 54.4 41.3
63 Struck against object or equipment 51 11.8 60 66.7 43
64 Caught in or compressed by equipment or objects 25 72 29.5 20 26.3
66 Rubbed or abraded by friction or pressure 13 84.6 91.7 15.4 66.7
67 Rubbed, abraded, or jarred by vibration 2 0 0 0
70 Overexertion and bodily reaction 22 0 9.1 8.3
71 Overexertion involving outside sources 113 53.1 49.2 37.2 58.3
72 Repetitive motions involving microtasks 18 66.7 35.3 22.2 21.1
73 Other exertions or bodily reactions 60 28.3 42.5 67.1 37
74 Bodily conditions, n.e.c. 6 0 50 60
78 Multiple types of overexertion and bodily reactions 11 0 0 0
99 Nonclassifiable 9 0 0 66.7 35.3
Total 752 39.6 42.3

We have assumed that these coders are well-trained experts, and that they would also be helped by a list of the top 5 prediction choices and the confusion matrix. The accuracy of this strategy can be estimated as follows.

As shown in Table 3, P(TM = True | Concordant) = 94.8%, and P(TM = True | Discordant) = 39.6%, as presented in Table 4. Then, taking the weighted average over the concordant (5859) and discordant (1507) cases, and assuming that the manual coding done by the expert human coders (on the remaining 2634 cases) would be perfect, P(TM = True) can be estimated as

P(TM = True) = (5859 × 94.8% + 1507 × 39.6% + 2634 × 100%) / 10,000 = 87.85%

which is about 1% better than the estimated upper bound of the accuracy of the BLS-assigned codes (86.88%) and involves a significantly smaller amount of manual coding (26%).

Instead of assuming that the manually assigned codes will be 100% correct, even if we assume their accuracy to be 94.8%, which is P(BLS = True | Concordant), the overall sensitivity can be estimated as

P(TM = True) = (5859 × 94.8% + 1507 × 39.6% + 2634 × 94.8%) / 10,000 = 86.48%

which is almost the same as the estimated upper bound of the accuracy of the BLS-assigned codes (86.8%), with the expert coders having to code only 26% of the cases.

'Semi-automated combined strategy 2' can also be used by selecting different levels of prediction strength thresholds, but it is difficult to estimate the overall accuracy for that approach since the manual coding exercise was carried out only for the concordant and discordant groups. However, it is intuitive that a higher prediction strength threshold would filter more cases for manual review, and assuming the accuracy of the expert coders to be at least 94.8%, the overall accuracy would be slightly higher than that of 'semi-automated combined strategy 1'. The estimated accuracy values suggest that a Bayesian decision support system can be used for coding occupational injury data with good accuracy while requiring a reduced amount of manual coding.
4. Conclusions

Results from this study show that a semi-automated Bayesian decision support system that autocodes a large portion of the data with good accuracy and leaves a small portion of cases to be coded by a few well-trained expert coders may yield accuracy comparable to an entirely manual coding system for occupational injury data. Hence, the use of such a Bayesian decision support system seems promising for coding event information based on injury narratives from large occupational injury databases.

The SW and TW Naïve Bayes models used for building the decision support system yielded reasonably good prediction performance, with the TW model being slightly better. The PPV of the prediction cases where the SW and TW models agreed was higher than that of the individual models and was similar to levels reported in other studies applying Naïve Bayes models for coding injury narratives in workers' compensation records. Use of an additional prediction strength threshold along with agreement in the prediction results of the SW and TW models exhibited a tradeoff between the accuracy of autocoding and the proportion of cases being autocoded, but even with a high threshold of 0.95, more than half of the cases (55.3%) could be autocoded and, assuming perfect manual coding of the filtered cases, an overall sensitivity of 92% could be achieved.

Among the event code categories, rare categories that are closely related to other, larger categories remain a challenge to autocode, as these categories are consistently misclassified by different models with high prediction strengths. A possible approach to handling rare categories could be to use the confusion matrix as a filtering tool to examine the cases that are likely to be misclassified by autocoding.

We also examined the consistency of BLS-assigned and TM-predicted event codes by comparing them with codes assigned by two independent OHSP coders and observed some discrepancies. The high level of agreement for the concordant group, while based on small numbers, was reassuring. The higher inconsistency in coding of the discordant group for both the TM-predicted and BLS-assigned codes provided some insight into specific coding challenges to be addressed in both manual coding and autocoding.

This study is also important because it demonstrates the feasibility of autocoding SOII data according to the new version 2.01 of the OIICS coding scheme, which has about 45 two-digit event codes, more than in most previous studies in this area. One of the strengths of this study was the large size of the training data set compared to that used in prior studies of coding occupational injury data. Notably, even with this larger training set, there were very few training cases for rare event codes.

The effect of incorporating coded information on the nature of injury and body part into the model(s) should also be explored. Future work should also be directed toward using other machine learning models, such as support vector machines and logistic regression, in combination with Bayesian methods. Each of these models has different underlying mathematical principles, and looking at the agreement between their prediction results may lead to improved performance for individual categories as well as overall PPV.

5. Practical applications

Occupational injury survey data are often used for surveillance and other purposes by various groups such as government agencies, policymakers, safety standards writers, insurance companies, and manufacturers of safety equipment. Use of the proposed Bayesian decision support system may result in faster and comparably accurate coding of occupational injury data as compared to manual coding, and will also provide assistance to manual coders. Accurately and timely coded occupational injury data can help quickly identify the prevalent causes of injuries in the workplace, and thus support planning of remedial action or revision of safety standards. Early implementation of revised safety measures by organizations can prevent occupational injuries and save lives.

Acknowledgements

This study was conducted at the Massachusetts Department of Public Health Occupational Health Surveillance Program (OHSP) in collaboration with Purdue University. This study was funded through a cooperative agreement with the Bureau of Labor Statistics: OS 24725-13-75-J-25. The authors would like to express their gratitude to Sangwoo Tak, ScD, MPH and James R. Laing, B.S. from the Massachusetts Department of Public Health for their contributions to this study.

Appendix 1. Confusion matrix for 'semi-automated combined strategy 2' (autocoding of cases where (a) the prediction results of the SW and TW models agree, and (b) the prediction strength of the TW model is > 0.95; manual coding of the remaining cases). Rows represent the BLS-assigned event codes.

References

Bertke, S. J., Meyers, A. R., Wurzelbacher, S. J., Measure, A., Lampl, M. P., & Robins, D. (2016). Comparison of methods for auto-coding causation of injury narratives. Accident Analysis & Prevention, 88, 117–123. http://dx.doi.org/10.1016/j.aap.2015.12.006

Bondy, J., Lipscomb, H., Guarini, K., & Glazner, J. E. (2005). Methods for using narrative text from injury reports to identify factors contributing to construction injury. American Journal of Industrial Medicine, 48(5), 373–380. http://dx.doi.org/10.1002/ajim.20228

Chen, L., Vallmuur, K., & Nayak, R. (2015). Injury narrative text classification using factorization model. BMC Medical Informatics and Decision Making, 15(Suppl. 1), S5. http://dx.doi.org/10.1186/1472-6947-15-S1-S5

Corns, H. L., Marucci, H. R., & Lehto, M. R. (2007). Development of an approach for optimizing the accuracy of classifying claims narratives using a machine learning tool (Textminer[4]). In M. J. Smith & G. Salvendy (Eds.), (pp. 411–416). Berlin Heidelberg: Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-3-540-73345-4_47

Larkey, L. S., & Croft, W. B. (1996). Combining classifiers in text categorization. New York, NY, USA: ACM, 289–297. http://dx.doi.org/10.1145/243199.243276

Lehto, M., Marucci-Wellman, H., & Corns, H. (2009). Bayesian methods: A useful tool for classifying injury narratives into cause groups. Injury Prevention, 15(4), 259–265. http://dx.doi.org/10.1136/ip.2008.021337

Leman, S., & Lehto, M. R. (2003). Interactive decision support system to predict print quality. Ergonomics, 46(1–3), 52–67. http://dx.doi.org/10.1080/00140130303531

Marucci-Wellman, H. R., Lehto, M. R., & Corns, H. L. (2015). A practical tool for public health surveillance: Semi-automated coding of short injury narratives from large administrative databases using Naïve Bayes algorithms. Accident Analysis and Prevention, 84, 165–176. http://dx.doi.org/10.1016/j.aap.2015.06.014

Marucci-Wellman, H., Lehto, M., & Corns, H. (2011). A combined Fuzzy and Naive Bayesian strategy can be used to assign event codes to injury narratives. Injury Prevention, 17(6), 407–414. http://dx.doi.org/10.1136/ip.2010.030593

Measure, A. C. (2014). Automated coding of worker injury narratives (Joint Statistical Meetings 2014 – Government Statistics Section). Boston, MA, USA: U.S. Bureau of Labor Statistics. Retrieved from http://www.bls.gov/osmr/pdf/st140040.pdf

Noorinaeini, A., & Lehto, M. R. (2006). Hybrid singular value decomposition: A model of human text classification. International Journal of Human Factors Modelling and Simulation, 1(1), 95–118. Retrieved from http://inderscience.metapress.com/content/4JDTJNQ7EJDNGUD1

Northwood, J. M., Sygnatur, E. F., & Windau, J. A. (2012). Updated BLS occupational injury and illness classification system. Monthly Labor Review, 19. Retrieved from http://wwwn.cdc.gov/wisards/oiics/Doc/UpdatedOIICSNorthwoodetal2012.pdf

Occupational Safety and Health Statistics Program (2014, January). Retrieved March 14, 2014, from http://www.mass.gov/lwd/labor-standards/occupational-safety-and-health-statistics-program/

Rizzo, S. G., Montesi, D., Fabbri, A., & Marchesini, G. (2015). ICD code retrieval: Novel approach for assisted disease classification. In N. Ashish & J.-L. Ambite (Eds.), (pp. 147–161). Springer International Publishing. Retrieved from http://link.springer.com/chapter/10.1007/978-3-319-21843-4_12

Talbot, J., Lee, B., Kapoor, A., & Tan, D. S. (2009). EnsembleMatrix: Interactive visualization to support machine learning with multiple classifiers. New York, NY, USA: ACM, 1283–1292. http://dx.doi.org/10.1145/1518701.1518895

Taylor, J. A., Lacovara, A. V., Smith, G. S., Pandian, R., & Lehto, M. (2014). Near-miss narratives from the fire service: A Bayesian analysis. Accident Analysis & Prevention, 62, 119–129. http://dx.doi.org/10.1016/j.aap.2013.09.012

U.S. Department of Labor (2005). OSHA recordkeeping handbook. Retrieved from https://www.wisconsin.edu/workers-compensation/download/frequently_used_guidance/OSHA1904Recordkeepingpub3245rev.pdf

U.S. Department of Labor, W. D. C. (2012). Bureau of Labor Statistics, Occupational injury and illness classification manual, version 2.01. Retrieved from http://wwwn.cdc.gov/wisards/oiics/Doc/OIICS Manual 2012 v201.pdf

Vallmuur, K. (2015). Machine learning approaches to analysing textual injury surveillance data: A systematic review. Accident Analysis & Prevention, 79, 41–49. http://dx.doi.org/10.1016/j.aap.2015.03.018

Wang, S., & Manning, C. D. (2012). Baselines and bigrams: Simple, good sentiment and topic classification. Stroudsburg, PA, USA: Association for Computational Linguistics, 90–94. Retrieved from http://dl.acm.org/citation.cfm?id=2390665.2390688

Wellman, H. M., Lehto, M. R., Sorock, G. S., & Smith, G. S. (2004). Computerized coding of injury narrative data from the National Health Interview Survey. Accident Analysis & Prevention, 36(2), 165–171. http://dx.doi.org/10.1016/S0001-4575(02)00146-X

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann. Retrieved from https://books.google.com/books?id=QTnOcZJzlUoC

Zhu, W., & Lehto, M. R. (1999). Decision support for indexing and retrieval of information in hypertext systems. International Journal of Human Computer Interaction, 11(4), 349–371. http://dx.doi.org/10.1207/S15327590IJHC1104_5

Gaurav Nanda is a Ph.D. candidate in the School of Industrial Engineering at Purdue University. His research area is the use of machine learning methods and natural language processing techniques, along with domain knowledge-based rules, to improve the autocoding accuracy of injury data for public health and occupational injury surveys.

Kathleen Grattan, MPH is an applied epidemiologist with the Massachusetts Department of Public Health (MDPH) in the Occupational Health Surveillance Program and has coordinated a wide range of surveillance and research projects that involve the management, analysis, and summary of large administrative and survey datasets, including hospital data, workers' compensation data, and data from the Survey of Occupational Injuries and Illness (SOII).

MyDzung Chu, MSPH is an epidemiologist with the Massachusetts Department of Public Health (MDPH) in the Occupational Health Surveillance and the Health Survey Programs. She started at MDPH as a CDC/CSTE Applied Epidemiology Fellow and has led geographic analyses of work-related injuries using ACS data and analysis of workers' compensation data for local public sector workers; she also coordinates the state's Youth Health Survey.

Letitia Davis, ScD, EdM is the Director of the Occupational Health Surveillance Program, Massachusetts Department of Public Health (MDPH) and has worked over many years to develop state-based surveillance systems for work-related injuries, illnesses, and hazards. She has overseen the formation of a comprehensive surveillance system for fatal occupational injuries, the Massachusetts Sharps Injury Surveillance System, a surveillance system for work-related asthma, the Massachusetts Occupational Lead Registry, and a model surveillance system for work-related injuries to young workers.

Mark Lehto, Ph.D., is a Professor at the School of Industrial Engineering and the Director of the Industrial Engineering Discovery-to-Delivery Center at Purdue University. His research interests include text mining, safety engineering, decision support systems, and human factors. He has taught and developed several different undergraduate and graduate courses within the School of Industrial Engineering, including classes on Safety Engineering, Engineering Economics, Industrial Ergonomics, and Work Design.
