
Pergamon

Accid. Anal. and Prev., Vol. 26, No. 5, pp. 571-581, 1994
Copyright © 1994 Elsevier Science Ltd
Printed in the USA. All rights reserved
0001-4575/94 $6.00 + .00

0001-4575(94)E0017-F

EVALUATION OF DRIVER DROWSINESS BY TRAINED RATERS

W. W. Wierwille and L. A. Ellsworth

Vehicle Analysis and Simulation Laboratory, Virginia Polytechnic Institute and State University,
Blacksburg, VA, U.S.A.

(Accepted 11 January 1994)

Abstract—Drowsiness of vehicle operators is a major hazard in transportation systems, and methods need
to be developed for practical evaluation of drowsiness level. One suggested approach is observer rating.
Accordingly, an experiment was carried out using trained observer-raters to evaluate the levels of drowsiness
of drivers whose faces were recorded on videotape. Videotaped segments of drivers at various stages
of drowsiness were presented in two sessions separated by a time interval of one week. The experiment was
directed at determining test-retest reliability, interrater reliability, intrarater reliability, and consistency.
Results indicate that such ratings are reliable and consistent. A subsequent experiment shows that ratings
covary with other known indicators of drowsiness.

Keywords-Alertness, Driver performance, Drowsiness, Fatigue

INTRODUCTION

Drowsiness of vehicle operators is believed to be a serious problem in transportation systems and a direct or contributing cause of many accidents. Tilley, Erwin, and Gianturco (1973), for example, found, by means of questionnaires administered to 1,500 drivers receiving license renewals, that 64% had at one time or another become drowsy while driving. Seven percent of the drivers in the survey indicated that they had had an accident involving drowsiness, and another seven percent indicated they had had a near miss resulting from drowsiness. Because of the hazard that drowsiness presents in transportation systems, methods need to be developed for operationally assessing and counteracting its effects.

This paper describes a method of assessing the level of drowsiness based on a rating system. The input information for the rating is a video image of the vehicle operator's face. Recently, while performing simulator experiments on sleep-deprived drivers, it was found that the experimenters conducting the experiments could estimate the level of drowsiness of the drivers based on characteristics such as facial tone, slow eyelid closure, and mannerisms (rubbing, yawning, nodding, etc.). Of course, such an approach is anecdotal and requires scientific examination. Sufficient promise existed that it was decided to conduct a formal experiment in which observer ratings of drowsiness would be assessed. The observer-rating approach appears to have been largely overlooked by researchers.

Before describing the experiment conducted, it is worthwhile to discuss how observer ratings of drowsiness could be used, assuming they are scientifically worthy. There are several potential applications. First, there is the possibility of a "black box" recorder in which a videotape is made of the operator's face. Modern cameras can operate at very low light levels, so that darkness does not cause difficulties. In addition, modern cameras can be made very small so that they are unobtrusive. If an accident occurs, the tape could be retrieved and assessed for drowsiness. Such an approach could be used in gaining accurate records of accident causes and in developing work schedules and other countermeasures that minimize the likelihood of such accidents. In another application, e.g. in transportation systems of limited range, such as subways, video images could be sent by land line to an observer station. The trained observer could then re-alert any one of several operators who appears to be having drowsiness difficulties, or, if necessary, take other action to minimize the chances of an accident. Yet another very important application is that of serving as an independent assessment variable. The development of any drowsiness detection device or system requires an independent variable that assesses level of drowsiness and against which the device or system can be checked.

Traditionally, EEG (electroencephalogram) waveforms and estimation of sleep stages have been employed for this purpose. While the EEG approach is worthwhile, it is usually difficult to use in practice.

Because of the potential applications of observer ratings of drowsiness, an experiment was conducted to determine whether the concept has scientific merit. This paper describes the experiment and the results in detail. A subsequent experiment was also run. Its purpose was to determine the degree of correlation of observer ratings with other supposed indicators of drowsiness. High correlations would represent a form of validity of observer ratings as a measure of drowsiness. This latter experiment is summarized in the present paper.

RESEARCH LITERATURE

Most of the relevant literature has been concerned with ratings by the subjects themselves (Ogilvie, Wilkinson, and Allison 1989; Yoshitake 1971, 1978; Rosa et al. 1985). These studies presented the subjects with scales or questionnaires to fill out either during or after completion of an experiment. The issue addressed was the feeling of fatigue at a given time during the experiment. Most of the studies have used the Stanford Sleepiness Scale (which is a 7-point Likert-type scale) or the fatigue rating scale developed by Yoshitake (1971, 1978). The study by Ogilvie et al. (1989) found that there was a high correlation between the subjective index (the Stanford Sleepiness Scale) and behavioral and physiological variables. The study further suggests that the subjective index would be a useful identifier of progress towards early phases of sleep (Ogilvie et al. 1989). Another study found a high correlation between subjective ratings (using the Stanford Sleepiness Scale and questionnaires) and performance measures (Rosa et al. 1985). Two types of cognitive tasks were used as indicators of performance: grammatical reasoning and digit addition. Again, these studies used ratings given by the subjects as opposed to using ratings given by observers.

One study was found in which observer rating of drowsiness was studied (Carroll, Bliwise, and Dement 1989). In particular, interrater reliability of observations was examined. This study used two individuals to observe the disturbances of the sleep-wake cycle in 39 nursing home residents. The observations were made during daytime and nighttime. The results exhibited a high interrater reliability when observing the sleep-wake cycle. The study results suggest that use of behavioral observations is an appropriate approach to studying the sleep-wake cycle in nursing home settings.

There has been one known previous application of "facial expression" to drowsiness research. In their paper on drowsiness warning devices, Yabuta et al. (1985) used a three-level ordinal scale of facial expression as one component in assessing alertness. The other components were brain-wave evaluation and amount of blinking. The combined assessment, called an alertness index, was used as an independent variable in simulator tests directed at devising a drowsiness detection system. It appears that Yabuta et al. (1985) did not evaluate their facial expression ratings in terms of reliability.

The literature does thus suggest the notion of using observers for assessment of drowsiness-related decrements, and, indeed, one study has used a coarse scale of facial expression. However, none of the previous studies provides an adequate basis for assessing the reliability and consistency of observer ratings of drowsiness for use in transportation systems.

METHOD

Raters
Six individuals (three males and three females) volunteered to participate in this study. They were graduate students in the Human Factors Engineering program at Virginia Polytechnic Institute and State University. These students were chosen because of their familiarity with rating procedures and human factors methodology. (It was assumed that persons performing drowsiness evaluations in research or in applications would have received behavioral training.) Each individual participated in two separate sessions, each lasting approximately one hour.

Apparatus
Previous experiments involving drowsy drivers had been performed in the Vehicle Analysis and Simulation Laboratory, in which low-light level video recordings of the drivers' faces had been made. The videotapes were retained for archival purposes and were available for use in the present study. The tapes showed drivers driving a computer-controlled, moving-base driving simulator, and contained episodes of a variety of levels of apparent drowsiness. Thus, segments of the tapes could be transferred to new master tapes for use in the present experiment. Two such tapes were made for the experiment. The tapes were played on a high-quality video recorder and 20-inch monitor. (The video image was monochrome.)

Fig. 1. Rating scale used in the experiment. (Each rating form carried a segment number and a continuous horizontal line anchored by the descriptors Not Drowsy, Slightly Drowsy, Moderately Drowsy, Very Drowsy, and Extremely Drowsy.)

A new rating scale was developed for this experiment. It was a form of the Likert (descriptive graphics) scale. The continuous scale contained five descriptors: Not Drowsy, Slightly Drowsy, Moderately Drowsy, Very Drowsy, and Extremely Drowsy. One scale was used for each segment to be evaluated, totaling 48 scales (24 scales for each session). The scale is shown in Fig. 1. Each copy of the scale was provided on a separate slip of paper, approximately 22 cm wide and 7 cm high, to minimize influences from previous ratings.

A Macintosh II personal computer was used to analyze the data from this experiment. The SuperANOVA computer software package, version 1.11 (Abacus Concepts, Inc., Berkeley, CA), and the EXCEL computer software package, version 3.0 (Microsoft Corporation, Redmond, WA), were used to perform statistical analyses of the resulting data.

Experimental design
The experimental design used in this study was a single-factor complete factorial design. The single factor was rater. This main factor (with six levels) was treated as the independent variable. By treating rater as the independent variable, each cell of the experimental design contained 48 replications of the rating task. In this experimental design, rater was a within-rating-task variable rather than rating-task being a within-rater variable. The dependent variables were the raw error rating scores and the absolute error rating scores. In both cases, errors were defined as differences from the mean across raters. There were 48 scores per experimental cell, giving a total of 288 data points.

The 48 segments to be rated were divided into two groups and then recorded onto two video tapes (24 segments per tape). The segments represented various levels of alertness/drowsiness and were assigned to a location on the tape. One tape was presented during the first session, and the other tape was presented during the second session, which occurred approximately one week after the first session. A counterbalanced design was used in which half the raters received Tape A first, followed by Tape B, while the other half of the raters received Tape B first, followed by Tape A. Raters 1, 2, and 3 (two males and one female) received Tape A, then Tape B, while Raters 4, 5, and 6 (two females and one male) received Tape B, then Tape A. The purpose in using two video tapes, separated by one week, was to allow the determination of test-retest reliability.

Intrarater reliability. On each of the tapes, three of the segments were repeated on that same tape. On Tape A, Segments 3, 8, and 10 were repeated as Segments 23, 18, and 19, respectively. Likewise, on Tape B, Segments 3, 8, and 10 were repeated as Segments 23, 18, and 19, respectively. The three segments repeated on Tape A were different from those repeated on Tape B. During a particular session, the rater was thus exposed to the three segments twice. The repetition of segments gave six sets (pairs) of scores for each rater which were used to determine intrarater reliability.

Test-retest reliability. In addition to the repetition of three of the segments within each session, three different segments from the first session were repeated in the second session. Segments 5, 12, and 20 from Tape A were repeated as Segments 20, 12, and 5, respectively, on Tape B. Therefore, each rater was exposed to these three segments a second time during Session 2. This procedure of repeating segments gave three pairs of scores per rater for use in determining test-retest reliability.

Interrater reliability. To determine interrater reliability, all repeated segments and the first segment (the practice segment) were temporarily deleted from the data. Only those segments that were not repeated were used in the statistical analysis. After deleting the repeated segments, there remained 28 segments per rater (14 segments from each session).
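The repeated-segment design fixes how many reliability pairs each analysis receives. As a concrete tally, here is a minimal Python sketch (our own illustration; the segment numbers come from the text above, the variable names are ours):

```python
# Within-tape repeats described in the text: original segment -> its repeat.
intra_repeats = {
    "Tape A": {3: 23, 8: 18, 10: 19},
    "Tape B": {3: 23, 8: 18, 10: 19},  # same positions; the text notes the
}                                      # repeated footage differed between tapes
# Cross-session repeats: Tape A segment -> its repeat position on Tape B.
retest_repeats = {5: 20, 12: 12, 20: 5}

N_RATERS = 6
intra_pairs = sum(len(reps) for reps in intra_repeats.values())  # 6 per rater
retest_pairs = len(retest_repeats)                               # 3 per rater

print(intra_pairs * N_RATERS)   # 36 intrarater pairs, as stated in the analysis
print(retest_pairs * N_RATERS)  # 18 test-retest pairs
```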
Fig. 2. Raw rating scores as a function of segment mean rank. (Scatter plot of each rater's raw scores, 0-100, against the rank of the segment means, with a separate symbol for each of the six raters.)

Procedure
On the first day of the experiment, the rater was asked to read the general instructions for the experiment. These instructions described the nature of the experiment, the tasks to be performed, and the approximate length and timing of the two sessions. The instructions made it clear that the rating scale was a continuous one and that the rater could place a rating anywhere on the scale, not just at one of the descriptors. Once the instructions were read, the rater was asked to read the informed consent form and sign the form if he or she agreed to the conditions of the study. Any questions concerning the instructions, the informed consent form, or the experiment in general were answered. The rater was then seated in front of the videorecorder and monitor. At this point the experimenter reviewed the instructions and gave additional instructions. These additional instructions included showing the rater the rating forms, giving examples of how to mark the scales correctly, and providing the rater with the "Description of Drowsiness Continuum" form. This form contained a description of the various levels of drowsiness and gave an idea of the characteristics to look for when rating the segments. This form is shown in Appendix A. The rater read the description form before the experiment began and was also allowed to refer to the description form during the experiment. Once all questions had been answered, the first experimental session began.

When the rater returned after approximately one week for the second session, the written instructions were offered for review. Once the instructions were reviewed, the experimenter asked the rater to review the Description of Drowsiness Continuum form.

Experimental task procedures. The rating task consisted of viewing 24 segments of different drivers at various levels of drowsiness for each session and subjectively rating each segment on its corresponding rating-scale form. When the experimental session began, the first videotaped image appeared on the screen. A short time thereafter, a recorded voice instructed the rater to begin the evaluation for that segment. For example, at the beginning of Segment 1, the rater heard "Begin, Segment 1." This command informed the rater that the evaluation period for Segment 1 had begun. The rater observed the videotaped driver until a second voice command, "End, Segment 1," was given. (The length of the evaluation period was one minute.) The "End" command informed the rater that the evaluation period was over and that a rating on the scale was to be provided. The rater observed the beginning of the videotaped image prior to the "Begin, Segment ___" command, but was instructed to only rate the interval between the "Begin" and "End" commands.

Fig. 3. Intrarater reliability (first exposure vs. second exposure).

After the "End" command was given, the segment continued for 15 seconds, but the rater did not evaluate this section. Once the 15 seconds had elapsed, the screen went blank for 10 seconds before the next segment appeared. The rater used the last 15 seconds of the segment and the 10 seconds of blank screen between the segments (totaling 25 seconds) to provide a rating. If this amount of time was insufficient, the rater asked the experimenter (who sat behind the rater) to pause the tape until the evaluation was completed. This pausing technique allowed the rater to refer to the Description of Drowsiness Continuum sheet. Once the rating was accomplished, the experimenter restarted the tape. The rater could also change an answer if desired, but only if the rating was changed before the next segment started. (Only the current segment rating could be changed.) The rater was not permitted to go back to a previous segment to change a rating. The sequence continued until all 24 segments had been rated.

RESULTS

The first step in analyzing the data was to convert the ratings on the rating scales to numerical values. This task was accomplished by converting the scale to a 100-point scale and then measuring the location of each given rating. Zero was at the left end of the scale and 100 was at the right end (Fig. 1). The second step was to pair the repeated segments to perform the correlations on these data. The third step was to separate the paired data points and the practice segment from the original data set. The final step was to convert the original data scores into raw error rating scores and absolute error rating scores. The mean of each segment (across raters) was determined and ranked from low to high. Each rater's raw data were then plotted against this ranking, as shown in Fig. 2. The graph suggests that there was little, if any, error of central tendency in the experiment. It can be seen from the graph that raters rated at both the low end and the high end of the scale as well as near the middle.

The raw error scores were obtained by subtracting each segment's mean score from the score given by each rater. Because there is no numerical (or objective) definition of drowsiness, an independent variable did not exist for this experiment. (This point is discussed in greater detail later in the section entitled "Indications of Validity.") Therefore the rater's scores were compared to the mean segment score to determine consistency of the scores. The absolute error scores were obtained by taking the absolute value of the raw error scores. Thus, 28 raw error scores and 28 absolute error scores were derived for each rater, resulting in a total of 168 raw and absolute error scores across the six raters. Because of the way the error scores were calculated, a positive error score indicated that the segment was overrated compared to the segment mean rating, and a negative error score indicated that the segment was underrated compared to the mean rating for that segment.
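The scoring steps can be restated compactly in code. The sketch below is our own illustration (the original analysis used SuperANOVA and EXCEL); it computes raw and absolute error scores from a segments-by-raters matrix of 0-100 scale positions:

```python
import numpy as np

# ratings[i, j] = 0-100 scale position given by rater j to nonrepeated segment i.
# Placeholder random data stand in for the 28 segments x 6 raters of the study.
rng = np.random.default_rng(0)
ratings = rng.uniform(0, 100, size=(28, 6))

segment_means = ratings.mean(axis=1, keepdims=True)  # mean across raters
raw_error = ratings - segment_means   # positive = overrated vs. segment mean
abs_error = np.abs(raw_error)         # deviation magnitude, sign ignored

print(raw_error.size)     # 168 raw error scores, as in the text
print(abs_error.mean())   # analogue of the mean absolute rating error
```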

Four different correlations and four paired t-tests were performed on the data. The criterion for "acceptability" for the correlations was 0.80. The first correlation compared first exposure to second exposure for the segments that were repeated within a session. As mentioned, three segments were repeated in Session 1 and three segments were repeated in Session 2. Therefore, six pairs of data points per rater existed, giving a total of 36 data pairs with which to perform the correlation. This correlation was used to determine if raters were consistent within themselves when scoring segments during the same time period. The measurement is an indication of intrarater reliability. A paired t-test was also performed.

The second correlation was performed to determine the relationship between first exposure ratings and second exposure ratings from Session 1 only. There were three data pairs per rater (a total of 18 pairs) for this correlation. The third correlation was also used to determine the relationship between first exposure ratings and second exposure ratings, but these data came from Session 2. Again, three data pairs per rater, giving a total of 18 pairs of data, were used to calculate the correlation. Both of these correlations are indications of intrarater reliability, but the sessions were analyzed separately to determine whether a fatigue/learning effect existed. Once computed, the correlations were compared to one another to determine whether a significant difference existed. Separate t-tests were performed on each of the two sets of data.

The fourth correlation was performed to determine test-retest reliability. The three segments from Session 1 were paired with the corresponding repeated segments from Session 2. The three pairs per rater gave a total of 18 pairs with which to determine if raters consistently rated segments at two different points in time (i.e. over a week's period). A fourth t-test was performed on the data to determine whether the difference between pairs was significantly different from zero.

An additional correlation analysis was performed as part of the interrater reliability analysis. It involved correlating the raw ratings of each rater with every other rater, as a means of quantitatively assessing consistency.

Two analyses of variance (ANOVAs) were conducted on the data. The first ANOVA compared the raw error rating scores in each experimental cell to determine whether there were any biases in rater ratings. Positive biases are indicative of a tendency to overrate segments as compared to the mean rating for that segment, while negative biases are indicative of a tendency to underrate segments as compared to the mean rating for that segment. In short, this analysis was used to determine whether the raters rated the segments consistently from one rater to the next. The analysis gave an indication of interrater reliability.

The second ANOVA compared the absolute values of the raw error rating scores in each cell to the mean of the segment. This measure made it possible to determine each rater's score deviation from the mean segment rating. Clearly, absolute error rating scores that are close (or equal) to zero indicate accuracy of rating with respect to the mean, whereas absolute error rating scores that are greater than zero signify less accurate ratings with respect to the mean. Thus, the ANOVA indicated whether a difference existed across raters in score deviations from the mean.

Post-hoc analyses of significant main effects were performed using the Newman-Keuls pairwise comparison technique. This procedure was used to determine exactly which raters were significantly different from one another on the rating task.

Analysis of intrarater reliability
The Pearson r correlation procedure gave a correlation value of 0.88 (t = 10.92, d.f. = 34) for intrarater reliability. This value was significant (p < 0.001). The data used for this procedure are depicted in Fig. 3 and indicate that raters consistently rated the level of drowsiness when asked to rate the same segment twice. The paired t-test gave a value t = 0.032 (d.f. = 35), p > 0.20. Thus, differences within pairs (i.e. differences between the first and second members of each pair) of data were not significantly different from zero.

Session 1 correlation versus Session 2 correlation
The comparison between the two correlations did not show a significant difference. The correlation value for Session 1 was 0.93 and the correlation value for Session 2 was 0.85. Both of these correlations are significant (p < 0.001; t = 9.98, d.f. = 16 for Session 1 and t = 6.35, d.f. = 16 for Session 2). Comparison of the two values indicated that they did not differ significantly from one another (p = 0.1335). This comparison suggests there was no reliable learning effect from Session 1 to Session 2. The t-tests performed on these data were not significant (p > 0.20; t = -0.46, d.f. = 17 for Session 1 and t = 0.33, d.f. = 17 for Session 2). The results of the t-tests indicate that the differences within data pairs in both sets of data were not significantly different from zero.

Analysis of test-retest reliability
The correlation value for test-retest reliability as determined by the Pearson r correlation procedure was 0.81 (t = 5.45, d.f. = 16). This value was significant (p < 0.001) and indicates that raters consistently rated the level of drowsiness when asked to rate the same segment twice with a given period of time (i.e. one week) separating the two exposures. Figure 4 shows a plot of the data. The paired t-test yielded a value t = 0.66 (d.f. = 17), which is not significant (p > 0.20), indicating that the differences within pairs of data were not significantly different from zero.

Table 1. Interrater correlation matrix

Rater            Rater number
number      2      3      4      5      6
1          0.87   0.81   0.68   0.72   0.81
2                 0.85   0.79   0.85   0.91
3                        0.84   0.80   0.86
4                               0.84   0.76
5                                      0.80
(Average correlation = 0.81)

Correlations of nonrepeated segments (interrater reliability)
The overall correlation value for interrater reliability as determined by averaging the individual Pearson r correlations was 0.81 (t = 3.707, d.f. = 26). This value was significant at p < 0.001. The individual paired rater correlations as well as the average correlation are shown in Table 1. The individual rater correlations range from 0.68 to 0.91. These correlations and the average correlation indicate that ratings tend to be consistent between raters. In other words, the ratings from different raters tend to follow the same trend.

Analysis of raw error scores
The ANOVA performed on the raw error scores revealed a significant main effect of Rater (F = 5.159, p = 0.001). This effect indicates that raters demonstrate differential biases when rating the level of alertness/drowsiness. Raters 1, 2, and 5 tended to underrate the level of drowsiness with respect to the mean and Raters 3, 4, and 6 tended to overrate the level of drowsiness with respect to the mean (Fig. 5). Post-hoc analysis using the Newman-Keuls technique (α = 0.05) revealed which raters were significantly different from one another. Rater 3 (mean = 8.17) rated significantly differently from Raters 1 (mean = -6.90), 2 (mean = -3.94), and 5 (mean = -4.65). Rater 3 tended to overrate as compared to the mean while Raters 1, 2, and 5 tended to underrate. Rater 4 (mean = 3.78) rated significantly higher as compared to the mean than Rater 1 (mean = -6.90). Rater 6 (mean = 3.53) rated significantly higher than Raters 1 (mean = -6.90) and 2 (mean = -3.94).

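Table 1 is simply the matrix of rater-by-rater Pearson correlations over the 28 nonrepeated segments, with the off-diagonal entries averaged; a minimal sketch (our illustration, placeholder data):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
ratings = rng.uniform(0, 100, size=(28, 6))  # 28 segments x 6 raters (placeholder)

pairwise = {}
for i, j in combinations(range(6), 2):
    # Pearson r between rater i+1 and rater j+1 across all segments
    pairwise[(i + 1, j + 1)] = np.corrcoef(ratings[:, i], ratings[:, j])[0, 1]

average_r = float(np.mean(list(pairwise.values())))  # the paper's averaged value
print(pairwise[(1, 2)], average_r)
```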
Fig. 4. Test-retest reliability (first exposure vs. second exposure).



Fig. 5. Mean rating error as a function of rater. (Mean ratings having common letters do not differ significantly, α = 0.05.)

Surprisingly, the Newman-Keuls post-hoc test did not show a significant difference between Raters 6 and 5. The test also did not show a significant difference between Rater 4 and Raters 2 or 5. The differences in the means of these raters are greater than other differences in means, which are significant, and seem to be an artifact of the Newman-Keuls post-hoc test. If one rater is removed and the test is readministered using only 5 raters, then the above-mentioned nonsignificant differences become significant. For example, if Rater 2 is removed from the data, then Rater 4 is significantly different from Rater 5. If Rater 5 is removed, Rater 4 becomes significantly different from Rater 2. And, if the test is performed after removing Rater 4, Rater 6 is significantly different from Rater 5. These observations indicate that the above-mentioned raters should probably be considered as significantly different from one another. Accordingly, the raters then fall into two groups that are significantly different from one another. In Fig. 5 the first group is designated by the letter A and the second by the letter B.

Descriptively speaking, the means ranged from -6.90 to 8.17. The average standard deviation across all raters' scores was 12.5. This value is an indication of the spread in bias of ratings that can be expected when informed raters perform ratings on the level of drowsiness.

Analysis of absolute error scores
The ANOVA performed on the absolute error scores revealed no significant effect of Rater, F(5,135) = 1.537, p = 0.1929. This result suggests that raters do not differ significantly in their accuracy when it comes to rating the level of drowsiness. The mean absolute rating error for each rater is depicted graphically in Fig. 6. (The figure is included because it gives an indication of the spread in accuracy of the ratings. The reader is cautioned, however, that differences are not statistically significant.) The average mean absolute rating error across all raters was 10.85. This value is an indication of the expected magnitude of error (from the mean) for a given rater.

DISCUSSION

Interpretation of results
The correlation values for intrarater reliability and for test-retest reliability were greater than 0.80, indicating that raters tended to be consistent within themselves. The intrarater reliability correlation (0.88) was slightly higher than the test-retest reliability correlation (0.81) and suggests that the raters may lose a small amount of consistency over time. However, according to the statistical test to determine if a learning/fatigue effect existed between Session 1 and Session 2, the two correlations were not significantly different.

It is not surprising to find that there was a significant rater main effect in the ANOVA that was performed on the raw error rating scores. However, the previously mentioned study by Carroll et al. (1989) indicated that interrater reliability was high for observers studying the disturbances of the sleep-wake cycle. The rater main effect indicates that raters demonstrate differential biases when rating the level of drowsiness.

Fig. 6. Mean absolute rating error as a function of rater. (Rater 6 = 13.16, Rater 1 = 11.85, Rater 5 = 11.14, Rater 4 = 10.78, Rater 3 = 10.45, Rater 2 = 7.72. Differences are not statistically significant, α = 0.05.)

Variability between raters can most likely be attributed to differences in individual definitions of drowsiness. Even though each rater was provided with the same Description of Drowsiness Continuum form, the interpretations of these descriptions may vary across raters.

According to the analysis of variance performed on the absolute rating error scores, raters' absolute error rating scores were not significantly different from one another. This result suggests that informed raters tend to display the same accuracy when it comes to rating the level of drowsiness.

CONCLUSIONS OF THE STUDY

The results from this study indicate that there is a good degree of consistency among and within raters when rating the level of drowsiness using videotaped segments of drivers' faces. The intrarater reliability and the test-retest reliability indicate that raters are consistent within themselves. Even though the ANOVA of the raw error rating scores showed a significant effect of rater, suggesting that inconsistent biases in ratings exist between raters, one must look at the spread of the means of raw error rating scores as compared to the scale used. The means range from -6.90 to 8.17, giving a spread of approximately 15 points. Very small increments were used for the divisions on the scale used to convert the ratings to numerical values. The distance between any two descriptors on the scale was 25 points. The 15-point spread of means constitutes only 3/5 of the distance between one descriptor and the next. Furthermore, the ANOVA performed on absolute ratings as a function of rater was not significant. Therefore, a good degree of consistency is present between raters when rating the level of drowsiness in this study.

Finally, it is clear that the raters in this study were willing to use the entire scale. They ascribed widely different values to what they observed in the various videotaped segments. These findings, along with the reliability findings, suggest that ratings of drowsiness by informed raters do consistently discriminate between presented conditions.

INDICATIONS OF VALIDITY

The previously described experiment shows that there is consistency and reliability in the ratings produced. However, the experiment does not and cannot indicate the extent to which the raters are rating the "true drowsiness level," since drowsiness is not a precisely or numerically defined quantity. It will be recalled that individual rating errors had to be defined in terms of deviations from the mean of all the raters, because there is no universal definition of drowsiness or drowsiness level. How, then, does one determine the validity of a drowsiness assessment procedure, such as that obtained from the rating process described in this paper? Or, in short, how does one establish validity?

Table 2. Correlations of rater drowsiness ratings with other indicators (after Ellsworth et al. 1993)

PERCLOS  AVECLOS  EYEMEAS  SUBRATE  RTMTHCOR  RTLTCOR  MNALPHA  MNTHETA  ABRATIO  THREOG  MNHRT   MNSQHRT
 0.711    0.911    0.875    0.833    0.322     0.547    0.568    0.567    0.468    0.483  -0.547  -0.525

Note: Indicator measures
PERCLOS: percentage of time that the eyes were more than 80% closed
AVECLOS: mean percent eye closure
EYEMEAS: mean square of percent eye closure
SUBRATE: subject online rating of drowsiness using an adjustable bar-knob control
RTMTHCOR: mean time to correct response in the math task
RTLTCOR: mean time to correct response in the letter search task
MNALPHA: mean amplitude of the EEG alpha wave, measured at the occipital lobe
MNTHETA: mean amplitude of the EEG theta wave, measured at the occipital lobe
ABRATIO: ratio of MNALPHA to mean amplitude of the EEG beta wave, measured at the occipital lobe
THREOG: percent time that the electrooculogram was above a set threshold (indicating eye blink or eye roll or both)
MNHRT: mean of instantaneous pulse rate
MNSQHRT: mean square of instantaneous pulse rate
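For readers unfamiliar with the eye-closure measures, the sketch below (an illustration under our own assumptions, not the instrumentation of Ellsworth et al. 1993) computes a PERCLOS-style value from a sampled closure trace and correlates an indicator with rater ratings, the kind of computation behind each Table 2 entry:

```python
import numpy as np

def perclos(closure, threshold=0.80):
    """Fraction of samples in which the eyes are more than 80% closed."""
    return float(np.mean(np.asarray(closure) > threshold))

# Placeholder traces: per-subject eye-closure fractions sampled over a trial.
rng = np.random.default_rng(3)
traces = [rng.beta(2, 5, size=600) for _ in range(8)]   # 8 subjects
indicator = np.array([perclos(t) for t in traces])
rater_rating = 100 * indicator + rng.normal(0, 5, 8)    # synthetic mean ratings

print(np.corrcoef(indicator, rater_rating)[0, 1])  # analogue of a Table 2 entry
```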

There are several approaches to validity. One approach is to apply the rating procedure to an actual or operational situation and determine whether the procedure "measures what it is supposed to measure" (Ghiselli 1964). This is an application-oriented approach. Another possible approach is to compare the rating procedure to other supposed indicators of drowsiness in a controlled experiment. Such indicators might be physiological, performance based, or subjective. If it can be shown that the candidate assessment method provides results that covary with a variety of other known indicators, then the new method reflects changes associated with the common independent variables.

To provide answers to questions about validity, an additional, new experiment was conducted. Briefly, the experiment involved having sleep-deprived subjects perform alternating letter search and arithmetic tasks on a computer screen while a variety of measures were taken (Ellsworth, Wreggit, and Wierwille 1993). The various measures were then correlated with informed-rater drowsiness ratings using a procedure identical to that described in this paper.

Typical results are shown in Table 2, which is derived from Table 4 of Ellsworth et al. (1993). As can be seen, correlations of rater ratings with eye closure and subject ratings are high, and correlations with physiological and performance measures are moderate. There are results for eight subjects and four raters. Three of the eight subjects did not exhibit any signs of drowsiness whatsoever. When they were eliminated from the data analysis, correlations increased to values greater than those shown in Table 2 (Ellsworth et al. 1993). The results taken together support the validity of rater assessment of drowsiness and suggest that rater assessment is a viable method of drowsiness assessment when a video image of the vehicle operator is obtainable.

Acknowledgements—This research was supported under a cooperative agreement with the National Highway Traffic Safety Administration, Office of Crash Avoidance Research. The authors would like to thank Steven Wreggit, who helped in preparing the videotapes and who made suggestions in the design of the study. The authors wish to thank Ron Knipling, Bob Clarke, Mike Perel, and Mike Goodman of NHTSA for helpful suggestions. The opinions expressed are those of the authors.

REFERENCES

Carroll, J. S.; Bliwise, D. L.; Dement, W. C. A method for checking interobserver reliability in observational sleep studies. Sleep 12: 363-367; 1989.
Ellsworth, L. A.; Wreggit, S. S.; Wierwille, W. W. Research on vehicle-based driver status/performance monitoring; Third semi-annual research report, September 1992 to March 1993. Dept. of Industrial and Systems Engineering Report No. 93-02. Blacksburg, Virginia: Virginia Polytechnic Institute and State University; April 1993.
Ghiselli, E. E. Theory of psychological measurement. New York: McGraw-Hill; 1964.
Ogilvie, R. D.; Wilkinson, R. T.; Allison, S. The detection of sleep onset: Behavioral, physiological, and subjective convergence. Sleep 12: 458-474; 1989.
Rosa, R. R.; Wheeler, D. D.; Warm, J. S.; Colligan, M. J. Extended workdays: Effects on performance and ratings of fatigue and alertness. Behavior Research Methods, Instruments, & Computers 17(1): 6-15; 1985.
Tilley, D. H.; Erwin, C. W.; Gianturco, D. T. Drowsiness and driving: Preliminary report of a population survey. Paper No. 730121. (Presented at the International Automotive Engineering Congress, Detroit, MI, January 1973.) Warrendale, PA: Society of Automotive Engineers; 1973.
Yabuta, K.; Iizuka, H.; Yanagishima, T.; Kataoka, Y.; Seno, T. The development of drowsiness warning devices. In: Proceedings of the Tenth International Technical Conference on Experimental Safety Vehicles. Washington, DC: U.S. Department of Transportation; 1985: 282-288.
Yoshitake, H. Relations between the symptoms and the feeling of fatigue. Ergonomics 14: 175-186; 1971.
Yoshitake, H. Three characteristic patterns of subjective fatigue symptoms. Ergonomics 21(3): 231-233; 1978.

APPENDIX

Description of Drowsiness Continuum

A person who is not drowsy while driving will exhibit behaviors such that an appearance of alertness will be present.

For example, normal facial tone, normal fast eye blinks, and short ordinary glances may be observed. Occasional body movements or gestures may occur.

As an individual becomes drowsy, various behaviors may be exhibited. These behaviors, called mannerisms, may include rubbing the face or eyes, scratching, facial contortions, and moving restlessly in the seat, among others. These actions can be thought of as countermeasures to drowsiness. They occur during the intermediate stages of drowsiness.

Not all individuals exhibit mannerisms during intermediate stages. Some individuals appear more subdued; they may have slower eyelid closures, their facial tone may decrease, they may have a glassy-eyed appearance, and they may stare at a fixed position.

As an individual becomes very drowsy, eyelid closures of, say, 2 to 3 seconds or longer usually occur. This is often accompanied by a rolling upward or a sideways movement of the eyes themselves. The individual may also appear not to be focusing the eyes properly, or may exhibit a cross-eyed (lack of proper vergence) look. Facial tone will probably have decreased. Very drowsy drivers may also exhibit a lack of apparent activity, and there may be large isolated (or punctuating) movements, such as providing a large correction to steering or reorienting the head from a leaning or tilting position.

Drivers who are extremely drowsy are falling asleep and usually exhibit prolonged eyelid closures (4 seconds or more) and similar prolonged periods of lack of activity. There may be large punctuated movements as they transition in and out of intervals of dozing.

Please try to rate each videotape segment you will be viewing by taking into account this description of the drowsiness continuum. However, if you feel that the above description overlooks something of importance or does not properly describe what you are viewing, then supplement the description with your own best judgment in making your rating.
