
Speech Tasks and Interrater Reliability in Perceptual Voice Evaluation

*Fang-Ling Lu and †Samuel Matteson, *†Denton, Texas
Summary: Objective/Hypothesis. The optimal selection of speech tasks is essential for more reliable perceptual ratings and a better understanding of the perceptual qualities of pathologic voices. Nevertheless, researchers have rarely explored this issue using the GRBAS scale. This study investigates the effect of speech task selection on interrater reliability during perceptual voice assessment.
Study Design. Experimental study.
Methods. Sixty subjects, 39 dysphonic subjects and 21 normal controls, performed 13 speech tasks: three 5-second sustained vowel sounds (/a/, /i/, and /u/), each at three pitch levels (high, habitual, and low); maximum phonation of the vowel /a/; a pitch glide; counting from 1 to 10; and oral reading of the Rainbow Passage. A group of 18 graduate students in speech-language pathology served as perceptual judges and rated the dysphonic severity of the speech samples based on three parameters of the GRBAS scale: Grade, Roughness, and Breathiness. The formalism of the AC1 statistic proposed by Gwet was applied to determine relative reliability between the speech tasks and the raters.
Results. The counting task and the sustained vowel /a/ in high, habitual, and low registers exhibited the greatest reproducibility and, consequently, the highest reliability statistic.
Conclusions. The counting task and sustained /a/ phonation are the optimal tasks for perceptual voice judgment with regard to interrater reliability. Future perceptual studies may build on this finding to determine the relationship between speech task selection and the validity of any given perceptual rating system in terms of sensitivity and specificity.
Key Words: Sustained vowel phonation; Contextual speech; Interrater reliability; GRBAS scale.

Accepted for publication January 31, 2014.
Portions of this article were presented at the 2012 National Convention of the American Speech-Language-Hearing Association; Atlanta, Georgia, November 16, 2012.
From the *Department of Speech and Hearing Sciences, University of North Texas, Denton, Texas; and the †Department of Physics, University of North Texas, Denton, Texas.
Address correspondence and reprint requests to Fang-Ling Lu, Department of Speech and Hearing Sciences, University of North Texas, 1155 Union Circle # 305010, Denton, TX 76203-5017. E-mail: flu@unt.edu
Journal of Voice, Vol. 28, No. 6, pp. 725-732
0892-1997/$36.00
© 2014 The Voice Foundation
http://dx.doi.org/10.1016/j.jvoice.2014.01.018
INTRODUCTION
Auditory perceptual assessment plays a vital role in voice evaluation despite its inherent subjectivity and the lingering debate
regarding its reliability and validity.1-7 The most evident
advantages of using the perceptual voice assessment are
accessibility of the test materials and simplicity in implementation procedures. Several popular perceptual evaluation
scales such as the GRBAS scale8 or the CAPE-V system9,10
have been well studied and are readily accessible to
clinicians. Although auditory perceptual measures are often
used as a reference for other objective voice assessment tools
such as acoustic analysis, the link between objective
measures and perceptual assessment of dysphonic voices
remains disappointingly weak and inconclusive due to
intrinsic shortcomings of each measurement system.
Auditory perceptual assessment itself is an intricate process
involving numerous complex interrelated elements, many of
which are subjective by nature and not well understood.
Research studies suggest that judgments of vocal qualities are
inherently unstable and prone to measurement error caused
by many known and unknown variables. A number of factors
such as the listener's professional training in voice disorders, the listener's bias derived from prior knowledge of the speaker's medical or voice history, voice features in a perceptual rating scale, or the type of speech stimuli to be judged are known to have a significant impact on the listener's ability to differentiate between pathologic and nonpathologic voices.1-7,11-18
Researchers have recommended three evidence-based approaches to mitigate rater-related variability, including provision of listening training, limited choice of voice features for
evaluation, and selection of suitable speech stimuli.5,19,20
Although the former two approaches have been extensively
studied, the issue concerning appropriate speech task selection
has not received significant attention.
Studies have shown that inter- and intrarater reliability may
improve when the listeners receive listening training before
testing.11,13,21-23 The level of agreement among listeners may also improve if only three voice features (overall severity, roughness, and breathiness) are judged.7,9,11,12,14,15,20,23,24
When selecting speech stimuli for perceptual evaluation, the
general recommendation is to include both sustained vowels
and contextual speech in testing, given the inherent features
unique to each type of speech sample.25-28 Although
sustained vowel stimuli naturally produce a higher level of
interrater reliability due to their innate stability and
consistency, they are poor representatives of daily voice
usage and are prone to underestimation of the severity of
voice deviance.1,16,21,29-31 On the other hand, although the inherent physiological complexity and naturalness of contextual speech stimuli provide a more accurate estimation of deviant voice quality,16,21,29,30,32 this type of speech sample is inclined to produce lower rater reliability because of its intrinsic variability in speaking pattern (eg, dialect, speaking rate, prosody) and voice quality.30 To date, the selection of speech tasks for perceptual voice assessment has generally been
limited to three speech tasks: sustained phonation of the vowel sounds /a/ and /i/ and a short reading of phonetically balanced texts such as the Rainbow Passage. The field of voice research has heretofore paid little attention to the utility of other speech stimuli for perceptual voice evaluation.
The research evidence from objective measurement studies
using acoustical analyses, electroglottographic measures, or
strobolaryngoscopic examinations has shown significant relationships between speech tasks and the structural configuration
and positioning of the larynx and vibratory pattern of the vocal
folds.33-37 For example, vowel type,33,34,38-46 pitch level,41,47,48 vocal effort,39,47,49,50 vowel- versus text-based context,27,33-36,51-53 utterance length,40,54-56 and speaking rate57,58 are shown to influence vocal tract configuration, voice onsets, or pauses between syllables or words. Based on these findings and the evidence from the foregoing perceptual voice research, it may be safe to postulate that speech features (eg, sustained phonation, pitch, loudness, articulation, speaking rate, and so forth) play a crucial role in shaping a speaker's voice quality and may also have a significant effect on the listener's ability to reliably and accurately judge the speaker's voice quality.33-37
The present study was designed to determine the effect of
speech task selection on interrater reliability in perceptual voice
judgment. The relationship between interrater reliability and
the selection among 13 speech tasks was investigated. The authors hypothesized that the level of agreement between listeners in judging dysphonic and nondysphonic voices could be influenced by the selection of speech samples carrying certain salient features (eg, vowels, contextual speech, various pitch levels, utterance lengths, and so forth), and that such findings could offer a starting point for identifying optimal speech stimuli that provide a respectable level of rater reliability and adequate identification of dysphonia.
MATERIALS AND METHODS
Subjects and voice samples
Voice samples were obtained from 39 dysphonic subjects and
21 normal controls. The dysphonic group consisted of 15 males
and 24 females with an age range of 18-81 years; the average age was 32.9 ± 2.7 years. Subjects of the dysphonic group were diagnosed with a wide range of laryngeal pathologies, which were verified through strobovideolaryngoscopic examinations performed by otolaryngologists or speech-language pathologists. Diagnoses of vocal fold pathology among dysphonic subjects included 13 cases of laryngitis, 13 cases of vocal nodules, five cases of vocal polyp or polyps, three cases of presbylaryngis, two cases of muscle tension dysphonia, one case of spasmodic dysphonia, and two cases of unilateral vocal fold paralysis. The control group consisted of one male and 20 females ranging in age from 21 to 35 years (mean = 25.5 years; standard deviation = 0.9). All control
subjects exhibited normal laryngoscopic findings and reported
no current history of dysphonia. Informed consent was
obtained from all the subjects in the study, and the protocol

Journal of Voice, Vol. 28, No. 6, 2014

was approved by the University of North Texas Institutional Review Board.


Speech samples were recorded in a quiet room. Each subject
was fitted with a headset microphone (TalkPro Xpress Headset;
VXI Corp., Rollinford, NH) coupled with a high-quality digital
audio recorder (Olympus LS-10 Linear PCM recorder;
Olympus Imaging America Inc., Cypress, CA). The microphone was maintained at a distance of 7.5 cm from the subject's
mouth and slightly off center to avoid breath or plosive noise.
The recording volume was monitored continually to maintain
an optimal dynamic range in the recording system to avoid
distortion. Recorded speech samples were saved as .wav files
at 48 000 Hz sampling rate with 16 bits of amplitude resolution.
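For reference, a minimal sketch of an equivalent capture configuration (mono, 48 kHz, 16-bit PCM saved to .wav) using the Python sounddevice and soundfile packages; the packages, the file name, and the fixed 5-second duration are illustrative assumptions, not part of the study protocol:

    import sounddevice as sd
    import soundfile as sf

    FS = 48_000        # 48 kHz sampling rate, as in the study
    DURATION_S = 5     # eg, one 5-second sustained vowel task

    # Record one mono channel at 16-bit amplitude resolution.
    audio = sd.rec(int(DURATION_S * FS), samplerate=FS, channels=1, dtype="int16")
    sd.wait()          # block until the recording completes

    # Save as a .wav file with 16 bits of amplitude resolution.
    sf.write("task_sustained_vowel.wav", audio, FS, subtype="PCM_16")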
During each recording session, the subject performed 13 speech tasks as follows: (Tasks 1-9) 5-second sustained phonation of three vowel sounds (/a/, /i/, and /u/), each at three pitch levels (high, habitual, and low); (Task 10) maximum prolongation of the vowel /a/ in one breath; (Task 11) a pitch glide on the "ah" sound; (Task 12) counting from 1 to 10; and (Task 13) oral reading of the first paragraph of the Rainbow Passage. Before recording, each subject received instructions to perform habitual-pitched phonation of the three vowels, maximum phonation of /a/, counting, and passage reading at a comfortable pitch and loudness level. Subjects were also instructed to maintain a natural pace in counting (10 words) and reading of the Rainbow Passage (100 words). During high- and low-pitched vowel productions, subjects were shown by the investigator (the first author) how to sustain the vowel sounds without reaching the range of falsetto (ie, loft register) or glottal fry (ie, pulse register). For the pitch glide task, subjects were instructed to begin the /a/ vowel at habitual pitch level, followed by an ascending glide to reach the highest pitch level without causing any pitch breaks and then a descending glide to reach the lowest pitch level without producing any glottal fry. The investigator closely monitored each subject's voice volume to avoid unintended loudness shifts during high- or low-pitched phonation. Subjects
were asked to produce each task twice consecutively. In all, each subject produced 26 speech samples (13 tasks × 2 trials).
Each sustained vowel sample was approximately 5 seconds in duration, whereas the other speech samples ranged widely in length across the 60 subjects. The average length for the maximum phonation task was between 5.2 and 29.3 seconds; for the pitch glide task, between 1.2 and 15.3 seconds; for the counting task, between 4.2 and 10.4 seconds, resulting in an average speaking rate of 59-143 words per minute (WPM); and for the Rainbow Passage reading, between 22.4 and 57.1 seconds, resulting in an average speaking rate of 105-268 WPM. Altogether, 120 voice samples were generated by the 60 subjects in each speech task, resulting in a total of 1560 voice samples from all 13 tasks (13 tasks × 2 trials × 60 subjects).
The recorded samples were saved in a computer, and speech
samples in each task set were organized in a random order to
mitigate potential learning effects on listeners during testing.
Judges and listening training
Eighteen second-year graduate students in speech-language pathology were recruited to be judges, without any monetary or course credit incentives. Graduate research assistants who were involved in preparation for the study were excluded from
serving on the judge panel. All judges had taken classes and gained various clinical experience associated with voice disorders and had a
basic understanding of perceptual rating scales (eg, GRBAS
scale) through didactic coursework or clinical practicum. All
judges received additional listening training where voice samples with known degrees of dysphonia and normal voices
were provided for practice with the chosen rating scale.
Listening training was implemented through a PowerPoint presentation set up on a laboratory computer; each training session
lasted approximately 60 minutes.
Perceptual evaluation
Each judge rated voice samples on the first three parameters of
the GRBAS scale: Grade (G) represents the severity of overall
voice abnormality; Roughness (R) refers to the voice quality
associated with inharmonic vocal fold vibrations and fluctuating fundamental frequency; and Breathiness (B) refers to
the voice quality caused by air leakage through the glottis.
Each parameter was rated on a four-point scale: 0 for normal status, 1 for mild deviance, 2 for moderate deviance, and 3 for severe deviance.8
Eighteen judges were divided into six groups of three listeners. Because the judges also volunteered in other listening-related studies, each cohort of judges in the present study was assigned only five task sets from 30 speakers to prevent listener fatigue. The task sets assigned to each cohort (ie, a group of three judges) typically included the same vowel at three pitch levels (eg, the /a/ vowel in high-, mid-, and low-pitched productions) and two of the nonsustained vowel tasks: maximum phonation, pitch glide, counting, or the Rainbow Passage reading. As a result, each vowel task was evaluated by at least three judges and each nonsustained vowel task was evaluated by six judges. On average, each judge was assigned 300 speech samples (5 tasks × 2 trials × 30 subjects).
During the listening session, each judge was situated at a
computer work station and fitted with a set of headphones; he
or she played back the audio samples at a predetermined volume via Windows Media Player. Judges were permitted to set
their own pace when listening to the speech samples, typically
allowing a short break (about 10 seconds) between speech samples to record the rating scores. To maintain uniformity of the
protocol, judges listened to speech samples only at a preset volume and in a predetermined random order; they were allowed to
play back each sample twice if needed and take as many breaks
as necessary to avoid fatigue. On average, each judge required between 5 and 6 hours to rate the 300 speech samples.
Statistical analysis
The statistical evaluation of the reliability of a panel of
judges customarily uses a measure based on the reproducibility
of a single judge's assessment and/or the degree of agreement between independent judges' assessments. Fleiss'59 kappa (κ) is a frequently used statistic of reliability. However, Gwet60,61 has criticized the statistic because of some paradoxical results in which one obtains a low value of kappa, although the assessments are in high agreement, when there is a high prevalence of a single value. This is precisely the situation that pertains to the present investigation: in all categories (Grade, Roughness, and Breathiness: GRB), over 70-80% of the subjects were perceived to be normal. Clearly, it is prudent to use an alternative statistic that accommodates this scenario. Gwet has proposed the AC1 (Agreement Coefficient 1) statistic as the metric of reliability for just such a case.
The AC1 statistic
Most interrater reliability and intrarater reliability (IRR) statistics attempt to measure the fraction of time that judges agree on the nominal classification of a subject, corrected for random coincidental concurrence. The statistics (both Fleiss' kappa and Gwet's AC1) thus take the form

\hat{\gamma} = \frac{p_a - p_e}{1 - p_e},    (1)

where \hat{\gamma} (AC1 or \kappa) is the statistical measure of agreement among the judges corrected for random agreement, p_a is the average fraction of situations for which the judges agree on the assessment, and p_e is the fraction of cases for which the judges agree that is expected from random coincidence. The latter term is the point of divergence between Gwet's AC1 and Fleiss' kappa. Like his predecessors, Gwet computes the fraction of agreement by

p_a = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{q} \frac{r_{ik}(r_{ik} - 1)}{r(r - 1)},    (2)

with r judges, q categories, and n subjects. The quantity r_{ik} is the number of judges that assign the same rating k to subject i. Moreover, the estimated probability of the occurrence of category k is

\hat{p}_k = \frac{1}{n} \sum_{i=1}^{n} \frac{r_{ik}}{r}.    (3)

Gwet pointed out that in cases in which there is a marked prevalence of one assessment or category, Fleiss' calculation of the random probability of concurrence overestimates p_e, precisely the case that occurs in the present data. Therefore, the authors chose to use the AC1 statistic as the metric of interrater reliability, for which

p_e^{AC1} = \frac{1}{q - 1} \sum_{k=1}^{q} \hat{p}_k (1 - \hat{p}_k).    (4)
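Because equations (1)-(4) are simple sums over a ratings matrix, they can be checked with a few lines of code. The following Python sketch is written directly from the definitions in this section; the function name and the toy ratings matrix are illustrative assumptions, not the authors' code.

    import numpy as np

    def gwet_ac1(ratings, q):
        """Gwet's AC1 for a ratings matrix of shape (n subjects, r judges),
        with integer categories 0..q-1 and no missing ratings."""
        n, r = ratings.shape
        # r_ik: number of judges assigning category k to subject i.
        r_ik = np.stack([(ratings == k).sum(axis=1) for k in range(q)], axis=1)
        # Equation (2): observed agreement, averaged over subjects.
        p_a = (r_ik * (r_ik - 1)).sum() / (n * r * (r - 1))
        # Equation (3): estimated prevalence of each category.
        p_k = r_ik.sum(axis=0) / (n * r)
        # Equation (4): chance agreement for AC1. (Fleiss' kappa instead uses
        # p_e = sum(p_k**2), which grows when one category dominates and
        # drives kappa down, the "paradox" Gwet criticizes.)
        p_e = (p_k * (1 - p_k)).sum() / (q - 1)
        # Equation (1).
        return (p_a - p_e) / (1 - p_e)

    # Toy example: 10 subjects, 3 judges, 4 categories (0 = normal ... 3 = severe),
    # with a high prevalence of "normal" ratings, as in the present data.
    ratings = np.array([[0, 0, 0]] * 8 + [[2, 2, 1], [3, 2, 3]])
    print(gwet_ac1(ratings, q=4))  # approximately 0.85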

RESULTS
The results of the foregoing statistical analysis are unambiguous: Figure 1 and Table 1 report the principal results of this
study. The bar chart in Figure 1 and the AC1 values in Table 1 represent the average value of Gwet's AC1 agreement coefficient for two trials, plotted for each of the scales G, R, and B and by task. The data appear in descending order of the average AC1 for the combination of G, R, and B scale assessments.

FIGURE 1. The statistic of reliability of Gwet, called the Agreement Coefficient AC1, for 13 tasks performed by 60 subjects and evaluated by 18 judges.
Alternatively, one can aggregate the ranking by computing the Borda count,62 essentially averaging the rank (rather than the value) of the reliability metric AC1 for each task across blocks of assessments defined by subject group, judge cohort, and Grade, Roughness, and Breathiness. The rank of reliability of the various tasks differs only slightly between the two methods: high /a/ and low /a/ (ranks 3 and 4, respectively) interchange places; Rainbow and high /i/ (ranks 6 and 7) interchange; and norm /i/ and low /i/ (ranks 11 and 12) interchange. To examine the statistical significance of the ranking, the authors performed a Friedman rank test. The Friedman Q in the current situation approximates the chi-square distribution. For these data, Q = 48.6 with 12 degrees of freedom, for P < 0.02, implying the high significance of the current ranking.63
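Both aggregation steps can be reproduced in a few lines. A minimal sketch, assuming the per-block AC1 values are arranged as a blocks-by-tasks array (the array shape and the placeholder data are assumptions for illustration, not the study's actual values):

    import numpy as np
    from scipy.stats import friedmanchisquare, rankdata

    # ac1: one column per speech task (13), one row per assessment block
    # (subject group x judge cohort x G/R/B feature). Placeholder data here.
    rng = np.random.default_rng(0)
    ac1 = rng.uniform(0.2, 0.9, size=(18, 13))

    # Borda-style aggregation: rank tasks within each block (rank 1 = most
    # reliable), then average the ranks across blocks.
    ranks = np.vstack([rankdata(-row) for row in ac1])
    print("Average rank per task:", ranks.mean(axis=0))

    # Friedman rank test: Q approximates chi-square with 13 - 1 = 12 df.
    q_stat, p_value = friedmanchisquare(*[ac1[:, j] for j in range(13)])
    print(f"Q = {q_stat:.1f}, P = {p_value:.3f}")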
Figure 2 further illustrates the results of this investigation by
plotting the average AC1 for the G, R, and B scales, which also
appear in Table 1. The top four most reliable tasks (as defined
by the AC1 statistic) are counting and sustained /a/ vowels in
the habitual, high, and low registers. Judges assessing these
four tasks exhibited a level of mutual agreement that produced
an average AC1 greater than or equal to 0.66. On the other hand,

TABLE 1.
Interrater Reliability Statistic AC1 by Task and Three GRBAS Features (Grade, Roughness, and Breathiness), Ranked by Average AC1

Task                 Grade   Roughness   Breathiness   GRB Average
Counting             0.63    0.64        0.77          0.68
Habitual-/a/         0.59    0.63        0.81          0.68
Low-/a/              0.57    0.61        0.80          0.66
High-/a/             0.58    0.63        0.75          0.65
Habitual-/i/         0.53    0.66        0.67          0.62
High-/i/             0.53    0.65        0.60          0.59
Rainbow Passage      0.61    0.46        0.68          0.58
Pitch glide          0.46    0.50        0.69          0.55
Maximum phonation    0.44    0.45        0.52          0.47
Habitual-/u/         0.29    0.34        0.56          0.40
High-/u/             0.35    0.47        0.35          0.39
Low-/i/              0.30    0.35        0.52          0.39
Low-/u/              0.35    0.29        0.28          0.31


FIGURE 2. Average AC1 rank ordered by the Agreement Coefficient averaged over the three categories of Grade, Roughness, and Breathiness.
The counting task and the sustained vowel /a/ were the most reproducible in this study and, therefore, exhibit the highest AC1 values.

the habitual-pitched /i/, high-pitched /i/, the Rainbow Passage, and pitch glide tasks performed less reliably in this study, whereas maximum phonation, sustained /u/ vowels, and lower-pitched /i/ exhibited relatively poor reliability with an average AC1 below 0.5. Thus, counting and sustained /a/ vowel sounds appear to be the clear winners, independent of the aggregation method, with statistical significance.
A closer look at Figure 1 suggests that Breathiness, Roughness, and Grade demonstrated, in that order, the most to least relative reliability. This may, however, be an artifact inversely related to the relative sensitivity of the three scales, as the authors explore below.

DISCUSSION
A higher degree of agreement between the raters implies a
lesser degree of random error of measurement caused by the
known or unknown variables.64 When perceptually judging
dysphonic or nondysphonic voices, one recommended
approach to improving interrater reliability and IRR is the selection of suitable speech stimuli. The 13 speech tasks examined in this study contained a wide variety of speech features pertaining to vowel- and text-based materials, pitch level, and utterance length. Among the tasks tested, counting and sustained vowel /a/ productions at three pitch levels produced
greater AC1 values, thus better interrater reliability, on individual and combined GRB scale assessments, whereas maximum
phonation, sustained /u/ vowels, and lower-pitched /i/ exhibited
relatively low AC1 (<0.5).
In contrast to the notion that contextual speech samples are
prone to produce a lower rater reliability,30 the results of this
study clearly show the relative success of counting for eliciting
highly reproducible voice ratings between judges across all individual and combined GRB scale measurements. Although another contextual speech sample, the Rainbow Passage, contains many more samples of consonant-vowel combinations, it failed to elicit the same level of interrater reliability as the counting task in this study. It was observed that the majority of speakers in the study counted the numbers deliberately, with audible pauses between words, thus producing a speaking rate roughly half that of the Rainbow Passage reading (59-143 and 105-268 WPM, respectively). Therefore, the authors speculate that the fluidity and rapidity of Rainbow Passage reading may mask voice features embedded in the speech samples, for instance, roughness or breathiness, and may adversely affect the listener's ability to discriminate voice quality. There is no research evidence establishing whether the speaker's speaking rate has any bearing on the judges' ratings of the voice samples; hence, the authors' interpretation of this matter remains to be confirmed. Another plausible reason for the low interrater reliability of the Rainbow Passage reading is listener fatigue and inattentiveness from listening to a lengthy message. This issue is discussed further later in the Discussion section. Clearly, further studies are warranted to determine the potential benefit of using shorter and more slowly spoken speech samples in perceptual voice evaluation.
Following counting, the AC1 values from the aggregate GRB scale assessment were also substantially high for the vowel /a/, from high to low in the order of habitual-, high-, and low-pitched levels. Although the potential influence of vowel variation on interrater reliability is not yet well understood, the authors theorize that the superiority of the vowel /a/ relative to the other vowels is attributable to the relatively high power and the location of its formants in the spectrum, which coincidentally fall within the listener's range of highest aural sensitivity. Research evidence from the source/filter model of speech production indicates unique glottal and supraglottal behaviors corresponding to each vowel production.
The results from imaging and acoustic studies of the three corner vowels (/a/, /i/, and /u/) demonstrate a larger glottal opening and a lower laryngeal position in /a/ production,38 which also is produced with a lower fundamental frequency and greater voice perturbations.40-42,45,46 Furthermore, the /a/ vowel is produced with a lower, fronted position of the tongue body accompanied by a large mouth opening, resulting in a high first formant frequency (F1 = 768-936 Hz) and a lower second formant frequency (F2 = 1333-1551 Hz), which together form broadband resonance energy clustered in a low-frequency region, consequently producing greater vocal intensity. Although both high vowels /i/ and /u/ are produced with high positioning of the tongue body and carry a low F1 (342-437 Hz for /i/ and 378-459 Hz for /u/),65 they differ in the horizontal position of the tongue and produce different ranges of F2: a higher value for the vowel /i/ (2322-2761 Hz) due to a front tongue position and a lower value for the vowel /u/ (997-1105 Hz) because of its back tongue position.65 Both vowels typically are produced with a smaller aperture of the mouth, which likely impedes the airflow traveling through the vocal tract and in turn dampens the excitation and radiation of sound pressure in the vocal tract, consequently producing weaker vocal signals.39 In short, the /a/ vowel generally generates greater overall intensity of voice signals due to a large mouth opening and the formant frequency cluster formed by F1 and F2; conversely, the production of /i/ or /u/ yields weaker overall vocal energy because of widely spaced formant frequencies and a restricted mouth opening. As a result, the vowel /a/ has the advantage of enhancing the rater's aural ability to detect salient acoustic features in dysphonic and nondysphonic voices, hence raising the level of reproducible assessment between judges.
In this study, pitch shifts, whether in the pitch variation of sustained vowels or in the pitch glide, did not show a notable effect on interrater reliability. It was initially hypothesized that high-pitched voice samples could reduce interrater reliability, based on research evidence indicating an inverse relationship between fundamental frequency and the intensity of harmonics. Because a higher-pitched speech sound tends to carry weaker vocal energy, causing the sound to be thin and difficult to pronounce accurately,66 one might speculate that it could also negatively affect the listener's perceptual judgment. However, such a hypothesis is not fully supported by the findings of this study. Generally speaking, the selection of vowels seems to outweigh the pitch factor, as shown in the results of the study.
The results of this study also indicate an inverse relationship between the length of speech samples and interrater reliability. Shorter speech samples seemingly produce more favorable interrater reliability than lengthier speech tasks. For instance, counting is superior to the Rainbow Passage reading in terms of interrater reliability, despite the former being much shorter (4.2-10.4 seconds) than the latter (22.4-57.1 seconds). For vowel segments, the 5-second sustained /a/ vowel productions produced a significantly higher level of interrater reliability than their counterparts: maximum phonation (duration ranging from 5.2 to 29.3 seconds) and pitch glide (duration ranging from 1.2 to 15.3 seconds). The reasons that longer segments yield lower rater reliability values may be attributed to either (1) the listener's inattentiveness or fatigue or (2) the distraction of signal redundancy. The observed differences in rater reliability could also be due to a combination of both influences.
There are two issues concerning the selection of judges and
subjects in this study that need to be addressed due to potential impacts on the results. The primary reasons for recruiting
18 graduate students to serve as judges in the study are feasibility and consistency. For practical purposes, it is unattainable to recruit a sufficient number of experienced clinicians
from the community to carry out nearly 100 hours of listening
sessions in a laboratory setting. Fortunately, the research evidence indicates that use of novice listeners is not a deterrent to
obtaining good interrater agreement. The literature to date reports no significant relationships between several demographic variables (eg, age, the level of voice-related clinical
experience, vocal fold pathology) and the listeners' judgment
of dysphonic voices.3,5,13,20 Specifically, de Bodt et al20 point
out that professional background (eg, otolaryngologists vs
speech-language pathologists) has a greater influence on
perceptual voice rating than the number of years of clinical
experience, and they conclude that raters' inexperience has
no significant impact on interrater reliability so long as the listeners have some training in perceptual voice judgment. One
strength of the present study is inclusion of a large number
of listeners, which serves to minimize any random or systematic error related to perceptual measures and thus strengthens
the validity of the results.3 Another issue to be addressed is in
regard to the composition of subject groups. Subjects with
laryngeal pathologies were recruited from the university-affiliated voice diagnostic clinic, whereas the nondysphonic
subjects were recruited from the university or surrounding
community. The authors recognize the potential confounding
effects from having a wide range of laryngeal pathologies
present in the dysphonic group and apparent disparity among
subjects in terms of gender and age. Unfortunately, it was not
an attainable goal to attempt gender-, age-, or pathology-matching during subject selection, given time constraints and the location of the study (ie, a nonmedical university setting).
As Eadie et al3 report no significant relationship between
speaker-related factors (eg, age, laryngeal pathology) and interrater reliability, the authors postulate that the results of
this study were generally unaffected by subject disparity.
Nevertheless, it is imperative for future studies to determine
any potential influence of the speaker's gender on interrater
reliability.
A careful examination of Figure 1 raises an additional question about the relative reliability of the GRBAS scale. The authors found an inverse correlation between the fraction of
dysphonic assessments (ratings greater than 0) and the interrater reliability statistic. In fact, only approximately 20-35% of the assessments indicated dysphonia despite the fact that 65% (39/60) of the subjects were diagnosed with some level of laryngeal pathology. This observation implies that sensitivities of, at best, 50% are possible in this sample. Indeed, the
high reliability indicated by the reproducibility of the perceptual assessment may be an artifact of diminished sensitivity.
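As a rough worked bound, assuming (for illustration only) that every dysphonic rating landed on a truly dysphonic subject:

\text{sensitivity} \leq \frac{\text{fraction of dysphonic ratings}}{\text{prevalence of dysphonia}} = \frac{0.35}{0.65} \approx 0.54,

so even under the most charitable assignment of ratings, nearly half of the pathologic subjects would go undetected.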
In the extreme, one might observe very high interrater reliability when all subjects are assessed as normal or nondysphonic. This study is thus only the first chapter in a much
longer tale. To fully validate the task, one must determine the
sensitivity and specificity of perceptual assessment as well as
the reproducibility. A suitable gold standard for dysphonia
must be identified. One candidate is laryngoscopic examination. It is not immediately obvious, however, that all physical
anomalies will result in acoustically detectable dysphonia. This
topic is the subject of future inquiry.

CONCLUSION
The present study is the first to include a wide variety of speech
stimuli to determine the potential influence of speech tasks on
interrater reliability during perceptual voice assessment. The
results indicate that certain salient speech features, such as
contextual speech samples, open vowels, shorter segments,
and a slow speaking rate may facilitate a higher level of perceptual agreement among the listeners.
This study had limited control over the interplay among many independent variables, such as speaking rate, utterance length, and pitch variation; these interactions warrant further study. The findings disseminated in this study provide an appropriate starting point for our ensuing study, which will
appropriate starting point for our ensuing study, which will
focus on investigating the potential influence of speech tasks
on the validity (ie, sensitivity and specificity) of the GRB scales.
The ultimate goal is to find speech stimuli that can provide a
respectable level of IRR and interrater reliability and simultaneously exhibit adequate diagnostic power for differentiating
between dysphonic and nondysphonic voices.
Acknowledgments
Equipment funding of this project was provided by the
Research Infrastructure Grant from the University of North
Texas for which the authors express appreciation to the university administration. The authors thank three graduate students,
Megan Nieckarz, Landon Hughes, and Nurinder Singh, for their
vital roles in data collection. The authors also extend their
thanks to the several anonymous subjects who so willingly performed the many voice tasks.
REFERENCES
1. Behrman A. Common practices of voice therapists in the evaluation of patients. J Voice. 2005;19:454-469.
2. De Bodt MS, Van de Heyning PH, Wuyts F, Lambrechts L. The perceptual evaluation of voice disorders. Acta Otorhinolaryngol Belg. 1996;50:283-291.
3. Eadie TL, Kapsner M, Rosenzweig J, Waugh P, Hillel A, Merati A. The role of experience on judgments of dysphonia. J Voice. 2010;24:1-10.
4. Kent RD. Hearing and believing: some limits to the auditory-perceptual assessment of speech and voice disorders. Am J Speech Lang Pathol. 1996;5:7-23.
5. Kreiman J, Gerratt BR, Kempster GB, Erman A, Berke GS. Perceptual evaluation of voice quality: review, tutorial, and a framework for future research. J Speech Lang Hear Res. 1993;36:21-40.
6. Ma EP, Yiu EM. Multiparametric evaluation of dysphonic severity. J Voice. 2005;20:380-390.
7. Webb AL, Carding PN, Deary IJ, MacKenzie K, Steen N, Wilson JA. The reliability of three perceptual scales for dysphonia. Eur Arch Otorhinolaryngol. 2004;261:429-434.
8. Hirano M. Clinical Examination of Voice. New York, NY: Springer-Verlag; 1981:83-84.
9. Karnell MP, Melton SD, Childes JM, Coleman TC, Dailey SA, Hoffman HT. Reliability of clinician-based (GRBAS and CAPE-V) and patient-based (V-RQOL and IPVI) documentation of voice disorders. J Voice. 2007;21:576-590.
10. Kempster GB, Gerratt BR, Abbott KV, Barkmeier-Kraemer J, Hillman RE. Consensus auditory-perceptual evaluation of voice: development of a standardized clinical protocol. Am J Speech Lang Pathol. 2009;18:124-132.
11. Chan MK, Yiu EM. A comparison of two perceptual voice evaluation training programs for naïve listeners. J Voice. 2006;20:229-241.
12. Dejonckere PH, Obbens C, De Moor GM, Wieneke GH. Perceptual evaluation of dysphonia: reliability and relevance. Folia Phoniatr (Basel). 1993;45:76-83.
13. Eadie TL, Baylor CR. The effect of perceptual training on inexperienced listeners' judgments of dysphonic voice. J Voice. 2006;20:527-544.
14. Hammarberg G, Fritzell B, Gauffin J, Sundberg J, Wedin L. Perceptual and acoustic correlates of abnormal voice qualities. Acta Otolaryngol. 1980;90:441-451.
15. Kreiman J, Gerratt BR. Validity of rating scale measures of voice quality. J Acoust Soc Am. 1998;104(3 pt 1):1598-1608.
16. Revis J, Giovanni A, Wuyts F, Triglia JM. Comparison of different voice samples for perceptual analysis. Folia Phoniatr Logop. 1999;51:108-116.
17. Solomon NP, Helou LB, Stojadinovic A. Clinical versus laboratory ratings of voice using the CAPE-V. J Voice. 2011;25:E7-E14.
18. Wuyts FL, De Bodt MS, Van de Heyning PH. Is the reliability of a visual analog scale higher than an ordinal scale? An experiment with the GRBAS scale for perceptual evaluation of dysphonia. J Voice. 1999;13:508-517.
19. Oates J. Auditory-perceptual evaluation of disordered voice quality. Folia Phoniatr Logop. 2009;61:49-56.
20. De Bodt MS, Wuyts FL, Van de Heyning PH, Croux C. Test-retest study of the GRBAS scale: influence of experience and professional background on perceptual rating of voice quality. J Voice. 1997;11:74-80.
21. Bassich CJ, Ludlow CL. The use of perceptual methods by new clinicians for assessing voice quality. J Speech Hear Disord. 1986;51:125-133.
22. Gerratt BR, Kreiman J, Antonanzas-Barroso N, Berke GS. Comparing internal and external standards in voice quality judgments. J Speech Hear Res. 1993;36:14-20.
23. Iwarsson J, Petersen NR. Effects of consensus training on the reliability of auditory perceptual ratings of voice quality. J Voice. 2012;26:304-312.
24. Dejonckere PH, Remacle M, Fresnel-Elbaz E, Woisard V, Millet B. Reliability and clinical relevance of perceptual evaluation of pathological voices. Rev Laryngol Otol Rhinol (Bord). 1998;119:247-248.
25. Fex S. Perceptual evaluation. J Voice. 1992;6:155-158.
26. Maryn Y, Roy N. Sustained vowels and continuous speech in the auditory-perceptual evaluation of dysphonia severity. J Soc Bras Fonoaudiol. 2012;24:107-112.
27. Parsa V, Jamieson DG. Acoustic discrimination of pathological voice: sustained vowels versus continuous speech. J Speech Hear Res. 2001;44:327-339.
28. Zraick RI, Wendel K, Smith-Olinde L. The effect of speaking task on perceptual judgment of the severity of dysphonic voice. J Voice. 2005;19:574-581.
29. Askenfelt AG, Hammarberg B. Speech waveform perturbation analysis: a perceptual-acoustical comparison of seven measures. J Speech Hear Res. 1986;29:50-64.
30. de Krom G. Consistency and reliability of voice quality ratings for different types of speech fragment. J Speech Hear Res. 1994;37:985-1000.
31. Wolfe V, Cornell R, Fitch J. Sentence/vowel correlation in the evaluation of dysphonia. J Voice. 1995;9:297-303.
32. Takahashi H, Koike Y. Some perceptual dimensions and acoustical correlates of pathologic voices. Acta Otolaryngol Suppl. 1975;338:1-24.
33. Colton RH, Paseman A, Kelley RT, Stepp D, Casper JK. Spectral moment analysis of unilateral vocal fold paralysis. J Voice. 2011;25:330-336.
34. Lowell SY, Colton RH, Kelley RT, Hahn YC. Spectral- and cepstral-based measures during continuous speech: capacity to distinguish dysphonia and consistency within a speaker. J Voice. 2011;25:E223-E232.
35. Moers C, Mobius B, Rosanowski F, Noth E, Eysholdt U, Haderlein T. Vowel- and text-based cepstral analysis of chronic hoarseness. J Voice. 2012;26:416-424.
36. Watts CR, Awan SN. Use of spectral/cepstral analyses for differentiating normal from hypofunctional voices in sustained vowel and continuous speech contexts. J Speech Hear Res. 2011;54:1525-1537.
37. Awan SN, Roy N, Dromey C. Estimating dysphonia severity in continuous speech: application of a multi-parameter spectral/cepstral model. Clin Linguist Phon. 2009;23:825-841.
38. Ahmad M, Dargaud J, Morin A. Dynamic MRI of larynx and vocal fold vibrations in normal phonation. Folia Phoniatr Logop. 2009;23:235-239.
39. Awan SN, Giovinco A, Owens J. Effects of vocal intensity and vowel type on cepstral analysis of voice. J Voice. 2012;26:670.e15-670.e20.
40. Deem JF, Manning WH, Knack JV, Matesich JS. The automatic extraction of pitch perturbation using microcomputers: some methodological considerations. J Speech Hear Res. 1989;32:689-697.
41. Gelfer MP. Survey of Communication Disorders: A Social and Behavioral Perspective. New York, NY: McGraw-Hill Publishing; 1995.
42. Honda K. Relationship between pitch control and vowel articulation. In: Bless D, Abbs J, eds. Vocal Fold Physiology. San Diego, CA: College-Hill Press; 1983:286-297.
43. Kilic MA, Ogut F, Dursun G, Okur E, Yildirim I, Midilli R. The effects of vowels on voice perturbation measures. J Voice. 2004;18:318-324.
44. MacCallum JK, Zhang Y, Jiang JJ. Vowel selection and its effects on perturbation and nonlinear dynamic measures. Folia Phoniatr Logop. 2011;63:88-97.
45. Story BH. An overview of the physiology, physics and modeling of the sound source for vowels. Acoust Sci Technol. 2002;23:195-206.
46. Sussman JE, Sapienza C. Articulatory, developmental, and gender effects on measures of fundamental frequency and jitter. J Voice. 1994;8:145-156.
47. Lin E, Jiang J, Noon SD, Hanson DG. Effects of head extension and tongue extrusion on voice perturbation measures. J Voice. 2000;14:8-16.
48. Preciado Lopez JA, Calzada Uriondo MG, Zabaleta Lopez M, Garcia Cano FJ. Variability in the digital voice analysis depending on the analyzed vocal, in normal patients and patients with dysphonia. Acta Otorrinolaringol Esp. 2000;51:618-628.
49. Orlikoff RF, Kahane JC. Influence of mean sound pressure level on jitter and shimmer measures. J Voice. 1991;5:113-119.
50. Thomas-Kersting C, Casteel RL. Harsh voice: vocal effort perceptual ratings and spectral noise levels of hearing-impaired children. J Commun Disord. 1989;22:125-135.
51. Klingholz F. Acoustic recognition of voice disorders: a comparative study of running speech versus sustained vowels. J Acoust Soc Am. 1990;87:2218-2224.
52. Maryn Y, Corthals P, Van Cauwenberge P, Roy N, De Bodt M. Toward improved ecological validity in the acoustic measurement of overall voice quality: combining continuous speech and sustained vowels. J Voice. 2010;24:540-552.
53. Moon KR, Chung SM, Park HS, Kim HS. Materials of acoustic analysis: sustained vowel versus sentence. J Voice. 2012;26:563-565.
54. Karnell M. Laryngeal perturbation analysis: minimum length of analysis window. J Speech Hear Res. 1991;34:544-548.
55. Scherer RC, Vail VJ, Guo CG. Required number of tokens to determine representative voice perturbation values. J Speech Hear Res. 1995;38:1260-1269.
56. Titze IR, Horii Y, Scherer RC. Some technical considerations in voice perturbation measurements. J Speech Hear Res. 1987;30:252-260.
57. Boucher V, Lamontagne M. Effects of speaking rate on the control of vocal fold vibration: clinical implications of active and passive aspects of devoicing. J Speech Hear Res. 2001;44:1005-1014.
58. Stemple JC, Glaze LE, Klaben BG. Clinical Voice Pathology: Theory and Management. 3rd ed. San Diego, CA: Singular Publishing; 2000.
59. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76:378-382.
60. Gwet KL. Kappa statistic is not satisfactory for assessing the extent of agreement between raters. Statistical Methods for Inter-Rater Reliability Assessment [Internet]. Gaithersburg, MD: Advanced Analytics, LLC; 2002. Available at: http://agreestat.com/research_papers/kappa_statistic_is_not_satisfactory.pdf. Accessed November 9, 2012.
61. Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008;61:29-48.
62. Dummett M. The Borda count and agenda manipulation. Soc Choice Welfare. 1998;15:289-296.
63. Siegel S, Castellan NJ. Nonparametric Statistics for the Behavioral Sciences. 2nd ed. Columbus, OH: McGraw-Hill; 1988.
64. American Psychological Association. Standards of Psychological and Educational Testing. Washington, DC: American Educational Research Association, American Psychological Association, National Council on Measurement in Education; 1985.
65. Hillenbrand J, Getty LA, Clark MJ, Wheeler K. Acoustic characteristics of American English vowels. J Acoust Soc Am. 1995;97(5 pt 1):3099-3111.
66. Sundberg J. The acoustics of the singing voice. Sci Am. 1977;236(3):82-91.
