
INT J LANG COMMUN DISORD, JANUARY–FEBRUARY 2015, VOL. 50, NO. 1, 14–30

Review
Event- and interval-based measurement of stuttering: a review
Ana Rita S. Valente, Luis M. T. Jesus, Andreia Hall and Margaret Leahy
Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
Department of Education (DE), University of Aveiro, Aveiro, Portugal
School of Health Sciences (ESSUA), University of Aveiro, Aveiro, Portugal
Department of Mathematics, University of Aveiro, Aveiro, Portugal
Department of Clinical Speech and Language Studies, Trinity College Dublin, Dublin, Ireland

(Received October 2013; accepted April 2014)


Abstract
Background: Event- and interval-based measurements are two different ways of computing frequency of stuttering.
Interval-based methodology emerged as an alternative measure to overcome problems associated with reproducibility in the event-based methodology. No review has been made to study the effect of methodological factors in
interval-based absolute reliability data or to compute the agreement between the two methodologies in terms of
inter-judge, intra-judge and accuracy (i.e., correspondence between raters' scores and an established criterion).
Aims: To provide a review related to reproducibility of event-based and time-interval measurement, and to verify the
effect of methodological factors (training, experience, interval duration, sample presentation order and judgment
conditions) on agreement of time-interval measurement; in addition, to determine if it is possible to quantify the
agreement between the two methodologies.
Methods & Procedures: The first two authors searched for articles on ERIC, MEDLINE, PubMed, B-on, CENTRAL
and Dissertation Abstracts during January–February 2013 and retrieved 495 articles. Forty-eight articles were
selected for review. Content tables were constructed with the main findings.
Main Contribution: Articles related to event-based measurements revealed values of inter- and intra-judge greater
than 0.70 and agreement percentages beyond 80%. The articles related to time-interval measures revealed that,
in general, judges with more experience with stuttering presented significantly higher levels of intra- and inter-judge agreement. Inter- and intra-judge values were beyond the references for high reproducibility values for
both methodologies. Accuracy (regarding the closeness of raters' judgements with an established criterion), intra- and inter-judge agreement were higher for trained groups when compared with non-trained groups. Sample
presentation order and audio/video conditions did not result in differences in inter- or intra-judge results. A
duration of 5 s for an interval appears to yield acceptable agreement. Explanations for the high reproducibility values
as well as parameter choice to report those data are discussed.
Conclusions & Implications: Both interval- and event-based methodologies used trained or experienced judges for
inter- and intra-judge determination and data were beyond the references for good reproducibility values. Inter- and intra-judge values were reported in different metric scales among event- and interval-based methods studies,
making it unfeasible to quantify the agreement between the two methods.
Keywords: stuttering, event-based measurement, interval-based measurement, inter-judge, intra-judge, accuracy.

What this paper adds?


A systematic review has been completed regarding event- and interval-based measurements of frequency of stuttering.
Interval-based measurement was developed to overcome reproducibility problems from event-based methodology.
This study was developed to verify if it is possible to quantify the agreement between the two methods to conclude
which is the most reproducible or most accurate way to assess stuttering frequency, and to analyse the effect of
methodological factors on agreement of time-interval measurement. Both methodologies presented agreement values
beyond the highest references for all metric scales. It was not possible to perform a comparison study between
methods, so it would be useful in future research to quantify the agreement between the two assessment methods.

Address correspondence to: Ana Rita S. Valente, Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro,
P-3810-193 Aveiro, Portugal; e-mail: rita.valente@ua.pt.
International Journal of Language & Communication Disorders
© 2014 Royal College of Speech and Language Therapists
ISSN 1368-2822 print/ISSN 1460-6984 online 
DOI: 10.1111/1460-6984.12113



Introduction
Several authors have shown that therapy for adults who
stutter (AWS) using behavioural treatment (e.g., fluency
shaping and/or stuttering modification approaches) can
be effective, resulting in a significant decrease in stuttering frequency (Andrews et al. 1980, Bothe et al. 2006,
Herder et al. 2006). The frequency or percentage of
stuttered syllables (%SS) or words (%SW) has been
used to assess changes in fluency as a result of intervention. Stuttering behaviours that include part-word
repetitions, prolongations and blocks (or other defined
stutters) are considered as core behaviours of the disorder
(Van Riper 1982) and these are determined and counted,
and their occurrence reported as a frequency or percentage (Brundage et al. 2006). As an important part of the assessment, researchers and clinicians suggest that reproducibility of frequency estimation should be discussed to develop 'reliable and valid measurements [...] for documenting treatment candidacy and outcomes, and for research and replications' (Brundage et al. 2006: 272).
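As a concrete illustration of the event-based computation just described, the following minimal Python sketch derives %SS from syllable counts; the function name and values are illustrative assumptions, not taken from the reviewed studies.

```python
def percent_stuttered_syllables(stuttered: int, total: int) -> float:
    """%SS: stuttered syllables as a percentage of all syllables spoken."""
    return 100.0 * stuttered / total

# Hypothetical sample: 14 stuttered syllables out of 350 spoken.
print(f"{percent_stuttered_syllables(14, 350):.1f} %SS")  # 4.0 %SS
```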
To assess 'the degree to which repeated measurement in stable study objects, often persons, provide similar results' (de Vet et al. 2006: 1033), reliability and agreement parameters were estimated. The focus of the two concepts is different: agreement (i.e., absolute reliability)
refers to the closeness of repeated measurements (by different judges or by the same judge in various situations)
concerning measurement error (Cordes 1994, Cordes
and Ingham 1994b); reliability (i.e., relative reliability)
explains how well subjects can be distinguished regardless of measurement error, and is related to dependability of obtained data or test scores (Cordes 1994). As an
umbrella term, the word reproducibility has been used
to refer to both concepts (de Vet et al. 2006). Percentage
agreement, standard error of measurement (SEM) and
limits of agreement are typical parameters for agreement,
and intra-class correlation (ICC) is a common index for
reliability reports (de Vet et al. 2006, Gisev et al. 2013).
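To make the distinction concrete, the sketch below computes a one-way ICC (a relative reliability index) and an SEM (an absolute agreement parameter) for hypothetical %SS ratings; the data, the choice of ICC variant and the variable names are illustrative assumptions rather than values drawn from the reviewed studies.

```python
import numpy as np

# Hypothetical %SS scores: 5 speech samples rated by 2 judges.
ratings = np.array([[4.2, 4.5],
                    [7.1, 6.8],
                    [2.3, 2.0],
                    [9.4, 9.9],
                    [5.0, 5.2]])

n, k = ratings.shape
grand_mean = ratings.mean()
subject_means = ratings.mean(axis=1)

# One-way ANOVA mean squares, as used for ICC(1,1).
ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
ms_within = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))

# Relative reliability: proportion of variance due to true differences.
icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Absolute agreement: measurement error in the units of the score.
sem = ratings.std(ddof=1) * np.sqrt(1 - icc)

print(f"ICC(1,1) = {icc:.2f}, SEM = {sem:.2f} %SS")
```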
Frequency counts in the assessment of stuttering
are a popular outcome measure with widespread use.
However, studies of listener agreement for the occurrence of stuttering events have reported unit-by-unit
inter-judge agreement to be below 60% (Curlee 1981,
Coyle and Mallard 1979, Emerick 1960, MacDonald
and Martin 1973, Young 1975, Martin and Haroldson
1981, Martin et al. 1988). There is also evidence of inter-clinic disagreement for elements of frequency counts,
including: total count of stuttering events (Kully and
Boberg 1988, Ham 1989, Cordes et al. 1992); speech
judged as stuttered based on perceptual definitions, i.e.,
the threshold for stuttering is not fixed for any one judge or across judges; and speech judged as stuttered based on different lists of disfluency types (Cordes 2000, Einarsdottir
and Ingham 2005, Young 1975). Such lack of agreement
can lead to different diagnoses or severity descriptions
at different clinics, or even to different definitions of
stuttering (Brundage et al. 2006). In addition, the relevance of estimating severity on the basis of frequency
of stuttering events has been criticized by Leith et al.
(1993) because of the lack of valid and reliable severity
instruments, and because elements of stuttering (such as
blocks accompanied by tension) may indicate the emotional involvement of people who stutter (PWS) when
stuttering.
Other studies based on total counts of stuttering
(Craig 1990, Gaines et al. 1991, Guitar et al. 1992,
Lewis 1991, Onslow et al. 1990, Prins and Hubbard
1990, Zebrowski 1991) reported high estimates of inter- and intra-judge agreement (above 90%). Many problems have been demonstrated in the frequency counts of the studies mentioned above: although total counts of stuttering judgments produced high agreement values, many of these agreement estimates neither presented judgment agreement beyond total event counts, nor corroborated that the data and experimental analyses were unaffected by differences between and among
judges. Such problems lead to the conclusion that differences between low agreement reported in stuttering
measurement research and the high agreement reported
in other stuttering research may be more apparent than
real (Cordes and Ingham 1994b).
An alternative measurement system is based on
behaviours occurring within a defined time-interval
(Cordes et al. 1992). This methodological variation
(time-interval analysis) was introduced by Schloss et
al. (1987) and does not focus on individual stuttering
events but on the presence of a stuttering event within a
time interval (Alpermann et al. 2012). Alpermann et al.
(2012) showed that the ease of judging intervals instead
of individual events could make this type of measurement more feasible for clinicians.
In the time-interval methodology, video or audio
samples of speech are divided into short intervals of time
allowing judges to make a binary judgment (Brundage
et al. 2006, Cordes et al. 1992). Intervals that contain at least one stuttering event judgment are denominated 'stuttered intervals' and intervals with no stuttering event judgments are labelled 'non-stuttered intervals'
(Cordes et al. 1992). In this context, stuttering events
are defined as 'whatever specified observers point to as stuttering' (Bloodstein 1990: 392), based on a modified
version of the perceptual definition of stuttering (Martin and Haroldson 1981, Bloodstein and Ratner 2008).
Two additional stipulations are added in order to control judgment idiosyncrasies: judges should only identify
stuttered disfluencies; judges should consider that one
stuttering event may encompass more than one syllable
or more than one word (Cordes et al. 1992).
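A minimal sketch of the interval-based procedure just described, assuming 5 s intervals and judges who mark the onset times of perceived stuttering events; the function names, timings and sample length are hypothetical.

```python
from typing import List

INTERVAL_S = 5.0  # interval duration; 5 s is discussed later in this review

def to_intervals(event_times: List[float], total_s: float) -> List[bool]:
    """Convert stuttering-event onset times (in seconds) into binary
    interval judgments: True = stuttered, False = non-stuttered."""
    n = int(total_s // INTERVAL_S)
    marks = [False] * n
    for t in event_times:
        idx = int(t // INTERVAL_S)
        if idx < n:
            marks[idx] = True  # one or more events make the interval stuttered
    return marks

def interval_agreement(judge_a: List[bool], judge_b: List[bool]) -> float:
    """Interval-by-interval percentage agreement between two judges."""
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return 100.0 * matches / len(judge_a)

# Hypothetical 60 s sample: each judge marks event onsets in seconds.
a = to_intervals([2.1, 7.4, 33.0, 48.5], 60.0)
b = to_intervals([2.3, 34.1, 48.0, 55.2], 60.0)
print(f"Inter-judge agreement: {interval_agreement(a, b):.0f}%")  # 83%
```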

One important implication of time-interval analysis
is that training based on this methodology increases
agreement and accuracy (i.e., correspondence of stuttered or non-stuttered interval judgments with the criterion
established by a group of authoritative judges (Cordes
and Ingham 1999)) of inexperienced judges (Cordes and
Ingham 1996, 1999, Einarsdottir and Ingham 2005,
Ingham et al. 1993b, Yaruss 1997, Cordes and Ingham
1994a), making judgments more replicable and precise
(Cordes and Ingham 1994a). Ingham et al. (1998b) developed a software and hardware system based on time-interval methodology, Stuttering Measurement Assessment and Training (SMAAT), that includes judgment stimuli from PWS (adults) divided into 5 s intervals
and as judged by 10 highly experienced authorities. The
use of the SMAAT program improves inter-judge agreement, intra-judge agreement and accuracy in untrained
judges (Cordes and Ingham 1999). Ingham et al. (1999)
developed the Stuttering Measurement System (SMS),
which is a standardized training programme for making real-time judgments of speech rate, stuttering frequency and speech naturalness in samples of children and adults who stutter, previously assessed by experts.
The frequency data are computed in an event-based
methodology, but the SMS program allows the option of
calculating stuttering frequency using an interval-based
methodology (the number of non-stuttered intervals
can also be used to estimate the stutter-free speech
rate).
A modified approach of time-interval analysis with
three categories (fluent intervals, stuttered intervals and
intervals with trained speech patterns) is currently being studied (Alpermann et al. 2010, 2012). The modified time-interval analysis was introduced by Natke and
Alpermann (2010) to distinguish the use of trained
speech patterns (acquired during stuttering therapy)
from stuttered and fluent speech by means of time-interval analysis. Alpermann et al. (2010, 2012) concluded that modified time-interval analysis is a reliable
and valid measurement technique.
Time-interval-based measurement is proposed as a
focus of attention here, based on concerns regarding the
agreement of counting instances of stuttering (Yaruss
1997). The comparison between a new measurement
method (i.e., time-interval) and one previously proposed
(i.e., event-based) is necessary to assess how this new
methodology is likely to differ from the established one,
by computing their agreement (Bland and Altman 2010,
Bartlett and Frost 2008).
This paper presents a review of event-based measurement and time-interval analysis in terms of reproducibility data, as well as a discussion related to the comparison between the two methodologies. It also presents
a review regarding the effect of different methodological factors (i.e., training, experience, interval duration,
sample presentation order and judgment conditions) on
inter- and intra-judge agreement related to the time-interval method.
Aims
This paper aims:

• To review and discuss reproducibility data of event-based analysis and time-interval analysis.
• To discuss the effect of different methodological factors (training, experience, interval duration, sample presentation order and judgment conditions) on inter- and intra-judge agreement of the time-interval method.
• To discuss the comparison between the two methodologies in terms of which one represents the most reproducible or most accurate way to assess stuttering frequency (event-based or interval-based) based on selected studies.
Methods
Review
The first two authors of this paper conducted independent literature searches to locate as many relevant
studies as possible. A systematic search of literature
was conducted on electronic databases (ERIC, MEDLINE, PubMed, B-on, the Cochrane Central Register of Controlled Trials (CENTRAL) and Dissertation Abstracts) during January–February 2013. Reviewers looked for studies related to accuracy and reproducibility of time-interval measurement and with reproducibility data
for event-based measurement, from 1992 (the year after which the two measurement systems coexisted)
to the present year. Keywords included: stuttering,
stammering, stuttering judgment, time-interval measurement, time-interval analysis, intra-judge time-interval, inter-judge time-interval, inter-judge agreement, inter-judge reliability, intra-judge agreement,
intra-judge reliability, percentage of syllables stuttered, and agreement of percentage of syllables stuttered. The
two reviewers also attempted to identify studies published in languages other than English.
Each citation abstract retrieved was reviewed independently by the two reviewers in order to assess the
appropriateness of the study for inclusion in the systematic review. If the abstract indicated that the
study met the inclusion criteria (or if the abstract was
unclear), the full text of the article was obtained and assessed for inclusion in the review, based on the inclusion
criteria presented below.
When a complete copy of all potential studies was
obtained, an independent evaluation of each study was
conducted by the two reviewers, in order to produce a
final inclusion decision. In case of disagreement, reviewers discussed key issues until a consensus decision was
attained.
Inclusion and exclusion criteria
For a study to be included in this review, the reviewers
used the following inclusion and exclusion criteria:

• Participants' characteristics: the participants were those who had been diagnosed as PWS (adolescents and adults) without other co-occurring disorders, using labels such as stuttering or stammering.
• Judges' characteristics: the judges were professionals with different backgrounds who had diverse experience levels related to stuttering.
• Nature of the intervention: studies that reported an intervention refer to the use of a training programme using judgment training based on event- and interval-based methodologies. The reviewers excluded studies based on other methodologies, such as self-judgment, time-modified intervals and transcription-based techniques.
• Outcome variables: the outcomes included were accuracy, number of stuttered intervals, %SS, ICC, Pearson product-moment correlation coefficient and Cohen's kappa. When the metric scale was not described, the studies were excluded. The outcome variables related to validity, graphical representation and disfluency types were not included.
• Methodological quality: no restrictions were imposed regarding methodological quality.
Coding

Papers meeting the inclusion criteria were coded for judgment experience, outcome and methodological factor, as well as the statistical method used. The coding was conducted independently by the two reviewers, each using a complete copy of the retrieved article. The reviewers agreed on 95% of occasions and disagreements were resolved through discussion.
Biostatistical methods
Both reviewers individually extracted data related to
inter- and intra-judge values as well as accuracy from
selected studies. Data from each study are summarized
in tables A1, B1 and B2 in the appendices.

Results
Overview of the literature search
The reviewers identified a total of 495 articles via the initial
search, based on different word combinations. The
articles were flagged based on title and abstract and
repeated articles were excluded. Following evaluation
of the abstract and methodology, the number of articles
was reduced to 70 for further review. Articles were excluded due to duplication across electronic databases and because of the absence of inter- and
intra-judge values. The detailed review of the 70
studies based on the exclusion criteria (i.e., articles with
reproducibility data related to self-judgement, modified
time-interval, transcription-based or disfluency types,
as well as graphical representation and a logarithmic
scale transformation of values were excluded) revealed
that 48 articles presented inter-judge and/or intra-judge
values that could be used in the review. Data are shown
in tables A1, B1 and B2 in the appendices.
Characteristics of selected studies and results
summary
The majority of studies related to event-based methodology reported indices of reliability. However, some selected studies used the concepts of agreement and reliability interchangeably, not being specific about the
reproducibility concept reported. For that reason, the
reviewers decided to use the term reproducibility when
referring to event-based methodology and agreement
when mentioning time-interval methodology.
Event-based methodology
A total of 32 studies related to event-based methodology
were selected (see table A1 in appendix A). The aims and
objectives of the selected studies refer to a broad scope
(not specific to the study of the impact of methodological factors on inter- and intra-judge agreement). However, reproducibility data of %SS was a common outcome measure across the studies. Those studies were
mostly related to factors that included: the study of several stuttering intervention programmes (e.g., Camperdown, SpeechEasy, Stuttering Modification Programs or
cognitive–behavioural therapy); the effect of delayed auditory feedback (DAF) and frequency-altered feedback
(FAF); the effect of personality traits and temperament
on stuttering frequency.
Event-based methodology is used for frequency
calculation in assessment instruments, such as the
Stuttering Severity Instrument-4 (SSI-4) (Riley 2009)
or the Test of Childhood Stuttering (TOCS) (Gillam
et al. 2009). SSI-4 is a frequently used instrument for determining aspects of stuttering severity in children and adults. This instrument uses %SS for frequency
computing. TOCS is used to assess speech fluency
skills and stuttering-related behaviours in children
4–12 years old (Gillam et al. 2009). For the assessment
of stuttering frequency, TOCS uses the percentage of
stuttered words. The use of event-based methodology
in standardized tests, in research and in the clinical decision-making process led to the widespread use of
this methodology (Yaruss 1997).
Similar to some interval-based methodology studies
(Finn 1997, Godinho et al. 2006, Ingham and Cordes
1997, Ingham et al. 1995), the majority of the selected event-based studies described judges as experienced (researchers or speech and language therapists, SLTs), trained or enrolled in a training programme, as
experienced and trained judges were, in general, more
reliable than judges with no experience or with no training (Ingham et al. 1993b, Cordes and Ingham 1994a,
1996, 1999, Einarsdottir and Ingham 2008). This was
observed across the studies, given that the inter- and
intra-judge reproducibility exhibited high values for all indices (Hinkle et al. 2003, Landis and Koch 1977, Nunnally and Bernstein 1994, Baer et al. 1987, McHugh
2012, Cordes et al. 1992).
Interval-based methodology
A total of 16 studies related to time-interval analysis methodology were selected (see tables B1 and B2
in appendix B). Ten studied the influence of different methodological factors (training, experience, interval duration, sample presentation order and judgment
conditions) on inter- and intra-judge agreement.
All but one of the selected studies (Einarsdottir and
Ingham 2008) were conducted with speech samples of adolescent/adult populations.
Five studies conducted training programmes with
judges to discriminate between agreed intervals of stuttered and non-stuttered speech (Ingham et al. 1993b,
Cordes and Ingham 1994a, 1996, 1999, Einarsdottir
and Ingham 2008). Two (Einarsdottir and Ingham
2008; Ingham et al. 1998a) of these five studies used
a software and hardware system known as Stuttering
Measurement Assessment and Training (Ingham et al.
1998a).
Four studies conducted an investigation with judges
who had different stuttering judgment experiences (Ingham et al. 1993a, Brundage et al. 2006, Cordes et al.
1992, Cordes and Ingham 1994c). These studies showed
that, in general, judges with more experience with stuttering presented significantly higher levels of intervalby-interval intra- and inter-judge agreement than judges
with less experience (Cordes et al. 1992).
Two studies investigated the effects of interval duration on observers' judgments of stuttering (Cordes and

Ingham 1994c, Cordes et al. 1992). There does not appear to be one obvious interval duration that satisfies all the factors relevant to selecting interval duration. Nevertheless, some findings suggest the use of
intervals with a duration between 3 and 5 s (Cordes and
Ingham 1994c).
Two studies reported inter-clinic comparisons of
stuttering judgments. One was related to graduate and
undergraduate students (Ingham et al. 1993a); the other
was about recognized authorities in stuttering research
(Cordes and Ingham 1995). Inter-clinic comparisons revealed that judges from the same research centre tended
to make similar stuttering judgments. However, there was
a great amount of disagreement between authorities in
stuttering research from different research centres and
a uniformly high intra-judge agreement among judges
(Cordes and Ingham 1995).
One of the selected studies explored differences in
stuttering judgments related to sample presentation order (Ingham et al. 1993b) and one other study examined differences in audiovisual and audio-only judgment
conditions (Ingham et al. 1993a). No significant differences were found across sample presentation order or for
audiovisual/audio-only conditions (Ingham et al. 1993a,
1993b).
Six studies (Ingham et al. 1995, 1997, Fox et al.
2000, Finn 1997, Godinho et al. 2006, Ingham and
Cordes 1997) had different aims and objectives, but reported inter- and intra-judge agreement of stuttering
frequency using the time-interval analysis technique. Three
of these six studies (Ingham et al. 1995, Finn 1997, Ingham and Cordes 1997) describe judges as experienced
researchers or SLTs.
Accuracy and intra- and inter-judge agreement were higher for trained groups when compared with non-trained groups (Ingham et al. 1993b, Cordes and Ingham 1994a, 1996, 1999, Einarsdottir and Ingham
2008, Godinho et al. 2006). Comparing judgments of stuttered intervals and of non-stuttered intervals
made by students and clinicians with judgments made
by highly experienced judges, the accuracy levels were
higher for non-stuttered intervals than for stuttered intervals (Brundage et al. 2006).

Discussion
Inter- and intra-judge reproducibility of event-based
methodology
Reproducibility data related to %SS were reported as
Pearson product-moment correlation, Cohen's kappa,
ICC and agreement percentage (see table A1 in appendix A). Pearson product-moment correlation values for inter-judge ranged between 0.88 and 0.99 and
intra-judge data ranged between 0.86 and 1.0. The values represent a high correlation (Hinkle et al.
2003).
Cohen's kappa inter-judge values ranged from 0.73 to 0.98 and intra-judge values varied between 0.76 and 0.93, which can be interpreted as substantial to almost perfect agreement (Landis and Koch 1977).
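For reference, Cohen's kappa corrects the observed proportion of agreement for the agreement expected by chance; the sketch below applies it to hypothetical binary stuttered/non-stuttered judgments (all values are illustrative).

```python
def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two judges' binary judgments:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each judge's marginal rates.
    p_chance = (sum(a) / n) * (sum(b) / n) + (1 - sum(a) / n) * (1 - sum(b) / n)
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical judgments: 1 = stuttered interval, 0 = non-stuttered.
judge1 = [1, 0, 1, 1, 0, 0, 1, 0]
judge2 = [1, 0, 1, 0, 0, 0, 1, 0]
print(f"kappa = {cohens_kappa(judge1, judge2):.2f}")  # 0.75
```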
ICC inter-judge values ranged between 0.83
and 1.00; intra-judge ranged between 0.89 and 0.99.
The reported values mean that more than 89% of the
observed score variance is due to true score variance
and that the balance of the variance (1 minus the ICC
value) is attributable to error (Weir 2005). The reported
outcomes were greater than 0.70, which is classified as
good (Nunnally and Bernstein 1994).
Studies that report agreement percentage present
inter-judge values that vary between 88% and 98% and
intra-judge values varying from 95% to 99%. These values
were above the minimum acceptable agreement (i.e., >
80%) (McHugh 2012, Baer et al. 1987).
The low number of samples/syllables as well as the
small number of judges used to assess them can help
to explain the high levels of intra- and inter-judge values. The use of experienced and trained judges could also have contributed to the high values reported, as stated in the SSI-4 manual: 'with practice most clinicians can achieve self-agreement of 85% or higher; [...] more stringent training procedures should achieve 90%' (Riley 2009: 17). Frequency estimation through event-based methodology presents some other problems related to the reproducibility indices chosen. ICC and
Pearson product-moment correlation (the more usual indices) are not reports of absolute reliability but of relative
reliability. Absolute reliability (i.e., agreement) is preferable in all situations in which the instrument will be used for evaluation purposes, and is a characteristic of the instrument rather than of a performance with a specific sample (relative reliability) (de Vet et al. 2006: 1038). Relative reliability is highly dependent on the variation in the population sample and is only generalisable to samples with a similar variation (de Vet et al. 2006: 1038).
However, there is no published report of absolute reliability using more appropriate indices, e.g., limits of
agreement, ratio limits of agreement or standard error
of measurement (Gisev et al. 2013, de Vet et al. 2006).

Inter- and intra-judge agreement of interval-based methodology
Inter- and intra-judge agreement data related to the interval-based technique were reported as agreement percentage
between or within judges. Agreement will be discussed
in terms of the impact of methodological factors.
The studies (Ingham et al. 1993a, Cordes and Ingham 1994c, 1995, Brundage et al. 2006, Cordes et al. 1992) in which training was not applied showed that
intra-judge agreement (reported as agreement percentage) of time-interval measurement ranged between a
mean of 76% and 98% for more experienced judges,
and between 60% and 96% for inexperienced judges.
Values of inter-judge agreement (reported as agreement
percentage) for experienced judges ranged between 57%
and 87%. Inexperienced inter-judge agreement values
ranged between 74% and 89%. Generally, studies concluded that inter- and intra-judge agreement were higher
in more experienced judges (Ingham et al. 1993a). Nevertheless, different studies presented distinct results concerning the experience with stuttering, mainly due to
ill-defined judges' experience with stuttering (Ingham
et al. 1993a). In this sense, more recent studies that
do not analyse the influence of methodological factors
(Ingham et al. 1995, Finn 1997, Godinho et al. 2006,
Ingham and Cordes 1997) used experienced judges for
agreement purposes based on the assumption that, generally, more experienced judges reached better agreement
values.
The studies in which training was applied to inexperienced judges showed a higher increment in accuracy and inter- and intra-judge agreement in experimental groups when compared with control groups. In Ingham et al.'s (1993b) study, inter-judge agreement was 71% in pre-training and increased to 86% after training. Accuracy errors decreased from 5.0 (i.e., non-correspondences between judges' scores and the judgment defined by a group of authorities in stuttering assessment) in pre-training to 1.6 in post-training. In Cordes and Ingham's (1999) study, training
led to greater accuracy, whether judges needed to learn to match more stuttering events or to learn not to over-identify non-stuttered intervals. Accuracy and inter-judge
agreement also increased for samples that had not been used for training, which means that training had an effect
for familiar (i.e., samples used during training sessions)
and unfamiliar samples (i.e., samples not used during
training sessions). Related to training with highly agreed
intervals (i.e., 100% agreement), it has been shown that
it is more effective for intra-judge agreement (increase
from 82% in pre-training to 88% in post-training) and
inter-judge agreement (increase from 79–80% to 84–88% in post-training) than training or exposure to poorly agreed (intra-judge increase from 82% in pre-training to 86% in post-training; inter-judge decrease from 81% in pre-training to 78% in post-training) or randomly selected
intervals (Ingham et al. 1993b, Cordes and Ingham
1994a). Software for judges' training (SMAAT) with
samples from AWS (Ingham et al. 1998a) or preschool
children who stutter (Einarsdottir and Ingham 2008)
used highly agreed intervals to improve the correspondence between judges' scores and the criterion judgement established by a group of authoritative judges in stuttering (Brundage et al. 2006). Cordes and Ingham's
(1999) study also showed that the accuracy increased
whether it was assessed with respect to event judgments or in terms of interval judgments, in both agreed stuttered and agreed non-stuttered intervals.
Therefore, results of the studies based on interval-recording training (Ingham et al. 1993b, Cordes and
Ingham 1994a, 1996, 1999, Einarsdottir and Ingham
2008) or interval analyses of event judgments (Cordes
and Ingham 1999) do provide unambiguous evidence that
interval-based procedures will solve the many problems
related to stuttering measurement.
Sample presentation order and audiovisual or audio-only conditions did not reveal different results in inter- and intra-judge agreement (Cordes and Ingham 1996,
Ingham et al. 1993a). Audio-only experimental conditions (Alpermann et al. 2010) have shown similar results
to audiovisual conditions (Cordes and Ingham 1995).
Martin and Haroldson's (1992) study on speech naturalness and severity of stuttering revealed that intra-judge agreement and inter-judge agreement were similar
in audio-only or audiovisual conditions. However, the
speech naturalness of stuttered samples (judged on a 1–9 naturalness scale) was judged systematically as more
unnatural in audiovisual conditions than under audio-only conditions (Martin and Haroldson 1992). The
authors argued that the difference could be explained
by the rating procedure (rather than as a function of
particular speech behaviour). Nevertheless, Alpermann
et al. (2010) concluded that the use of audio-only
conditions might hinder the judgment of some types
of stuttering (e.g., inaudible blocks) and Martin and
Haroldson (1992) suggested that the audiovisual signal
appends visible struggle behaviours to the audio-only
signal.
With regard to sample presentation order, this
methodological factor was studied by Ingham et al.
(1993b), but since then, only two of the selected studies used samples without intervening recording intervals (Cordes and Ingham 1994a, 1999). Ingham et
al. (1993b) argued that randomizing the sample and
separating it by a recorded silent interval provided an
additional control of the judgment task, in order to
avoid the influence of behavioural cues from preceding
intervals. Furthermore, Ingham et al. (1993b) argued
that the absence of differences in inter-judge agreement
across different sample presentation orders may be due to
the experience acquired during the study. Cordes and
Ingham (1994a, 1999) showed that judges could be
trained to achieve better agreement and accuracy values in a continuous-interval format, which means that
time-interval procedures could be used effectively with
spontaneous speech samples in clinical settings.



Duration is another methodological factor explored
in two of the selected studies (Cordes and Ingham
1994c, Cordes et al. 1992). Cordes et al. (1992) found
that inter-judge agreement was maximized at greater
than chance levels for intervals between 3 and 4 s.
Two further studies (Ingham et al. 1993a, 1993b)
used 4 s interval-recordings. However, the 4 s interval duration was chosen based on interval analysis of
event-recorded data, which means that this interval
might not be the best time interval for interval judgments (Cordes and Ingham 1994c). Cordes and Ingham
(1994c) made direct comparisons between different interval durations, within an interval-recording judgment
system, and proposed the use of intervals with a duration between 3 and 5 s as yielding acceptable agreement.
After Cordes and Ingham's (1994c) study, all the selected studies for the systematic review used 5 s interval
duration.
The appropriateness of using interval-based
methodology has been discussed in the literature. It
was stated that the subdivision of speech into intervals
presented randomly to the judge affects the assessment
of co-articulatory gestures, insofar as 'the speech gestures can be influenced by the preceding and following gestures [...] and its duration can be longer than the interval duration' (Guntupalli et al. 2006: 10). The same
authors also refer to the under- or over-counting of stuttering events that can occur in time-interval-based methodology. Due to the issues discussed by Guntupalli et al.
(2006), the increase in agreement produces a decrease in
validity. The duration of time intervals has also been discussed. Howell et al. (1998) and Howell (2005) suggest
that long intervals lead to a ceiling effect with all intervals
being judged as stuttered, which impairs the detection of
changes during intervention. Howell (2005) also found
that the effect of interval duration is dependent on the
severity of stuttering (i.e., the more severe the stuttering,
the more likely intervals were to be judged consistently as
stuttered).
Yaruss (1997) also discusses some concerns related
to time-interval methodology. The author argues that
this methodology cannot be easily transferred to clinical
settings, nor can it be employed with clinical utility during the clinical decision-making process (i.e., time-interval analysis does not present a frequency value with clinical
significance).
Despite the disadvantages noted in the literature
related to the use of interval-based methodology in the assessment of stuttering, time-interval methods
have often been used to estimate behaviours due
to the impracticality of focusing on multiple categories of behaviour/multiple subjects (Meany-Daboul
et al. 2007).



Is it possible to quantify which of the two
methodologies represents the most reproducible or
most accurate way to assess stuttering frequency
based on selected studies?
To analyse which of the two methodologies presents
better inter-judge agreement, intra-judge agreement or accuracy related to the assessment of frequency, a method comparison study
would be necessary in order to inspect the differences
between measurements made by the two methods. The
two methods should be in the same metric or scale (i.e., the same agreement index estimated from both methodologies on a pair of measurements from each subject)
to quantify their agreement (Bartlett and Frost 2008).
The stuttering frequency of the same sample of subjects needs to be measured (and the calculation of inter-judge, intra-judge or accuracy needs to be made) with
both measurement methodologies (Bartlett and Frost
2008) to quantify the agreement between methods and
to assess which method would give the least inter-rater
and inter-method variation (Bowers et al. 2009). Researchers can visually assess the agreement/disagreement
between measurement techniques based on graphical
techniques such as the one proposed by Bland and
Altman (1986). Data from the subjects' measurements made by the two methods could be plotted (i.e., the difference between measurements against the mean) and
limits of agreement could be calculated. Such limits of
agreement would represent the range within which it is
expected that most of the differences (95%) lie.
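A minimal sketch of the Bland and Altman (1986) approach just outlined, using hypothetical paired stuttering-frequency estimates from the two methods; all values are illustrative, not taken from the reviewed studies.

```python
import numpy as np

# Hypothetical frequency estimates for the same six subjects,
# one value per subject from each measurement method.
method_a = np.array([5.1, 8.3, 2.4, 9.7, 6.0, 4.2])
method_b = np.array([5.6, 7.9, 2.9, 10.4, 6.3, 3.8])

diff = method_a - method_b
bias = diff.mean()            # mean difference between methods
sd_diff = diff.std(ddof=1)

# 95% limits of agreement: bias +/- 1.96 SD of the differences.
lower, upper = bias - 1.96 * sd_diff, bias + 1.96 * sd_diff
print(f"Bias = {bias:.2f}; 95% limits of agreement: [{lower:.2f}, {upper:.2f}]")
```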
Considering reproducibility data of event-based
methodology and interval-based methodology, it is important to mention the use of the two methodologies in
the same speech samples in two studies, those of Bothe
(2008) and Cordes and Ingham (1999). Bothe (2008)
assessed the similarities between judgments made in the
same speech sample with interval-based methodology
and disfluency types present (a categorical event-based
approach) in each 5 s interval from 20 children who
stutter. The author concluded that the agreement for
disfluency types was very low and that between 54%
and 55% of the intervals agreed by the highly experienced judges (to be stuttered or non-stuttered) could be
categorized similarly on the basis of disfluency types data
(if a disfluency type was accepted as present based on
80% of judgments). In spite of the comparison of binary
stuttered/non-stuttered judgment and disfluency types,
it is not possible to quantify the agreement/disagreement
because the variables are different, with obviously distinct metrics. The other study (Cordes and Ingham
1999) used a previously developed software and hardware system for training (SMAAT) to investigate if a
training programme based on time-interval methodology could improve the judgement of stuttering events.
After training, inter-judge, intra-judge and accuracy values increased (agreement was only calculated for time-interval methodology and accuracy for both methodologies). Even though the accuracy data were calculated for
both methods on the same speech sample, specific quantitative data do not exist for event-based methodology,
making it unfeasible to compare the two methods.
Based on the data of selected studies, presented in
tables A1, B1 and B2 in the appendices, related to inter-judge, intra-judge or accuracy for the two methodologies, it can be concluded that it was not possible to
quantify which of the methodologies represents the most reproducible and most accurate method to assess stuttering frequency, because the majority of event-based
methodology studies reported relative reliability indices
and interval-based methodology studies reported indices
of absolute reliability (i.e., percentage of agreement);
further, in the majority of studies, the two measurement methods were not used in the same speech sample
(Bartlett and Frost 2008). It was not possible to assess
the consistency of study outcomes due to the lack of
data (i.e., reported confidence intervals or the data necessary to compute them).
Although a method comparison study has not been
implemented, many studies reported that time-interval-based measures may lead to 'markedly' (Ingham and Cordes 1997) or 'demonstrably' (Brundage et al. 2006) better inter- or intra-judge agreement than that traditionally obtained with the procedures required to identify events (%SS). Some studies (Curlee 1981, Coyle and
Mallard 1979, Emerick 1960, MacDonald and Martin
1973, Young 1975, Martin and Haroldson 1981, Martin et al. 1988) about listeners agreement of stuttering
events (which reported unit-by-unit inter-judge agreement below 60%, as reported by Cordes et al. 1992)
used an agreement index developed by Young (1975)
based on the number of observers (n), the total number
of words marked as stuttered (T) and the total number
of different words marked as stuttered (Td):


Agreement index (Young 1975) = [1/(n − 1)] × [(T/Td) − 1]
The interval-based methodology estimates the percentage of judges' agreement for the occurrence of stuttering based on the positive agreement index (PA index),
in which the number of positive intervals (i.e., NP, an
interval in which 80% or more of the judges recognized at least one stuttering event) and the number of
ambiguous intervals (NA) were considered as factors:
PA index = NP/(NP + NA)
Despite the same type of metric scale (percentage of
agreement, which is an index of absolute reliability), the
two measurement techniques are still not comparable
due to estimation differences inherent in the formulas
used for calculation.
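Both indices can be written directly from the formulas above. The sketch below uses hypothetical counts and serves only to show that the two indices aggregate different quantities, which is why their values cannot be compared directly.

```python
def young_agreement_index(n: int, t: int, td: int) -> float:
    """Young's (1975) event-based index: [1/(n - 1)] * [(T/Td) - 1],
    where n = observers, T = total words marked as stuttered and
    Td = total number of different words marked as stuttered."""
    return (1 / (n - 1)) * ((t / td) - 1)

def pa_index(np_intervals: int, na_intervals: int) -> float:
    """Interval-based positive agreement index: NP / (NP + NA)."""
    return np_intervals / (np_intervals + na_intervals)

# Hypothetical counts, for illustration only.
print(young_agreement_index(n=5, t=40, td=16))     # 0.375
print(pa_index(np_intervals=30, na_intervals=10))  # 0.75
```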



Study limitations

The main limitation of the present study was related to the statistical methods used in the selected studies for
the review. Specifically, the majority of selected studies
that used event-based methodology present indices of
relative reliability (e.g., Pearson product-moment correlation and ICC), which are less stable over different populations than agreement indices. The number of studies
with reproducibility values reported is small (n = 48) compared with the number of studies retrieved from the initial database search (n = 495). The lack of studies
reporting agreement indices leads to uncertainty related
to the closeness of the scores stemming from the event-based methodology during repeated measurements (de Vet et al. 2006). It should also be noted that, although a large number of studies (n = 495) was identified in the initial search, only 48 presented sufficient agreement data suitable for review. The lack of agreement data makes it impossible to generalize experimental effects to other populations, as well as to draw conclusions about their consistency or stability in replications (Breakwell et al. 2012,
de Vet et al. 2006).
Conclusions
'The frequency of fluency breaks is often one of the most obvious aspects of the problem and to some degree impacts the perception of severity, particularly by the listener' (Manning 2010: 163). Researchers and clinicians concur that objective definitions of stuttering, and reliable and valid measurement procedures, are both desirable and a core necessity for treatment outcome research
and replication (Brundage et al. 2006).
Both interval- and event-based methodologies use
experienced and/or trained judges for reproducibility
purposes. Trained judges, as well as the small number of samples used for reassessment in event-based methodology studies, could have contributed to agreement values beyond
the references for good reproducibility values in all indices (Hinkle et al. 2003, Landis and Koch 1977, Nunnally and Bernestein 1994, Baer et al. 1987, McHugh
2012). It can also be concluded that it was impracticable to quantify the agreement between inter-judge, intrajudge or accuracy measurements made by the two methods in the selected studies (Bartlett and Frost 2008).
Frequency is an aspect that should be relatively easy
to calculate, but it is difficult to obtain reliable tabulations of event counts (Curlee 1981, Coyle and Mallard
1979, Emerick 1960, MacDonald and Martin 1973,
Young 1975, Martin and Haroldson 1981, Martin
et al. 1988, Ham 1989, Kully and Boberg 1988, Cordes
et al. 1992). Researchers attempted to minimize this
problem by developing better descriptions of stuttering
(Packman and Onslow 1998), developing new methods for the detailed written analysis of stuttering (Yaruss et al. 1998), developing training methods (Cordes and
Ingham 1999, Ingham and Cordes 1997) or using consensus judgment procedures (Cordes 2000). Stuttering,
as a complex behaviour, can be measured with different
methods, based on the desired parameters and the
aims of the assessment. With the ideas reported in the
present paper, clinicians/researchers should reflect upon
the influence of methodological factors and choose the
most appropriate frequency assessment method.
A suggestion for future research is to implement a method comparison study between event- and interval-based methodologies, both in the same metric scale of agreement values. The aim of such a method comparison study would be to verify whether the measurements made by the two methods are sufficiently close
or whether the methods can coexist due to good agreement in terms of reproducibility values (Bartlett and
Frost 2008).
Acknowledgments
This study was developed as part of the Ph.D. Thesis of the
first author at the University of Aveiro, Portugal. This work was
partially funded by National Funds through FCT (Foundation for Science and Technology), in the context of project PEst-OE/EEI/UI0127/2014. This research was also partly supported by
a doctoral grant (SFRH/BD/78311/2011) from FCT to Ana Rita
S. Valente. Declaration of interest: The authors report no conflicts
of interest. The authors alone are responsible for the content and
writing of the paper.

References
ALPERMANN, A., HUBER, W., NATKE, U. and WILLMES, K., 2010,
Measurement of trained speech patterns in stuttering: inter-judge and intra-judge agreement of experts by means of modified time-interval analysis. Journal of Fluency Disorders, 35, 299–313.
ALPERMANN, A., HUBER, W., NATKE, U. and WILLMES, K., 2012,
Construct validity of modified time-interval analysis in measuring stuttering and trained speaking patterns. Journal of
Fluency Disorders, 37, 42–53.
ANDREWS, G., GUITAR, B. and HOWIE, P., 1980, Meta-analysis of the
effect of stuttering treatment. Journal of Speech and Hearing
Research, 45, 287–307.
ANTIPOVA, E. A., PURDY, S. C., BLAKELEY, M. and WILLIAMS, S.,
2008, Effects of altered auditory feedback (AAF) on stuttering
frequency during monologue speech production. Journal of
Fluency Disorders, 33, 274–290.
ARMSON, J. and KIEFTE, M., 2008, The effect of SpeechEasy on stuttering frequency, speech rate, and speech naturalness. Journal
of Fluency Disorders, 33, 120–134.
ARMSON, J., KIEFTE, M., MASON, J. and DE CROOS, D., 2006, The
effect of SpeechEasy on stuttering frequency in laboratory
conditions. Journal of Fluency Disorders, 31, 137–152.
ARMSON, J. and STUART, A., 1998, Effect of extended exposure
to frequency-altered feedback on stuttering during reading
and monologue. Journal of Speech, Language, and Hearing
Research, 41, 479–490.



BAER, D. M., WOLF, M. M. and RISLEY, T. R., 1987, Some still-current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 20, 313–327.
BARTLETT, J. W. and FROST, C., 2008, Reliability, repeatability and
reproducibility: analysis of measurement errors in continuous
variables. Ultrasound in Obstetrics and Gynecology, 31, 466–475.
BAUERLY, K. R. and DE NIL, L. F., 2011, Speech sequence skill
learning in adults who stutter. Journal of Fluency Disorders,
36, 349–360.
BEILBY, J. M., BYRNES, M. L. and YARUSS, J. S., 2012, Acceptance
and commitment therapy for adults who stutter: psychosocial
adjustment and speech fluency. Journal of Fluency Disorders,
37, 289–299.
BLAND, J. M. and ALTMAN, D. G., 1986, Statistical methods for
assessing agreement between two methods of clinical measurement. Lancet, i, 307–310.
BLAND, J. M. and ALTMAN, D. G., 2010, Statistical methods for assessing agreement between two methods of clinical measurement. International Journal of Nursing Studies, 47, 931–936.
BLOODSTEIN, O., 1990, On pluttering, skivering, and floggering:
a commentary. Journal of Speech and Hearing Disorders, 55,
392–393.
BLOODSTEIN, O. and RATNER, N. B., 2008, A Handbook on Stuttering (Clifton Park: Delmar).
BOBERG, E. and KULLY, D., 1994, Long term results of an intensive
treatment program for adults and adolescents who stutter.
Journal of Speech and Hearing Research, 37, 1050–1059.
BOTHE, A., 2008, Identification of children's stuttered and nonstuttered speech by highly experienced judges: binary judgments
and comparisons with disfluency-types definitions. Journal of
Speech, Language and Hearing Research, 51, 867–878.
BOTHE, A. K., DAVIDOW, J. H., BRAMLETT, R. E. and INGHAM, R.,
2006, Stuttering treatment research 1970–2005: I. Systematic
review incorporating trial quality assessment of behavioral,
cognitive, and related approaches. American Journal of Speech–Language Pathology, 15, 321–341.
BOWERS, K. M., BELL, D., CHIODINI, P. L., BARNWELL, J., INCARDONA, S., YEN, S., LUCHAVEZ, J. and WATT, H., 2009, Interrater reliability of malaria parasite counts and comparison of
methods. Malaria Journal, 8, 267–275.
BREAKWELL, G., SMITH, J. and WRIGHT, D. (Editors), 2012, Research
Methods in Psychology, 4th Edition (London: Sage).
BRUNDAGE, S., BOTHE, A., LENGELING, A. K. and EVANS, J., 2006,
Comparing judgments of stuttering made by students, clinicians, and highly experienced judges. Journal of Fluency Disorders, 31, 271–283.
CORDES, A., 1994, The reliability of observational data: I. Theories
and methods for speech–language pathology. Journal of Speech and Hearing Research, 37, 264–278.
CORDES, A. K., 2000, Individual and consensus judgments of disfluency types in the speech of persons who stutter. Journal of Speech, Language, and Hearing Research, 43, 951–964.
CORDES, A. and INGHAM, R., 1994a, Time-interval measurement
of stuttering: effects of training with highly agreed or poorly
agreed exemplars. Journal of Speech and Hearing Research, 37,
1295–1307.
CORDES, A. K. and INGHAM, R., 1994b, The reliability of observational data: II. Issues in the identification and measurement
of stuttering events. Journal of Speech and Hearing Research,
37, 279–294.
CORDES, A. K. and INGHAM, R., 1994c, Time-interval measurement
of stuttering: effects of interval duration. Journal of Speech and
Hearing Research, 37, 779–788.

CORDES, A. K. and INGHAM, R., 1995, Judgments of stuttered
and nonstuttered intervals by recognized authorities in stuttering research. Journal of Speech and Hearing Research, 38,
33–41.
CORDES, A. K. and INGHAM, R., 1996, Time-interval measurement
of stuttering: establishing and modifying judgment accuracy.
Journal of Speech and Hearing Research, 39, 298–310.
CORDES, A. K. and INGHAM, R., 1999, Effects of time-interval
judgment training on real-time measurement of stuttering. Journal of Speech, Language, and Hearing Research, 42,
862–879.
CORDES, A. K., INGHAM, R., FRANK, P. and INGHAM, J., 1992, Time-interval analysis of inter-judge and intra-judge agreement for stuttering event judgment. Journal of Speech and Hearing Research, 35, 483–494.
COYLE, M. M. and MALLARD, A. R., 1979, Word-by-word analysis
of observer agreement utilizing audio and audiovisual techniques. Journal of Fluency Disorders, 4, 2328.
CRAIG, A., 1990, An investigation into the relationship between
anxiety and stuttering. Journal of Speech and Hearing Disorders,
55, 290–294.
CREAM, A., O'BRIAN, S., JONES, M., BLOCK, S., HARRISON, E.,
LINCOLN, M., HEWAT, S., PACKMAN, A., MENZIES, R. and
ONSLOW, M., 2010, Randomized controlled trial of video
self-modeling following speech restructuring treatment for
stuttering. Journal of Speech, Language, and Hearing Research,
53, 887–897.
CURLEE, R. F., 1981, Observer agreement on disfluency and stuttering. Journal of Speech and Hearing Research, 24, 595–600.
DE VET, H. C. W., TERWEE, C. B., KNOL, D. L. and BOUTER, L.
M., 2006, When to use agreement versus reliability measures.
Journal of Clinical Epidemiology, 59, 1033–1039.
EICHSTÄDT, A., WATT, N. and GIRSON, J., 1998, Evaluation of
the efficacy of a stutter modification program with particular
reference to two new measures of secondary behaviors and
control of stuttering. Journal of Fluency Disorders, 23, 231–246.

EINARSDOTTIR, J. and INGHAM, R. J., 2005, Have disfluency-type measures contributed to the understanding and treatment of developmental stuttering? American Journal of Speech–Language Pathology, 14, 231–243.
EINARSDOTTIR, J. and INGHAM, R., 2008, The effect of stuttering measurement training on judging stuttering occurrence in preschool children who stutter. Journal of Fluency Disorders, 33, 167–179.
EMERICK, L. L., 1960, Extensional definition and attitude toward
stuttering. Journal of Speech and Hearing Research, 3, 181–186.
FINN, P., 1997, Adults recovered from stuttering without formal
treatment: perceptual assessment of speech normalcy. Journal
of Speech, Language, and Hearing Research, 40, 821–831.
FOX, P. T., INGHAM, R., INGHAM, J., ZAMARRIPA, F., XIONG, J.
and LANCASTER, J., 2000, Brain correlates of stuttering and
syllable production: a PET performance-correlation analysis.
Brain, 123, 1985–2004.
GAINES, N. D., RUNYAN, C. M. and MEYERS, S. C., 1991, A comparison of young stutterers' fluent versus stuttered utterances
on measures of length and complexity. Journal of Speech and
Hearing Research, 34, 37–42.
GALLOP, R. F. and RUNYAN, C. M., 2012, Long-term effectiveness
of the SpeechEasy fluency-enhancement device. Journal of
Fluency Disorders, 37, 334–343.
GILLAM, R. B., LOGAN, K. J. and PEARSON, N. A., 2009, Test of
Childhood Stuttering (Austin, TX: PRO-ED).

GISEV, N., BELL, J. S. and CHEN, T. F., 2013, Interrater agreement and interrater reliability: key concepts, approaches, and
applications. Research in Social and Administrative Pharmacy, 9, 330–338.
GODINHO, T., INGHAM, R., DAVIDOW, J. and COTTON, J., 2006,
The distribution of phonated intervals in the speech of individuals who stutter. Journal of Speech, Language and Hearing
Research, 49, 161–171.
GUITAR, B., SCHAEFER, H. K., DONAHUE-KILBURG, G. and BOND,
L., 1992, Parent verbal interactions and speech rate: a case
study in stuttering. Journal of Speech and Hearing Research,
35, 742–754.
GUNTUPALLI, V. K., KALINOWSKI, J. and SALTUKLAROGLU, T., 2006,
The need for self-report data in the assessment of stuttering
therapy efficacy: repetitions and prolongations of speech. The
stuttering syndrome. International Journal of Language and
Communication Disorders, 41, 1–18.
HAM, R. E., 1989, What are we measuring? Journal of Fluency Disorders, 14, 231–243.
HERDER, C., HOWARD, C., NYE, C. and VANRYCKEGHEM, M., 2006,
Effectiveness of behavioral stuttering treatment: a systematic
review and meta-analysis. Contemporary Issues in Communication Science and Disorders, 33, 6173.
HINKLE, D. E., WIERSMA, W. and JURS, S. G., 2003, Applied Statistics
for the Behavioral Sciences (Boston, MA: Houghton Mifflin).
HOWELL, P., 2005, The effect of using time intervals of different
length on judgements about stuttering. Stammering Research,
1, 364–374.
HOWELL, P., STAVELEY, A., SACKIN, S. and RUSTIN, L., 1998, Methods of interval selection, presence of noise and their effects on
detectability of repetitions and prolongations. Journal of the
Acoustical Society of America, 104, 3558–3567.
HUBBARD, C. P., 1998, Stuttering, stressed syllables, and word onsets.
Journal of Speech, Language, and Hearing Research, 41, 802–808.
HUINCK, W. and RIETVELD, T., 2007, The validity of a simple outcome measure to assess stuttering therapy. Folia Phoniatrica
et Logopaedica, 59, 91–99.
INGHAM, R. J., BAKKER, K., MOGLIA, R. and KILGO, M., 1999,
Stuttering Measurement System (SMS) (Santa Barbara, CA:
University of California).
INGHAM, R. J. and CORDES, A. K., 1997, Identifying the authoritative judgments of stuttering: comparisons of self-judgments
and observer judgments. Journal of Speech, Language and
Hearing Research, 40, 581–594.
INGHAM, R., CORDES, A. K. and FINN, P., 1993a, Time-interval
measurement of stuttering: systematic replication of Ingham,
Cordes, and Gow (1993). Journal of Speech and Hearing Research, 36, 1168–1176.
INGHAM, R., CORDES, A. K. and GOW, M., 1993b, Time-interval
measurement of stuttering: modifying inter-judge agreement.
Journal of Speech and Hearing Research, 36, 503515.
INGHAM, R., CORDES, A. K., INGHAM, J. and GOW, M. L., 1995,
Identifying the onset and offset of stuttering events. Journal
of Speech and Hearing Research, 38, 315326.
INGHAM, R., CORDES, A. K., KILGO, M. and MOGLIA, R., 1998a, Stuttering Measurement Assessment and Training (SMAAT) (Santa Barbara, CA: University of California).
INGHAM, R. J., CORDES, A. K., KILGO, M. and MOGLIA, R., 1998b, Stuttering Measurement Assessment and Training (SMAAT) (Santa Barbara, CA: University of California).
INGHAM, R., MOGLIA, R., FRANK, P., INGHAM, J. and CORDES, A. K., 1997, Experimental investigation of the effects of frequency-altered auditory feedback on the speech of adults who stutter. Journal of Speech, Language, and Hearing Research, 40, 361–372.
KALINOWSKI, J., ARMSON, J. and STUART, A., 1995, Effect of normal and fast articulatory rates on stuttering frequency. Journal of Fluency Disorders, 20, 293–302.
KALINOWSKI, J., STUART, A., SARK, S. and ARMSON, J., 1996, Stuttering amelioration at various auditory feedback delays and speech rates. European Journal of Disorders of Communication, 31, 259–269.
KALINOWSKI, J., STUART, A., WAMSLEY, L. and RASTATTER, M. P., 1999, Effects of monitoring condition and frequency-altered feedback on stuttering frequency. Journal of Speech, Language, and Hearing Research, 42, 1347–1354.
KULLY, D. and BOBERG, E., 1988, An investigation of inter-clinic agreement in the identification of fluent and stuttered syllables. Journal of Fluency Disorders, 13, 309–318.
LANDIS, J. R. and KOCH, G. G., 1977, The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
LANGEVIN, M. and BOBERG, E., 1993, Results of an intensive stuttering therapy program. Canadian Journal of Speech–Language Pathology and Audiology, 17, 158–166.
LANGEVIN, M., HUINCK, W. J., KULLY, D., PETERS, H. F. M., LOMHEIM, H. and TELLERS, M., 2006, A cross-cultural, long-term outcome evaluation of the ISTAR Comprehensive Stuttering Program across Dutch and Canadian adults who stutter. Journal of Fluency Disorders, 31, 229–256.
LANGEVIN, M., KULLY, D., TESHIMA, S., HAGLER, P. and NARASIMHA, P. N. G., 2010, Five-year longitudinal treatment outcomes of the ISTAR Comprehensive Stuttering Program. Journal of Fluency Disorders, 35, 123–140.
LEITH, W., MAHR, G. C. and MILLER, L. D., 1993, The Assessment of Speech-Related Attitudes and Beliefs of People who Stutter (Rockville, MD: American Speech–Language–Hearing Association).
LEWIS, K. E., 1991, The structure of disfluency behaviors in the speech of adult stutterers. Journal of Speech and Hearing Research, 34, 492–500.
LINCOLN, M., PACKMAN, A. and ONSLOW, M., 2010, An experimental investigation of the effect of altered auditory feedback on the conversational speech of adults who stutter. Journal of Speech, Language, and Hearing Research, 53, 1122–1131.
MACDONALD, J. and MARTIN, R. R., 1973, Stuttering and disfluency as two reliable and unambiguous response classes. Journal of Speech and Hearing Research, 16, 691–699.
MACLEOD, J., KALINOWSKI, J., STUART, A. and ARMSON, J., 1995, Effect of single and combined altered auditory feedback on stuttering frequency at two speech rates. Journal of Communication Disorders, 28, 217–228.
MANNING, W. H., 2010, Clinical Decision Making in Fluency Disorders (Clifton Park, NY: Delmar).
MANNING, W. H. and BECK, J. G., 2013, Personality dysfunction in adults who stutter: another look. Journal of Fluency Disorders, 38, 184–192.
MARTIN, R. R. and HAROLDSON, S. K., 1981, Stuttering identification: standard definition and moment of stuttering. Journal of Speech and Hearing Research, 24, 59–63.
MARTIN, R. R. and HAROLDSON, S. K., 1992, Stuttering and speech naturalness: audio and audiovisual judgments. Journal of Speech and Hearing Research, 35, 521–528.
MARTIN, R. R., HAROLDSON, S. K. and WOESSNER, G. L., 1988, Perceptual scaling of stuttering severity. Journal of Fluency Disorders, 13, 27–47.
MAX, L. and BALDWIN, C. J., 2010, The role of motor learning in stuttering adaptation: repeated versus novel utterances in a practice-retention paradigm. Journal of Fluency Disorders, 35, 33–43.
MCHUGH, M. L., 2012, Interrater reliability: the kappa statistic. Biochemia Medica, 22, 276–282.
MEANY-DABOUL, M. G., ROSCOE, E. M., BOURRET, J. C. and AHEARN, W. H., 2007, A comparison of momentary time sampling and partial-interval recording for evaluating functional relations. Journal of Applied Behavior Analysis, 40, 501–514.
MENZIES, R. G., O'BRIAN, S., ONSLOW, M., PACKMAN, A., ST CLARE, T. and BLOCK, S., 2008, An experimental clinical trial of a cognitive-behavior therapy package for chronic stuttering. Journal of Speech, Language, and Hearing Research, 51, 1451–1464.
NATKE, U. and ALPERMANN, A., 2010, Stottern: Erkenntnisse, Theorien, Behandlungsmethoden (Bern: Hans Huber).
NUNNALLY, J. C. and BERNSTEIN, I. H., 1994, Psychometric Theory (New York, NY: McGraw-Hill).
O'BRIAN, S., ONSLOW, M., CREAM, A. and PACKMAN, A., 2003, The Camperdown Program: outcomes of a new prolonged-speech treatment model. Journal of Speech, Language, and Hearing Research, 46, 933–946.
O'BRIAN, S., PACKMAN, A. and ONSLOW, M., 2008, Telehealth delivery of the Camperdown Program for adults who stutter: a phase I trial. Journal of Speech, Language, and Hearing Research, 51, 184–195.
O'DONNELL, J. J., ARMSON, J. and KIEFTE, M., 2008, The effectiveness of SpeechEasy during situations of daily living. Journal of Fluency Disorders, 33, 99–119.
ONSLOW, M., COSTA, L. and RUE, S., 1990, Direct early intervention with stuttering: some preliminary data. Journal of Speech and Hearing Disorders, 55, 405–416.
PACKMAN, A. and ONSLOW, M., 1998, The behavioral data language of stuttering. In A. K. Cordes and R. J. Ingham (eds), Treatment Efficacy for Stuttering: A Search for Empirical Bases (San Diego, CA: Singular), 27–50.
PACKMAN, A., ONSLOW, M. and VAN DOORN, J., 1994, Prolonged speech and modification of stuttering: perceptual, acoustic, and electroglottographic data. Journal of Speech and Hearing Research, 37, 724–737.
PRINS, D. and HUBBARD, C. P., 1990, Acoustical durations of speech segments during stuttering adaptation. Journal of Speech and Hearing Research, 31, 696–709.
RILEY, G., 2009, Stuttering Severity Instrument, Fourth Edition (Austin, TX: PRO-ED).
SCHLOSS, P. L., FREEMAN, C. A. and SMITH, M. A., 1987, Influence of assertiveness training on the stuttering rates exhibited by three young adults. Journal of Fluency Disorders, 12, 333–353.
SPARKS, G., GRANT, D. E., MILLAY, K., WALKER-BATSON, D. and HYNAN, L. S., 2002, The effect of fast speech rate on stuttering frequency during delayed auditory feedback. Journal of Fluency Disorders, 27, 187–201.
STUART, A., KALINOWSKI, J., ARMSON, J., STENSTROM, R. and JONES, K., 1996, Fluency effect of frequency alterations of plus/minus one-half and one-quarter octave shifts in auditory feedback of people who stutter. Journal of Speech and Hearing Research, 39, 396–401.
UNGER, J. P., GLÜCK, C. W. and CHOLEWA, J., 2012, Immediate effects of AAF devices on the characteristics of stuttering: a clinical analysis. Journal of Fluency Disorders, 37, 122–134.
VAN RIPER, C., 1982, The Nature of Stuttering (Englewood Cliffs, NJ: Prentice-Hall).
VINCENT, I., GRELA, B. G. and GILBERT, H. R., 2012, Phonological priming in adults who stutter. Journal of Fluency Disorders, 37, 91–105.
WEIR, J. P., 2005, Quantifying test–retest reliability using the intraclass correlation coefficient and the SEM. Journal of Strength and Conditioning Research, 19, 231–240.
YARUSS, J. S., 1997, Clinical measurement of stuttering behaviors. Contemporary Issues in Communication Science and Disorders, 24, 33–44.
YARUSS, J. S., MAX, M., NEWMAN, R. and CAMPBELL, J., 1998, Comparing real-time and transcript-based techniques for measuring stuttering. Journal of Fluency Disorders, 23, 137–151.
YOUNG, M. A., 1975, Observer agreement for marking moments of stuttering. Journal of Speech and Hearing Research, 29, 484–489.
ZEBROWSKI, P. M., 1991, Duration of the speech disfluencies of beginning stutterers. Journal of Speech and Hearing Research, 34, 483–491.
ZIMMERMAN, S., KALINOWSKI, J., STUART, A. and RASTATTER, M., 1997, Effect of altered auditory feedback on people who stutter during scripted telephone conversations. Journal of Speech, Language, and Hearing Research, 40, 1130–1134.

Appendix A

Table A1. Results summary of selected studies related to reproducibility data of %SS

Columns reported: Study; Judges; Number of samples/syllables assessed; Inter-judge; Intra-judge; Statistical method.

Studies summarized: Langevin and Boberg (1993); Boberg and Kully (1994); Packman et al. (1994); Kalinowski et al. (1995); Macleod et al. (1995); Kalinowski et al. (1996); Stuart et al. (1996); Zimmerman et al. (1997); Armson and Stuart (1998); Eichstadt et al. (1998); Hubbard (1998); Kalinowski et al. (1999); Sparks et al. (2002); O'Brian et al. (2003); Armson et al. (2006); Langevin et al. (2006); Huinck and Rietveld (2007); Antipova et al. (2008); Armson and Kiefte (2008); Menzies et al. (2008); O'Brian et al. (2008); O'Donnell et al. (2008); Cream et al. (2010); Langevin et al. (2010); Lincoln et al. (2010); Max and Baldwin (2010); Bauerly and De Nil (2011); Beilby et al. (2012); Gallop and Runyan (2012); Unger et al. (2012); Vincent et al. (2012); Manning and Beck (2013).

Judges: almost all studies used two judges, described variously as speech and language therapists (SLTs), clinicians, graduate speech–language pathologists, trained research assistants or trained graduate students; several studies paired the first author with an independent trained judge, and judges were often trained by clinicians with 20–30 years of clinical experience in stuttering (in one case by an ASHA-certified clinician). Langevin et al. (2006) used three Dutch and four Canadian research assistants trained with a training programme; one study used four trained research assistants, one used three raters, and one did not report judge information.

Samples assessed: most studies assessed a 10–30% subset of the total sample (most often 10%; e.g., 600–3000 syllables or 88–705 samples); others assessed fixed subsets (e.g., five samples from 10 subjects, 26 samples from six of 49 PWS, 24 of 88 speaking sessions, 13 of 126 samples, 150 syllables, one-third of the total sample (2400 syllables), 4000 of 60 000 words, or 6000 (40%) of 15 000 words) or the totality of the sample (900, 1538, 2000 or 12 390 syllables); some did not report the total number of syllables.

Results: inter-judge values ranged from 0.68 to 1.00 (0.73–0.78 where computed syllable by syllable) and intra-judge values from 0.76 to 1.00 (0.86 syllable by syllable); where point-to-point agreement percentages were reported, inter-judge agreement was 88–98% and intra-judge agreement 95–99%. One ICC was reported with a confidence interval (0.98; CI = 0.94–0.99), one was derived from a one-way independent-group random-effect model analysis (0.91), and one study reported inter-judge values of 0.98 for reading and 0.96 for speaking samples of PWS. Several studies calculated only one of the two indices.

Statistical methods: Pearson's r, Cohen's kappa (in one study together with a Spearman rank-order correlation: kappa = 0.85, Spearman = 0.99), the intraclass correlation coefficient (ICC) and agreement percentages.
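The reproducibility statistics collected in table A1 are straightforward to compute. The following Python sketch is purely illustrative (the data, function names and the choice of NumPy are ours, not drawn from any reviewed study): it compares two judges' binary stuttered/fluent judgements with point-to-point percentage agreement and Cohen's kappa, and correlates their per-sample %SS scores with Pearson's r.

```python
import numpy as np

def percent_agreement(a, b):
    """Point-to-point agreement: proportion of syllables judged identically."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a == b))

def cohens_kappa(a, b):
    """Cohen's kappa for two judges' binary (1 = stuttered, 0 = fluent) judgements."""
    a, b = np.asarray(a), np.asarray(b)
    p_obs = np.mean(a == b)
    # Chance agreement from each judge's marginal rate of 'stuttered' calls
    p_a, p_b = a.mean(), b.mean()
    p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    return float((p_obs - p_chance) / (1 - p_chance))

# Hypothetical syllable-by-syllable judgements from two judges
judge1 = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 0])
judge2 = np.array([0, 1, 0, 1, 1, 0, 1, 0, 0, 0])
print(percent_agreement(judge1, judge2))  # 0.9
print(cohens_kappa(judge1, judge2))       # ~0.78

# Pearson's r over paired %SS scores, one pair of scores per speech sample
pss1 = np.array([4.2, 7.5, 1.1, 9.8, 3.3])
pss2 = np.array([4.0, 8.1, 1.5, 9.2, 3.0])
print(np.corrcoef(pss1, pss2)[0, 1])      # close to 1 for judges who agree
```

Because kappa corrects the observed agreement for the agreement expected by chance from each judge's base rate, it is typically lower than the raw agreement percentage computed on the same judgements.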

Appendix B

Table B1. Results summary of selected studies related to agreement data in which training was applied

Columns reported: Study and judges; Inter-judge agreement; Intra-judge agreement; Accuracy; Factor; Statistical method (agreement percentage in all studies).

Ingham et al. (1993b): 23 undergraduate students; factors: training and sample presentation order. Inter-judge agreement: Group A (control), 81% on occasion 1 (so-called pre-training) and 83% on occasion 2 (so-called post-training); Group B (experimental), 71% and 86%; Group C (control), 75% and 65%. Intra-judge agreement: Group A, a mean of 91%; Group B, not calculated; Group C, a mean of 84%. Accuracy: Group B, 5.0 errors on occasion 1 and 1.6 errors on occasion 2; Group C, 5.4 errors on occasion 1 and 6.0 errors on occasion 2.

Cordes and Ingham (1994b): 30 speech–language pathology and audiology students divided into six groups (training high, training poor, training random, control high, control poor, control random); factor: training. Inter- and intra-judge agreement: group means of 76–83% on occasions 1–2 and 65–88% on occasions 4–5; the training-high group improved from means of 79–82% to 84–88%, whereas the training-random and control-random groups fell to 65% and 69%, respectively, on the later occasions.

Cordes and Ingham (1996): 10 undergraduate students; factor: training. Inter- and intra-judge agreement were reported separately for stuttered, non-stuttered and disagreed intervals: agreement for stuttered intervals rose from pre-training values of 41–89% (as low as 41–48% for training speakers) to post-training values of 84–99%; agreement for non-stuttered intervals was 89–100% pre-training but fell to 43–96% post-training; agreement for disagreed intervals was 63–96% pre-training and 45–99% post-training. Accuracy (for the 1994b and 1996 training studies): accuracy lay between 45% and 95% on the first three trials of Phase II and was not lower than 91% on the last three trials, and accuracy was maintained during occasions 3 and 4.

Cordes and Ingham (1999): 20 university students semi-randomly allocated to two judge groups; factor: training. Inter-judge agreement: training group 1 ranged from 71% to 76% pre-training and from 82% to 91% post-training; training group 2 ranged from 75% to 87% pre-training and from 83% to 88% post-training. Intra-judge agreement: group 1 ranged from 81% to 87% pre-training and from 89% to 92% post-training; group 2 ranged from 86% to 92% pre-training and from 92% to 94% post-training.

Einarsdottir and Ingham (2008): 20 female preschool teachers randomly allocated to an experimental group and a control group; measures included the number of stuttering events, stuttering duration, the number of intervals, interval-by-interval inter- and intra-judge agreement, and accuracy. Inter- and intra-judge agreement were not calculated for either group. Accuracy: the experimental group reduced its number of errors to zero during training sessions, whereas the control group showed no tendency to change its total number of errors.

Table B2. Results summary of selected studies related to agreement data in which training was not applied

Columns reported: Study and judges; Inter-judge agreement; Intra-judge agreement; Factor; Statistical method (agreement percentage in all studies). Factors examined: judges' experience; interval duration; audiovisual versus audio-only judgement conditions; and inter-clinic comparisons, including comparisons with recognized authorities in stuttering.

Cordes et al. (1992): three groups of judges (18 in total): six undergraduate students, six graduate students and six experienced clinicians; factors: experience and interval duration. Inter-judge data were unavailable for all three groups. Intra-judge agreement for total counts of stuttered intervals (0.2 s): means of 60–85% for undergraduates, 82–94% for graduates and 88–93% for experienced clinicians.

The remaining studies and judges were: Ingham et al. (1993a), 34 graduate and undergraduate students; Cordes and Ingham (1994a), 12 undergraduate students and 12 graduate students and clinicians; Cordes and Ingham (1995), 10 researchers and clinical researchers; Ingham et al. (1995), four researchers; Ingham et al. (1997), two judges; Ingham and Cordes (1997), 10 experienced researchers and clinic directors; Finn (1997), five graduate students in speech–language pathology; Fox et al. (2000), three judges; Brundage et al. (2006), 41 students and 31 clinicians; and Godinho et al. (2006), one experimenter and one research assistant. Across these studies, reported inter-judge agreement included means of 85% for experienced and inexperienced judges alike; 84% on occasion 1 and 86% on occasion 2 for all judges; agreement that rose with interval duration, from 50% of 1 s intervals agreed to 77% of 6 s intervals agreed; ranges of 83–98% and 86–93% for all judges; an overall inter-judge agreement of 76%; 89% of judges meeting the criterion of 80% agreement; and a mean interval-by-interval agreement of 93%. Brundage et al. (2006) found means of 87–88% for clinicians and 89–90% for students; agreement ranging from 75% to 97% and from 81% to 100% across speakers was also reported. In the comparison of audiovisual (AV) with audio-only (A) conditions, experienced judges' means were 82.1% (Set 1) and 57.1% (Set 2) for AV and 77.2% and 70.0% for A, whereas inexperienced judges' means were 78.6% on both sets for AV and 81.4% and 74.3% for A; the corresponding intra-judge means ranged from 76% to 94% (AV) and from 79% to 94% (A) for experienced judges, and from 79% to 96% (AV) and from 79% to 93% (A) for inexperienced judges. Several of the remaining intra-judge values were not calculated.
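For the interval-based results in tables B1 and B2, agreement is computed interval by interval rather than event by event. The sketch below is again illustrative only (the event times are hypothetical and a 5 s interval length is assumed): it converts each judge's stuttering-event onsets into stuttered/non-stuttered judgements over fixed intervals and then computes interval-by-interval agreement.

```python
import numpy as np

def interval_flags(event_times, sample_dur, interval_dur=5.0):
    """Mark each fixed-length interval as stuttered (True) if any stuttering
    event onset falls inside it; otherwise non-stuttered (False)."""
    n = int(np.ceil(sample_dur / interval_dur))
    flags = np.zeros(n, dtype=bool)
    for t in event_times:
        # Clamp so an onset at exactly sample_dur stays in the last interval
        flags[min(int(t // interval_dur), n - 1)] = True
    return flags

def interval_agreement(flags_a, flags_b):
    """Interval-by-interval agreement: proportion of intervals given the same
    stuttered/non-stuttered judgement by both judges."""
    return float(np.mean(flags_a == flags_b))

# Hypothetical stuttering-event onsets (in seconds) from two judges, 60 s sample
judge1_events = [2.1, 13.4, 27.9, 41.0]
judge2_events = [2.3, 13.1, 29.5, 41.2, 55.0]

a = interval_flags(judge1_events, sample_dur=60.0)
b = interval_flags(judge2_events, sample_dur=60.0)
print(interval_agreement(a, b))  # 11 of 12 five-second intervals agree (~0.92)
```

Shorter intervals make agreement harder to achieve, because two judges' slightly offset onsets fall into different intervals more often; this is consistent with the pattern reported above of 50% of 1 s intervals agreed rising to 77% of 6 s intervals agreed.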
