
Assessment of Speaking
APRIL GINTHER

Speaking is seen by language testers as the most difficult of the four language skills to
assess. Assessing speaking requires that we either observe a “live” oral performance or
that we capture the performance by some means for evaluation at a later time. A method
for elicitation must be selected, rating scales must be developed, and interviewers and/or
raters must be trained. Therefore, understanding the assessment of speaking requires
examination of assessment methods, scales, and raters.

Speaking Assessment Methods

Although, in a sense, all language tests are indirect measures of the abilities they are designed to assess,
Clark’s (1979) classification of language assessment methods as indirect, semidirect, and
direct has proven useful for understanding speaking assessment methods. Indirect tests
evaluate the skills and abilities that underlie an examinee’s performance by eliciting per-
formance on item types such as a multiple-choice main idea item to measure reading
comprehension. However, indirect methods do not lend themselves well, if at all, to the
assessment of speaking.
Direct tests evaluate speaking skills and abilities in actual performance. The classic
example of a direct assessment of speaking is an interview in which participants engage
in structured or semistructured interaction with an interlocutor/interviewer/rater. Speaking
assessment methods centered on interviews are collectively referred to as OPIs or oral
proficiency interviews. A very well-known OPI is the American Council on the Teaching of
Foreign Languages Oral Proficiency Interview, or the ACTFL OPI (ACTFL, 2009), and many
locally developed OPIs are modifications of the proficiency guidelines and elicitation
procedures associated with the ACTFL OPI. Common OPI structures involve a series of
warm-up questions followed by a series of increasingly difficult questions in which
examinees are expected to display correspondingly increasing levels of complexity
in their responses. Interviewers may be required to elicit a preselected set of responses, or
may decide to follow up on topics or comments that the participant has introduced, or
both. Examinee performance may be rated simultaneously by the interviewer or by addi-
tional raters who rate as the interview proceeds. When an audio or video recording is
made, responses can be rated after the interview is completed. A variation of the direct
method may require the examinee to give a presentation on a selected topic, and presen-
tations often include or require face-to-face engagement with members of an audience who
pose follow-up questions.
Direct methods are defined as “procedures in which the examinee is asked to engage
in face-to-face communicative exchanges with one or more human interlocutors” (Clark,
1979, p. 36). Direct tests have the perceived advantage of their elicitation of speaking skills
in a manner which duplicates “the setting and operation of the real-life situations in which
proficiency is normally demonstrated” (Shohamy, 1994, p. 100); that is, direct assessments
of speaking abilities have considerable face validity. However, an important but often
overlooked caveat is one that Clark (1979) identified early on: “In the interview situation,
the examinee is certainly aware that he or she is talking to a language assessor and not a
waiter, taxi driver, or personal friend” (p. 38). Indeed, the idea of the fidelity of OPIs to
natural conversation has been challenged by a number of researchers (Ross & Berwick,
1992; Johnson & Tyler, 1998), leading others to qualify OPIs as a specific genre of face-to-
face interaction (He & Young, 1998). Nevertheless, research on actual interaction in such
tests indicates the OPI genre does share important characteristics with natural conversation
(Lazaraton, 1992, 1997).
While OPIs have traditionally been administered with a single interviewer and a single
interviewee, speaking assessment of examinees in pairs, or even groups, has become the
object of a growing amount of attention from both researchers and language assessment
practitioners (Brooks, 2009; Ducasse & Brown, 2009). The procedure is often referred to
as “paired orals” or “group orals.” Such formats are argued to hold potential for increased
interactivity and authenticity relative to a one-on-one interview; however, the added
complexity complicates rating. Nevertheless, paired and group oral assessments have
successfully been incorporated into large-scale assessment programs (Hasselgreen, 2005;
Van Moere, 2006).
Some large-scale testing programs find any type of direct testing of speaking unwork-
able, and therefore choose semidirect methods, which do not require the presence of an
interlocutor to administer the test. Examinees are presented with a set of prerecorded
questions or tasks typically under laboratory conditions, and responses are recorded for
subsequent rating. The most obvious advantages of semidirect methods are their potential
for efficiency, time and cost savings, and high reliability. The absence of a human inter-
locutor decreases costs associated with administration, which can be considerable when
assessments must be conducted and scores reported in tight time frames. Potential savings,
however, may only be realized in the long run. The development and maintenance of
a semidirect computer-based system requires considerable technical expertise. A more
important advantage of the use of semidirect methods is the potential reduction of construct-
irrelevant variance associated with interviewer effects. Because speaking performances
may be influenced variably and in unintended ways by the interviewer’s technique and
demeanor, the use of semidirect methods may even the playing field for all examinees.
Researchers comparing direct and semidirect OPI testing methods have reported strong,
positive correlations (.89 to .95), leading Stansfield (1991) and Stansfield and Kenyon (1992)
to argue that the methods are largely equivalent, statistically speaking. However, qualitative
analyses have revealed differences in the language produced. Semidirect
responses tend to display greater formality and more cohesion while being accompanied
by longer pauses and hesitations (Shohamy, 1994; Koike, 1998). Research investigating
examinees’ affect and perceptions (Brown, 1993; Qian, 2009) has found that examinees
tend to prefer direct methods, reporting that they find interaction with a recorder or com-
puter forced and unnatural. This preference may abate as examinees become more famil-
iar with the large-scale semidirect methods introduced in 2005 in the TOEFL iBT speaking
subsection and as the use of software to develop local, semidirect computer-based programs
increases. In any case, appropriate examinee preparation is important when any semidirect
method is employed. Ultimately, Shohamy (1994) concludes that the selection of a direct
or semidirect method is dependent on four related concerns: accuracy (a function of reli-
ability), utility (the assessment’s relation to instruction and the difficulties associated with
rater training), feasibility (ease and cost of administration), and fairness.

Rating Scales and Scale Descriptors

Assessment of speaking requires assigning numbers to the characteristics of the speech
sample in a systematic fashion through the use of a scale. A scale represents the range of
values that is associated with particular levels of performance, and scaling rules represent
the relationship between the characteristic of interest and the value assigned (Crocker &
Algina, 1986). The use of a scale for measurement is more intuitively clear in familiar
domains apart from language ability. For example, we can measure weight very accurately.
While the use of pounds or kilograms is usually sufficient for measuring the weight of
adults, when measuring babies, we move to smaller and more precise units—ounces or
grams. The characteristics of the objects measured and our need for accuracy determine
the units of measurement selected, and, in turn, scaling rules describe the units of meas-
urement we employ.
Measurement of a speaking performance, however, requires a different kind of scale,
such as those used in rating performance in sports competitions. For example, the quality
of a figure skater’s performance in the Olympics is based on rank; there is no equal-interval
unit of measurement comparable to ounces or pounds that allows precise measurement
of a figure skating performance. Likewise, assessing speaking is generally considered an
endeavor that ranks students into ordinal categories (often referred to as vertical categories)
such as bronze, silver, or gold; short, medium, or tall; third, second, or first; in instructional
contexts involving language, the more familiar labels are beginning, intermediate, and
advanced.
Such a global assessment of performance would result from the use of a holistic scale.
To clarify what such a global assessment means, the abilities associated with scale levels
are represented by level descriptors, which provide a qualitative summary of the raters'
observations. In order to facilitate description, benchmark performances are selected to
exemplify the levels and their descriptors. Such descriptors are typically associated with,
but are not limited to, descriptions of the following components of a speaking performance
at different levels of the scale: pronunciation (focusing on segmentals); phonological con-
trol (focusing on suprasegmentals); grammar/accuracy (morphology, syntax, and usage);
fluency (speed and pausing); vocabulary (range and idiomaticity); coherence; and organiza-
tion. If the assessment involves evaluation of interaction, the following may also be included:
turn-taking strategies, cooperative strategies, and asking for or providing clarification when
needed.
Holistic vertical indicators, even when accompanied by scale descriptors and benchmarks,
may not be sufficient for making instructional or placement decisions. In such cases, an
analytic rating is done to produce scores on components of the examinee’s performance.
The specific components chosen, which can include any of the same aspects of performance
used in holistic scale descriptors, will depend on the purpose of the test and the needs of
the score users.
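
To illustrate the distinction, the sketch below shows one way analytic component ratings
might be recorded and combined into a reported score. The component names, the 1–5 band
range, and the equal weighting are illustrative assumptions rather than features of any
particular operational scale.

```python
# A minimal sketch of analytic scoring, assuming illustrative components,
# a 1-5 band for each, and equal weights; operational rubrics will differ.
COMPONENTS = ("pronunciation", "fluency", "grammar", "vocabulary", "coherence")

def composite_score(ratings: dict) -> float:
    """Average the component bands into a single reported value."""
    if set(ratings) != set(COMPONENTS):
        raise ValueError("every component needs exactly one rating")
    if not all(1 <= band <= 5 for band in ratings.values()):
        raise ValueError("bands are assumed to run from 1 to 5")
    return sum(ratings.values()) / len(ratings)

# Hypothetical examinee: strong fluency, weaker grammar.
profile = {"pronunciation": 4, "fluency": 5, "grammar": 3,
           "vocabulary": 4, "coherence": 4}
print(composite_score(profile))  # 4.0
```

Unlike a single holistic band, the component profile itself (here, the lower grammar band)
remains available to score users for placement or instructional decisions; whether and how
components are weighted is itself a design choice tied to the purpose of the test.
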
Score users are of central concern in Alderson’s (1991) distinction among three types of
scales: user-oriented, assessor-oriented, and constructor-oriented. The language used to
describe abilities tends to focus on the positive aspects of performance in user-oriented
and constructor-oriented scales where the former may focus on likely behaviors at a given
level and the latter may focus on particular tasks that may be associated with a curriculum
or course of instruction. Assessor-oriented scales shift the focus from the learner and the
objectives of learning toward the rater; the scale descriptors are often negatively worded
and focus on the perceptions of the rater. These scales tend to be more useful for screening
purposes.
From another perspective, scales for speaking assessments can be theoretically oriented,
empirically oriented, or a combination of both. The starting point for all assessments is
usually a description or theory of language ability (e.g., Canale & Swain, 1980; Bachman,
1990). These broad theoretical orientations are then narrowed down to focus on a particular
skill and particular components of skills. Empirical approaches to the development and
validation of speaking assessment scales involve identification of characteristics of interest
for the subsequent development of scale levels (e.g., Chalhoub-Deville, 1995; Fulcher, 1996)
and/or explications of characteristics of assigned ability levels (e.g., Xi & Mollaun, 2006;
Iwashita, Brown, McNamara, & O’Hagan, 2008; Ginther, Dimova, & Yang, 2010).
Specific-purpose scales are often derived from general guidelines and frameworks. For
example, The ACTFL Proficiency Guidelines (2009) serve as a starting point for the ACTFL
OPI scale. Another influential framework is the Common European Framework of Reference
for Languages (Council of Europe, 2001). The CEFR is a collection of descriptions of lan-
guage ability, ranging from beginning to advanced, across and within the four main skills.
The document is comprehensive and formidable in scope, but in spite of its breadth, the
CEFR has been used to construct scales for assessing language performance, communicate
about levels locally and nationally (Figueras & Noijons, 2009), and interpret test scores
(Tannenbaum & Wylie, 2008).
The most familiar part of the framework presents six stages of proficiency from A1
(Breakthrough) to C2 (Mastery) in general terms. Descriptors are provided in a series of
tables for each skill and then each skill is additionally broken down into subskill descrip-
tions that can be adapted for use in the creation of scales for specific purposes within
specific contexts. For example, with respect to speaking, the framework presents separate
tables for overall oral production standards, for understanding a native-speaking inter-
locutor, for conversation, for informal discussion with friends, for formal discussions and
meetings, for goal-oriented cooperation, for transactions to obtain goods and services, for
information exchange, and for interviewing and being interviewed. Perhaps of greatest
interest here is the table of “qualitative aspects of spoken language use,” in which aspects
of range (vocabulary), accuracy, fluency, interaction, and coherence are described at each
CEFR level.

Raters

In order for raters to achieve a common understanding and application of a scale, rater
training is an important part of assessing speaking. As the standard for speaking assess-
ment procedures involving high-stakes decisions is an inter-rater reliability coefficient of
0.80 rather than perfect agreement, some variability among raters is expected and tolerated. Under optimal conditions,
the sources of error that can be associated with the use of a scale are expected to be random
rather than systematic. Therefore, research aims to identify and control systematic error
resulting from rater performance.
One type of systematic error results from a rater’s tendency to assign either harsh or
lenient scores. When a pattern is identified in comparison to other raters in a pool, a rater
may be identified as negatively or positively biased. Systematic effects with respect to score
assignment have been found in association with rater experience, rater native language
background, and also examinee native language background. Every effort should be made
to identify and remove such effects, as their presence negatively affects the accuracy, utility,
interpretability, and fairness of the scores we report.
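
As a rough illustration of the difference between random and systematic rater error, the
sketch below computes each rater's mean deviation from the pool and a simple Pearson
correlation between raters, assuming that the same examinees have been scored by every
rater on a common numeric scale. The data are invented, and operational programs typically
rely on more elaborate approaches (e.g., many-facet Rasch measurement).

```python
# Illustrative screen for rater severity and agreement, assuming every rater
# has scored the same examinees on a common numeric scale; data are invented.
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores: three raters, five shared examinees, 1-5 scale.
ratings = {
    "rater_A": [3, 4, 2, 5, 4],
    "rater_B": [3, 4, 3, 5, 4],
    "rater_C": [2, 3, 2, 4, 3],  # consistently about one band lower
}

pool_mean = mean(s for scores in ratings.values() for s in scores)
for rater, scores in ratings.items():
    # Positive deviation suggests leniency; negative deviation suggests severity.
    print(rater, "mean deviation from pool:", round(mean(scores) - pool_mean, 2))

print("A vs. B correlation:", round(pearson(ratings["rater_A"], ratings["rater_B"]), 2))
print("A vs. C correlation:", round(pearson(ratings["rater_A"], ratings["rater_C"]), 2))
```

In this invented example, rater_C largely preserves the rank order of the examinees (a high
correlation with the other raters) while scoring roughly a full band lower on average: exactly
the kind of systematic severity that a reliability coefficient alone will not reveal.
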
With fairness at issue, researchers have studied factors affecting ratings. Brown (1995)
compared differences across Japanese language teachers and professional tour guides in
their assignment of scores to 51 Japanese tour guide candidates. While no differences were
found in the scores assigned, the two pools of raters did apply different criteria in their
score assignments: teachers tended to focus on grammar, vocabulary, and fluency while
tour guides tended to focus on pronunciation. Chalhoub-Deville (1995) examined the
performance of three rater groups who differed in professional background and place of
residence (Arab teachers residing in the USA and nonteaching Arabs residing in the USA
or in Lebanon) and found a tendency for the teachers to rate grammar more harshly in
comparison to the nonteaching groups who emphasized communicative success. Chalhoub-
Deville and Wigglesworth (2005) compared native-speaking English raters from Australia,
Canada, the UK, and the USA and found raters from the UK harshest while raters from
the USA were the most lenient. Differences in raters’ application of a scale have been found
not only across raters of different backgrounds and experiences, but also across trained
raters of similar backgrounds.
Studies comparing native speakers and non-native speakers as raters have produced mixed
findings. While some studies have identified tendencies for non-native speakers to assign
harsher scores (Ross, 1979), others have found the opposite to be the case (Winke, Gass,
& Myford, 2011). In Winke et al., raters with first language backgrounds that matched
those of the candidates were found more lenient when rating second language English
oral proficiency, and the authors suggest that this effect may be due to familiarity with
accent. In an attempt to ameliorate such potential effects, Xi and Mollaun (2009) provided
special training for Indian raters who were evaluating the English language responses of
Indian examinees on the TOEFL iBT. While the performance of the Indian raters was found
comparable to that of Educational Testing Service raters both before and after the training,
the Indian raters showed some improvement and increased confidence after participating
in the training. Far fewer studies have been conducted on differences in ratings assigned
by interviewers; however, there is no reason to expect that interviewers would be less
subject to interviewer effects than raters are to rater effects. Indeed, in an examination of
variability across two interviewers with respect to how they structured the interview, their
questioning techniques, and the feedback they provided, Brown (2003) identified differ-
ences that could easily result in different score assignments as well as differences in
interpretations of the interviewee’s ability.
These findings underscore the importance of rater training; however, the positive effects
of training tend to be short-lived. In a study examining rater severity over time, Lumley
and McNamara (1995) found that many raters tended to drift over time. The phenomenon
of rater drift calls into question the practice of certifying raters once and for all after
successfully completing only a single training program and highlights the importance of
ongoing training in order to maintain rater consistency. A more important concern raised
by studies of rater variability—one that can only be partially addressed by rater training—
is the question of whose standard, whether that of an experienced rater, an inexperienced
rater, a teacher, a native speaker, or a non-native speaker, is the more appropriate one
to apply.

Conclusion

Bachman (1990) argued that in order to

maximize the reliability of test scores and the validity of test use, we should . . . provide
clear and unambiguous theoretical definitions of the abilities we want to measure and
specify precisely the conditions, or operations that we will follow in eliciting and observ-
ing performance. (p. 90)

Choices with respect to methods, scales, and raters are therefore critical in assessment of
speaking.

SEE ALSO: Assessment of Listening; Assessment of Reading; Assessment and Testing:
Overview; Assessment of Writing; Fluency; Intelligibility; Language Assessment Methods;
Pronunciation Assessment; Rating Oral Language; Rating Scales for Language Tests

References

Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language testing
in the 1990s: The communicative legacy (pp. 71–86). London, England: Macmillan.
American Council on the Teaching of Foreign Languages (ACTFL). (2009). Testing for proficiency:
The ACTFL Oral Proficiency Interview. Retrieved December 1, 2009, from http://www.actfl.org/
i4a/pages/index.cfm?pageid=3348
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, England: Oxford
University Press.
Brooks, L. (2009). Interacting in pairs in a test of oral proficiency: Co-constructing a better
performance. Language Testing, 26(3), 341–66.
Brown, A. (1993). The role of test-takers’ feedback in the test development process: Test-takers’
reactions to a tape-mediated test of proficiency in spoken Japanese. Language Testing, 10(3),
277–301.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific
performance test. Language Testing, 12(1), 1–15.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language
Testing, 20(1), 1–25.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second
language teaching and testing. Applied Linguistics, 1(1), 1–47.
Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater
groups. Language Testing, 12(1), 16–21.
Chalhoub-Deville, M., & Wigglesworth, G. (2005). Rater judgment and English language speak-
ing proficiency. World Englishes, 24(3), 383–91.
Clark, J. L. D. (1979). Direct vs. semi-direct tests of speaking ability. In E. J. Briere & F. B. Hinofotis
(Eds.), Concepts in language testing: Some recent studies (pp. 35–49). Washington, DC: TESOL.
Council of Europe. (2001). The common European framework of reference for languages:
Learning, teaching, and assessment. Cambridge, England: Cambridge University Press.
Retrieved April 13, 2011 from http://www.coe.int/t/dg4/linguistic/source/Framework_
EN.pdf
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA:
Wadsworth Group/Thomson Learning.
Ducasse, A. M., & Brown, A. (2009). Assessing paired orals: Raters’ orientation to interaction.
Language Testing, 26(4), 423–43.
Figueras, N., & Noijons, J. (Eds.). (2009). Linking to the CEFR levels: Research perspectives. Arnhem,
Netherlands: Cito/EALTA.
Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating
scale construction. Language Testing, 13(2), 208–38.
Ginther, A., Dimova, S., & Yang, R. (2010). Conceptual and empirical relationships between
temporal measures of fluency and oral English proficiency with implications for automated
scoring. Language Testing, 27(3), 379–99.
Hasselgreen, A. (2005). Testing the spoken English of young Norwegians: A study of test validity and
the role of “smallwords” in contributing to pupils’ fluency. New York, NY: Cambridge University
Press.
He, A. W., & Young, R. (1998). Language proficiency interview: A discourse approach. In
R. Young & A. W. He (Eds.), Talking and testing: Discourse approaches to the assessment of oral
proficiency (pp. 1–26). Philadelphia, PA: John Benjamins.
Iwashita, N., Brown, A., McNamara, T., & O’Hagan, S. (2008). Assessed levels of second language
speaking proficiency: How distinct? Applied Linguistics, 29(1), 24–49.
Johnson, M., & Tyler, A. (1998). Re-analyzing the OPI: How much does it look like natural
conversation? In R. Young & A. W. He (Eds.), Talking and testing: Discourse approaches to the
assessment of oral proficiency (pp. 27–51). Philadelphia, PA: John Benjamins.
Koike, D. A. (1998). What happens when there’s no one to talk to? Spanish foreign language in
simulated oral proficiency interviews. In R. Young & A. W. He (Eds.), Talking and testing:
Discourse approaches to the assessment of oral proficiency (pp. 69–101). Philadelphia, PA: John
Benjamins.
Lazaraton, A. (1992). The structural organization of a language interview: A conversation analytic
perspective. System, 20(3), 373–86.
Lazaraton, A. (1997). Preference organization in oral proficiency interviews. Research on Language
and Social Interaction, 30(1), 53–72.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for
training. Language Testing, 12(1), 54–71.
Qian, D. D. (2009). Comparing direct and semi-direct modes for speaking assessment: Affective
effects on test takers. Language Assessment Quarterly, 6(2), 113–25.
Ross, J. R. (1979). Where’s English? In C. J. Fillmore, D. Kempler, & W. S. Y. Wang (Eds.),
Individual differences in language ability and language behavior (pp. 127–63). New York, NY:
Academic Press.
Ross, S., & Berwick, R. (1992). The discourse of accommodation in oral proficiency interviews.
Studies in Second Language Acquisition, 14(2), 159–76.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11(2),
99–123.
Stansfield, C. (1991). A comparative analysis of simulated and direct oral proficiency interviews.
In S. Anivan (Ed.), Current developments in language testing (pp. 199–209). Singapore: RELC.
Stansfield, C., & Kenyon, D. (1992). Research on the comparability of the oral proficiency
interview and the simulated oral proficiency interview. System, 20(3), 329–45.
Tannenbaum, R., & Wylie, E. C. (2008). Linking English-language test scores onto the common
European framework of reference: An application of standard-setting methodology. Princeton, NJ:
Educational Testing Service.
Van Moere, A. (2006). Validity evidence in a university group oral test. Language Testing, 23(4),
411–40.
Winke, P., Gass, S. M., & Myford, C. (2011). The relationship between raters’ prior language study
and the evaluation of foreign language speech samples. Princeton, NJ: Educational Testing Service.
Xi, X., & Mollaun, P. (2006). Investigating the utility of analytic scoring for the TOEFL academic
speaking test (TAST) (ETS Research Report). Princeton, NJ: Educational Testing Service.
Xi, X., & Mollaun, P. (2009). How do raters from India perform in scoring the TOEFL iBT speaking
section and what kind of training helps. Princeton, NJ: Educational Testing Service.

Suggested Readings

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA:
Wadsworth Group/Thomson Learning.
Fulcher, G. (2003). Testing second language speaking. Harlow, England: Pearson Longman.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment:
Expectations and validation criteria. Educational Researcher, 20(8), 15–21.
McNamara, T. (1996). Measuring second language performance. London, England: Longman.
