Chapelle 2010
Drawing on experience between 2000 and 2007 in developing a validity argument for the high-stakes Test of English as a Foreign Language™ (TOEFL®), this paper evaluates the differences between the argument-based approach to validity as presented by Kane (2006) and that described in the 1999 AERA/APA/NCME Standards for Educational and Psychological Testing. Based on an analysis of four points of comparison—framing the intended score interpretation, outlining the essential research, structuring research results into a validity argument, and challenging the validity argument—we conclude that an argument-based approach to validity introduces some new and useful concepts and practices.
Keywords: claim, construct, high-stakes, inference, interpretive argument, Standards, TOEFL, validity argument
Copyright © 2010 by the National Council on Measurement in Education
Table 1. Key Aspects in the Process of Validation in the Standards (1999) and in Educational Measurement (Kane, 2006)

Four aspects characterizing approaches to validity | Standards (1999) | Kane (2006)
Framing the intended score interpretation | A construct | An interpretive argument
Outlining the essential research | Propositions consistent with the intended interpretation | Inferences and their assumptions
Structuring research results into a validity argument | Listing types of evidence | Series of inferences linking grounds with conclusions
Challenging the validity argument | Counterevidence for propositions | Refuting the argument
organizing concept of an "interpretive argument," which does not rely on a construct, proved to be useful.

A Construct

In the Standards, the term "construct" refers broadly to "the concept or characteristic that a test is designed to measure" (1999, p. 5), and "proposed interpretation" is used interchangeably with "construct" (e.g., 1999, p. 9). This approach is consistent with views in educational measurement in the early 1990s that a theoretical construct should provide the basis for score interpretation for a large-scale, high-stakes test (e.g., Messick, 1994). Like experts in other areas, language assessment experts have attempted to capture the dynamic and context-mediated nature of the construct measured by a test such as the TOEFL. Discussions of how to do so (e.g., Stansfield, 1986) ultimately led to the launch of a project to develop a revised TOEFL that would be based on modern conceptions of language proficiency. Designers of the new TOEFL agreed that theoretical rationales underlying score interpretation needed to come from a construct of language proficiency, which would therefore serve as a basis for test design.

The idea that a language proficiency construct should underlie test development and validation has strong support in language testing. For example, Alderson, Clapham, and Wall (1995) tie the construct to both the test specifications and validation: "For validation purposes, the test specifications need to make the theoretical framework which underlies the test explicit, and to spell out relationships among its constructs, as well as the relationship between the theory and the purpose for which the test is designed" (p. 17). Bachman and Palmer (1996) include definition of the construct in the fourth step of language test development: ". . . a theoretical definition of the construct . . . provides the basis for considering and investigating the construct validity of the interpretations we make of test scores. This theoretical definition also provides a basis for the development . . . of test tasks" (p. 89). Weir (2005) ties the construct to validation: "The more fully we are able to describe the construct we are attempting to measure at the a priori stage the more meaningful might be the statistical procedures contributing to construct validation that can subsequently be applied to the results of the test. . . . We can never escape from the need to define what is being measured, just as we are obliged to investigate a test in operation" (p. 18).

Despite agreement on the need to define the construct as a basis for test development, no agreement exists concerning a single best way to define constructs of language proficiency to serve as a defensible basis for score interpretation (e.g., Bachman, 1990; Bachman & Palmer, 1996; Chapelle, 1998; Chalhoub-Deville, 1997, 2001; Oller, 1979; McNamara, 1996). Nevertheless, most would agree that limiting a construct of language proficiency to a trait such as knowledge of vocabulary or listening is too narrow for the interpretations that test users want to make for university admissions decisions. Instead, test users are typically interested in examinees' ability to use a complex of knowledge and processes to achieve particular goals. Therefore, strategies or processes of language use have been included in constructs of language proficiency, called communicative competence (Canale & Swain, 1980) or communicative language ability (Bachman, 1990). Applied linguists would also agree that language proficiency needs to be defined in terms of contexts of performance because second language learners can be proficient in some contexts but lack proficiency in others (Cummins, 1983). A conceptualization of language proficiency that recognizes one trait (or even a complex of abilities) as responsible for performance across all contexts fails to account for the variation in performance observed across these different contexts of language use (Bachman & Palmer, 1996; Chalhoub-Deville, 1997; Chapelle, 1998; McNamara, 1996; Norris, Brown, Hudson, & Bonk, 2002; Skehan, 1998). As a result, language proficiency constructs of interest are difficult to define in a precise way (e.g., as construct representation; Embretson, 1983). Bachman (2007) provides an analysis of the issues and the various perspectives on construct definition that have been presented over the past fifty years in applied linguistics.

This historical perspective reveals the challenge the TOEFL project faced in attempting to use a construct definition developed by applied linguists as a basis of the validity argument. But this challenge is not unique to constructs in applied linguistics. Borsboom (2006) points out that "psychological theories are often simply too vague to motivate psychometric models" (p. 437), and without a specified psychometric model underlying score interpretation, what kind of evidence should be sought for the validity argument? When the construct underlying test score interpretation is so complex, how can a validity argument be formulated? After struggling with these questions, we welcomed the different perspective offered by Kane for approaching the problem of score interpretation.

An Interpretive Argument

Kane's approach does not require a construct per se but rather an explicitly stated interpretation called an interpretive argument (Kane, 1992, p. 527; 2001; Kane et al., 1999). He describes the interpretive argument as consistent with the general principles accepted for construct validity that appear in the Standards: "Validation requires a clear statement of the proposed interpretations and uses" (Kane, 2006, p. 23).
[FIGURE 2. Example grounds and conclusion for an evaluation inference: the grounds (the student's spoken response) support, through an evaluation inference, the conclusion "Observed Score: The student's spoken response received a score of 2." Adapted from Chapelle, Enright, and Jamieson, 2008, p. 11. Reprinted courtesy of Taylor and Francis.]
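To make the structure in Figure 2 concrete, here is a minimal sketch, in Python, of an inference as a link from grounds to a claim. The class and field names are our own illustrative choices, not part of Kane's framework or the TOEFL documentation.

```python
from dataclasses import dataclass

@dataclass
class Inference:
    """One link in an interpretive argument: grounds -> claim."""
    name: str     # the kind of inference, e.g., "evaluation"
    grounds: str  # what the inference starts from
    claim: str    # the conclusion the inference licenses

# The example from Figure 2: an evaluation inference moves from the
# student's spoken response (grounds) to an observed score (claim).
evaluation = Inference(
    name="evaluation",
    grounds="The student's spoken response to the speaking task",
    claim="Observed score: the spoken response received a score of 2",
)
```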
. . . categories of evidence based on test content, response processes, internal structure, relations to other variables, and consequences, as outlined by Messick (1989). Each of these lines of evidence suggests methodologies, and there are plenty of examples of these in language testing and elsewhere. However, as Shepard (1993) pointed out, the idea that "many lines of evidence can contribute" offers a large set of options rather than guidance to an efficient and effective path for validation.

According to the Standards, "The decision about what types of evidence are important for validation in each instance can be clarified by developing a set of propositions that support the proposed interpretation for a particular purpose of testing" (AERA/APA/NCME, 1999, p. 9). For example, if one wishes to make the proposition that the test score distinguishes among examinees at different English ability levels, the validation research must provide data indicating that this is actually the case. The proposition guides the researcher to produce supporting evidence. These propositions are to serve as hypotheses about score interpretations, which would provide guidance about the types of validity evidence required. In the Standards, six examples of propositions are given for a mathematics achievement test used to test readiness for an advanced course. They include statements such as "that certain skills are prerequisite for the advanced course" and "that test scores are not unduly influenced by ancillary variables" (p. 9).

Drawing on these examples, we developed the following propositions:

1. Certain language skills defined as listening, reading, speaking, and writing, both independently and in combination, are necessary (but not sufficient) for students to succeed in advanced academic settings.
2. The content domain of the tasks on the TOEFL requires the English language skills students need to succeed in English-speaking North American university settings.
3. Each of the skills—listening, reading, speaking, and writing—is composed of a set of subskills.
4. Test tasks comprising each skill score exhibit internal consistency.
5. Each of the four skills is distinct enough from the others to be measured independently, but the skills are related by some core competencies.
6. Test performance is not affected by test-taking processes irrelevant to the constructs of interest.
7. Test scores are arrived at through judgments of appropriate aspects of learners' performance.
8. Test performance is not affected by examinees' familiarity with computer use.
9. Test performance is not affected inappropriately by background knowledge of the topics represented on the test.
10. The test assesses second language abilities independent of general cognitive abilities.
11. Criterion measures can validly assess the linguistic aspects of academic success.
12. Test scores are positively related to criterion measures of success.
13. Use of the test will result in positive washback in ESL/EFL instruction, such as increased emphasis on speaking and writing and focus on academic language.

We stopped at number 13, recognizing that the list could go on and on unless we gained a better sense of what a proposition should consist of and how many one would like to have for a good validation argument. Moreover, in the absence of such guidelines we found that the propositions we were generating were influenced by the validation research that had been completed, and were therefore unlikely to help identify areas where more research was needed. On the one hand, contextualizing research was precisely what needed to be done, but on the other hand this process seemed to start from the perspective of completed research rather than from the perspective of score meaning. In short, our examination of the Standards and the materials that had led up to them demonstrated the need for more explicit guidance on how to formulate an intended interpretation and the propositions that are supposed to point to the types of evidence that would ultimately contribute to the TOEFL validity argument.

Inferences and Their Assumptions

Kane's approach to identifying framing statements is to connect them to the inferences in the interpretive argument through the use of two types of statements: warrants and assumptions. Taking the example in Figure 2, the evaluation inference would be supported by a warrant such as "Observations of performance on the speaking task are evaluated to provide a score reflective of the relevant language abilities." Here, the warrant is the generally held principle that hesitations and mispronunciations are characteristics of students with low levels of speaking ability who would have trouble at an American university. Such a warrant is a statement which rests on assumptions that need to be supported in order for the inference to be made. A warrant is a law, a generally held principle, a rule of thumb, or an established procedure. Assumptions would be, for example, that the rubric for scoring the responses was appropriate for providing the relevant evidence of ability. Assumptions prompt research that focuses on particular issues. In this case, the research would need to provide evidence for the accuracy and relevance of the rating of the student's performance.

As shown in Table 2, we identified six inferences, each with a warrant and assumptions, that form the basis for the TOEFL interpretive argument. Each of the inferences is used to move from grounds to a claim; each claim becomes grounds for a subsequent claim. For example, a generalization inference connects the grounds of an observed score which reflects the relevant aspects of performance with a claim that the observed score reflects the expected score across tasks, occasions, and raters. Rather than state all of these grounds and claims, which are linked in a formulaic way to types of inferences, Table 2 focuses on the warrants and assumptions which need to be generated by the researcher to guide the validity research. Discussion of the inferences as the central building blocks for the interpretive argument appears in Kane et al. (1999), and the specific statements used as grounds and claims in the TOEFL validity argument appear in Chapelle (2008).
The intended score interpretation is based on a domain description inference, which has a warrant that observations of performance on the TOEFL reveal relevant knowledge, skills, and abilities in situations representative of those in the target domain of language use in the English-medium institutions of higher education. Kane does not define domain description as an inference, but he points out that ". . . if the test is intended to be interpreted as a measure of competence in some domain, then efforts to describe the domain carefully and to develop items that reflect the domain (in terms of content, cognitive level, and freedom from potential sources of systematic errors) tend to support the intended interpretation" (Kane, 2004, p. 141). Because this inference was critical to score interpretations for the TOEFL, we included it in the TOEFL interpretive argument. The validity of that inference rests on the assumptions that assessment tasks that are representative of the academic domain can be identified, that critical English language skills, knowledge, and processes needed for study in English-medium colleges and universities can be identified, and that assessment tasks that require important skills and are representative of the academic domain can be developed.

The second inference in the argument is evaluation, which has the warrant that observations of performance on TOEFL tasks are evaluated to provide observed scores reflective of targeted language abilities. The quality of evaluation inferences rests on the assumption that "the criteria used to score the performance are appropriate and have been applied as intended and second, that the performance occurred under conditions compatible with the intended score interpretation" (Kane et al., 1999, p. 9). In other words, if the test administration conditions are not favorable or if they vary for test takers, then the intended interpretation of a test taker's score may not be supported. The assumptions underlying this inference are that rubrics for scoring responses are appropriate for providing evidence of targeted language abilities, that task administration conditions are appropriate for providing evidence of targeted language abilities, and that the statistical characteristics of items, measures, and test forms are appropriate for norm-referenced decisions.

Third, the interpretive argument includes an inference of generalization with the warrant that observed scores are estimates of expected scores over the relevant parallel versions of tasks and test forms and across raters, as presented by Kane et al. (1999, p. 10). In the example shown in Figure 2, the expected score on similar test tasks would be a "2." The "2" across all tasks was not observed, but instead was inferred on the basis of the single observed "2." Of course, in most test settings, one would not attempt to generalize on the basis of one task. In the TOEFL interpretive argument, generalization rests on the assumptions that (1) a sufficient number of tasks are included on the test to provide stable estimates of test takers' performances, (2) the configuration of the measures in terms of the tasks included is appropriate for the intended interpretation, (3) appropriate scaling and equating procedures for test scores are used, and (4) task and test specifications are well defined so that parallel tasks and test forms are created.

Fourth, the explanation inference in the TOEFL interpretive argument has a warrant that the expected scores are attributed to a construct of academic language proficiency. Kane (2001) maintains that an interpretive argument can be developed without including explanation, but suggests that in cases where a construct is used in score interpretation, as is often the case in language assessment (e.g., Bachman, 2002), it should be specified as an explanation inference. An explanation inference is included in the interpretive argument so that assumptions associated with it can be identified and investigated. The assumptions identified were that (1) the linguistic knowledge, processes, and strategies required to successfully complete tasks vary across tasks in keeping with theoretical expectations; (2) task difficulty is systematically influenced by task characteristics; (3) performance on new test measures relates to performance on other test-based measures of language proficiency as expected theoretically; (4) the internal structure of the test scores is consistent with a theoretical view of language proficiency as a number of highly interrelated components; and (5) test performance varies according to the amount and quality of experience in learning English.

Fifth, the interpretive argument includes an extrapolation inference with the warrant that the construct of academic language proficiency as assessed by TOEFL accounts for the quality of linguistic performance in English-medium institutions of higher education. This is consistent with the Kane et al. (1999) description of extrapolation: the inference that is made when the test takers' expected scores are interpreted as indicative of performance and scores that they would receive in the target setting. The assumption underlying this inference in the TOEFL interpretive argument is that performance on the test is related to other criteria of language proficiency in the academic context.

Finally, utilization links the target score to the decisions about examinees for which the score is used. This "decision-making" is viewed by Kane as an inference, which underscores the need for evidence to support it in an overall validity argument. Kane distinguishes the "descriptive interpretations" (Kane, 2002, p. 32) of the other inferences from the policy-based interpretations inherent in test score use. The latter "involve assumptions supporting the decision procedure's suitability as a policy [e.g., requiring a TOEFL score as one part of the application of international students at American universities], and policies are typically justified by claims about their consequences" (Kane, 2002, p. 32). Bachman (2005) referred to decision-making as "utilization," which is the term we adopted in view of its grammatical parallelism with the terms used for the other inferences. The utilization inference in the TOEFL interpretive argument is made on the basis of assumptions that the meaning of the test scores is clearly interpretable by admissions officers, test takers, and teachers, and that the test will have a positive influence on how English is taught.

This interpretive argument with six inferences helped to surmount the challenges presented by our attempt to generate relevant propositions as advised by the Standards. Based on the six inferences shown in Table 2, we were able to identify the assumptions in need of support in the TOEFL validity argument. The assumptions in the third column of Table 2 are similar in type and form to the propositions we had brainstormed under the guidance of the Standards, but the important difference is that each assumption is tied to a particular inference. In other words, assumptions are developed in a more principled way as a piece of an interpretive argument. In the initial brainstorming session of propositions, we tended to look to the existing evidence and work backwards to formulate a proposition. Working backwards from the existing evidence is not a good way of ending up with a fair and balanced validity argument which can identify areas where additional research is needed. The number of possible assumptions associated with each inference can vary, but the issues they should focus on are constrained more productively than propositions guided by construct theory alone. Moreover, a finite number of . . .
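The chaining described above—each claim becoming the grounds for the next inference—can be sketched schematically. The following is our illustration of the six-inference chain under the warrants summarized in this section; the identifiers are invented for exposition and are not part of the TOEFL interpretive argument itself.

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    inference: str  # e.g., "generalization"
    warrant: str    # generally held principle licensing the step
    assumptions: list[str] = field(default_factory=list)  # need backing

# The six inferences in order. Each link's claim becomes the grounds
# for the next: target domain -> observation -> observed score ->
# expected score -> construct -> target score -> decisions (use).
toefl_chain = [
    Link("domain description",
         "TOEFL performances reveal relevant knowledge, skills, and "
         "abilities in situations representative of the target domain"),
    Link("evaluation",
         "Performances are evaluated to provide observed scores "
         "reflective of targeted language abilities"),
    Link("generalization",
         "Observed scores estimate expected scores over parallel "
         "tasks, test forms, and raters"),
    Link("explanation",
         "Expected scores are attributed to a construct of academic "
         "language proficiency"),
    Link("extrapolation",
         "The construct accounts for the quality of linguistic "
         "performance in English-medium institutions"),
    Link("utilization",
         "Scores are clearly interpretable and support admissions "
         "decisions and instruction"),
]
```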
[FIGURE 3. The structure of the TOEFL validity argument, read from the GROUNDS at the bottom to the CONCLUSION at the top, with the backing supporting each inference:

GROUNDS: The target language use domain.

Domain Description inference, backed by: (1) applied linguists identified academic domain tasks, and research showed teachers and learners thought these tasks were important; (2) applied linguists identified language abilities required for academic tasks; (3) a systematic process of task design and modeling was engaged by experts.

→ Observation

Evaluation inference, backed by: (1) rubrics were developed, trialed, and revised based on expert consensus; (2) multiple task administration conditions were developed, trialed, and revised based on expert consensus; (3) statistical characteristics of tasks and measures were monitored throughout test development, and modifications in tasks and measures were made as needed.

→ Observed Score

Generalization inference, backed by: (1) results from reliability and generalizability studies indicated the number of tasks required; (2) a variety of task configurations was tried to find a stable configuration; (3) various rating scenarios were examined to maximize efficiency; (4) an equating method was identified for the listening and the reading measures; (5) an ECD process yielded task shells for producing parallel tasks.

→ Expected Score

CONCLUSION: The test score reflects the ability of the test taker to use and understand English as it is spoken, written, and heard in college and university settings. The score is useful for aiding in admissions and placement decisions and for guiding English-language instruction.]
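Figure 3 also makes explicit that completed studies serve as backing for particular inferences rather than for the validity argument as a whole. A hedged sketch of that attachment, abbreviating the backing statements from the figure (our illustration):

```python
# Backing attaches to individual inferences, not to the whole argument.
# Entries abbreviate the backing statements shown in Figure 3.
backing: dict[str, list[str]] = {
    "domain description": [
        "Applied linguists identified academic domain tasks and the "
        "language abilities they require",
        "A systematic process of task design and modeling by experts",
    ],
    "evaluation": [
        "Rubrics and administration conditions developed, trialed, and "
        "revised based on expert consensus",
        "Statistical characteristics of tasks monitored during development",
    ],
    "generalization": [
        "Reliability and generalizability studies indicated the number "
        "of tasks required",
        "An equating method identified for listening and reading measures",
    ],
}
```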
An equating method was identified for the listening and the reading measures. The speaking and writing measures are not equated statistically due to technical and test security constraints. Other methods of monitoring comparability of these test sections across forms are used (Educational Testing Service, 2008). In addition, a specific procedure was developed and implemented for carrying out evidence-centered design (Mislevy et al., 2003) in a manner intended to produce parallel tasks. These types of support led to the conclusion that the observed scores reflect the expected scores across tasks, forms, and raters.

Moving from this conclusion, the explanation inference, which links the expected score to the construct of academic English proficiency, is supported through several types of studies typically associated with construct validation: studies analyzing discourse and cognitive processes, task characteristic and item difficulty studies, concurrent correlational studies, analysis of reliability and factor analysis, and comparison studies of group differences. Examination of task completion processes and discourse supported the development of and justification for specific tasks. The correlations among TOEFL measures and other tests showed the expected relationships with the other measures. Correlations among measures within the test and the expected factor structure confirmed expectations as well. Results showing expected relationships with English learning were also found. In the TOEFL validity argument, these results linked to the test construct were not considered the whole validity argument, but rather were the backing for one inference.

The extrapolation inference links the construct to the target score, which signifies examinees' performance in the academic contexts of interest. In the TOEFL validity . . .
[FIGURE 4. The structure for a validity argument including a rebuttal: the grounds lead to the CLAIM ("The student's English speaking abilities are inadequate for study in an English-medium university") SINCE the WARRANT holds ("Hesitations and mispronunciations are characteristic of students with low levels of English speaking abilities"), which is supported BECAUSE OF its backing, UNLESS the REBUTTAL applies ("The topic required highly technical and unfamiliar vocabulary"). Adapted from Chapelle, Enright, and Jamieson, 2008, p. 7. Reprinted courtesy of Taylor and Francis.]
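Figure 4 adds the rebuttal to the Toulmin structure: the claim follows from the grounds since the warrant holds, unless the rebuttal applies. A minimal sketch of that qualification logic, assuming a simple boolean evaluation (the names are ours, chosen for illustration):

```python
from dataclasses import dataclass

@dataclass
class QualifiedClaim:
    claim: str     # SO: the conclusion being argued
    warrant: str   # SINCE: the principle licensing the claim
    rebuttal: str  # UNLESS: a condition that would defeat the claim

def claim_stands(warrant_backed: bool, rebuttal_applies: bool) -> bool:
    """A claim stands only if its warrant is backed and no rebuttal applies."""
    return warrant_backed and not rebuttal_applies

speaking = QualifiedClaim(
    claim="The student's English speaking abilities are inadequate for "
          "study in an English-medium university",
    warrant="Hesitations and mispronunciations are characteristic of "
            "students with low levels of English speaking abilities",
    rebuttal="The topic required highly technical and unfamiliar vocabulary",
)

# If research shows the topic demanded unfamiliar technical vocabulary,
# the rebuttal applies and the claim does not stand.
print(claim_stands(warrant_backed=True, rebuttal_applies=True))  # False
```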
. . . on the test construct as a starting point in the Standards was problematic in the TOEFL project, not because the test was solely focused on performance without any interest in a construct, but because the construct was extremely complex. The argument-based approach to validity provided a place for the construct without relying on it as a starting point or the central piece. The interpretive argument provides a means of specifying the multiple inferences underlying score interpretation and use, thereby removing the enormous burden that might otherwise be placed on an imprecise theoretical construct.

The discussion of theory and practice of validation in educational measurement since at least the 1950s (Cronbach & Meehl, 1955) has been concerned with establishing principled procedures for evaluating test-based inferences and uses. Discussion about how much and what kind of validity evidence is needed to support inferences and uses of test scores has been ongoing since that time, as has validation research in practice. Our attempt to put the discussion and guidelines into practice revealed areas of tension between the guidelines in the Standards and our specific case in practice, which was better addressed by the approach offered in Kane's article in the fourth edition of Educational Measurement. In the case of the TOEFL, the argument-based approach to validation represented a difference from that presented in the Standards, and a clear improvement as well. We hope that this and other examples of validity arguments will be considered by the committee charged with the next revision of the Standards.

References

AERA, APA, & NCME (1999). Standards for educational and psychological testing. Washington, DC: AERA.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F. (2002). Alternative interpretations of alternative assessments: Some validity issues in educational performance assessments. Educational Measurement: Issues and Practice, 21(3), 5–18.

Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34.

Bachman, L. F. (2007). What is the construct? The dialectic of abilities and contexts in defining constructs in language assessment. In J. Fox, M. Wesche, & D. Bayliss (Eds.), What are we measuring? Language testing reconsidered (pp. 41–71). Ottawa: University of Ottawa Press.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.

Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425–440.

Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed.). Westport, CT: American Council on Education/Praeger.