
Educational Measurement: Issues and Practice

Spring 2010, Vol. 29, No. 1, pp. 3–13

Does an Argument-Based Approach to Validity Make a Difference?

Carol A. Chapelle, Iowa State University; Mary K. Enright, Educational Testing Service; Joan Jamieson, Northern Arizona University

Drawing on experience between 2000 and 2007 in developing a validity argument for the high-stakes Test of English as a Foreign Language™ (TOEFL®), this paper evaluates the differences between the argument-based approach to validity as presented by Kane (2006) and that described in the 1999 AERA/APA/NCME Standards for Educational and Psychological Testing. Based on an analysis of four points of comparison—framing the intended score interpretation, outlining the essential research, structuring research results into a validity argument, and challenging the validity argument—we conclude that an argument-based approach to validity introduces some new and useful concepts and practices.

Keywords: claim, construct, high-stakes, inference, interpretive argument, Standards, TOEFL, validity argument

Attempts to improve professional practice in validation have resulted in seemingly new perspectives appearing in the fourth edition of Educational Measurement. In the introduction to the volume, Brennan (2006) contrasts the presentation on validity given by Kane (2006) with that in the previous edition: in 1989 the chapter was an extensive scholarly treatment rather than one "that provides much specific guidance to those who would undertake validation studies" (p. 2). In contrast, the chapter in the 2006 edition "aims at making validation a more accessible enterprise for educational measurement practitioners" (p. 3). Practitioners reading the chapter will notice a change from what appears in both the third edition of Educational Measurement (Linn, 1989) and the 1999 AERA/APA/NCME Standards for Educational and Psychological Testing. The introduction of new concepts and frameworks in the 2006 chapter raises the question of whether or not Kane really offers new insights into validation, and achieves the aim of making validation more accessible. We address this question in this paper by drawing on our experience between 2000 and 2007 in developing a validity argument for the Test of English as a Foreign Language™ (TOEFL®), a test used for high-stakes decisions about admissions to English-medium universities, which requires substantial validity evidence supporting score interpretation and use.

A revision of the TOEFL was undertaken by Educational Testing Service (ETS) from 1990 through 2005. ETS has standards for validity evidence that are in line with those of the AERA/APA/NCME Standards for Educational and Psychological Testing (1999), and therefore throughout the TOEFL revision, research was conducted that would yield the appropriate evidence for a validity argument in line with the perspectives presented in the Standards. By the late 1990s, a considerable amount of such research needed to be synthesized into a validity argument, and so we turned to the Standards for guidance. This process yielded considerable insight into shortcomings in the guidance provided by the Standards for synthesizing research results into a validity argument. Because the project began with an attempt to use the Standards and ended up instead drawing on the perspectives of Kane (1992, 2001, 2002, 2006) and Kane, Crooks, and Cohen (1999), it revealed contrasts in the two approaches that may be informative in future attempts to update the chapter on validity in the Standards. The validity argument resulting from this process is available elsewhere (Chapelle, Enright, & Jamieson, 2008), but this paper describes the contrasts as four points of comparison between the Standards and Kane's approach as summarized in Table 1.

Author note: Carol A. Chapelle is a Professor, Department of English, Iowa State University, 203 Ross Hall, Ames, IA 50011; carolc@iastate.edu. Mary K. Enright is Research Director, Center for Validity Research, Educational Testing Service, Rosedale Road MS10-R, Princeton, NJ 08541. Joan Jamieson is a Professor, English Department, Northern Arizona University, Box 6032, Flagstaff, AZ 86011.

Copyright © 2010 by the National Council on Measurement in Education

Framing the Intended Score Interpretation

The Standards frame the intended interpretation of the test as a construct which can be defined from many different perspectives. Multiple perspectives were brought to bear on the discussion of construct definition in the TOEFL project as the developers attempted to define a construct of academic English language proficiency based on work in applied linguistics. A strong construct theory—which would place academic English proficiency within a nomological network that could be used to generate clear hypotheses about test performance relative to other constructs and behaviors—did not result from this process and therefore the construct itself was not a good basis for subsequent research.
Table 1. Key Aspects in the Process of Validation in the Standards (1999) and in Educational Measurement (Kane, 2006)

Framing the intended score interpretation. Standards (1999): a construct. Kane (2006): an interpretive argument.
Outlining the essential research. Standards (1999): propositions consistent with the intended interpretation. Kane (2006): inferences and their assumptions.
Structuring research results into a validity argument. Standards (1999): listing types of evidence. Kane (2006): series of inferences linking grounds with conclusions.
Challenging the validity argument. Standards (1999): counterevidence for propositions. Kane (2006): refuting the argument.

However, Kane's organizing concept of an "interpretive argument," which does not rely on a construct, proved to be useful.

A Construct

In the Standards, the term "construct" refers broadly to "the concept or characteristic that a test is designed to measure" (1999, p. 5) and "proposed interpretation" is used interchangeably with "construct" (e.g., 1999, p. 9). This approach is consistent with views in educational measurement in the early 1990s that a theoretical construct should provide the basis for score interpretation for a large-scale, high-stakes test (e.g., Messick, 1994). Like experts in other areas, language assessment experts have attempted to capture the dynamic and context-mediated nature of the construct measured by a test such as the TOEFL. Discussions of how to do so (e.g., Stansfield, 1986) ultimately led to the launch of a project to develop a revised TOEFL that would be based on modern conceptions of language proficiency. Designers of the new TOEFL agreed that theoretical rationales underlying score interpretation needed to come from a construct of language proficiency and that therefore this would serve as a basis for test design.

The idea that a language proficiency construct should underlie test development and validation has strong support in language testing. For example, Alderson, Clapham and Wall (1995) tie the construct to both the test specifications and validation. "For validation purposes, the test specifications need to make the theoretical framework which underlies the test explicit, and to spell out relationships among its constructs, as well as the relationship between the theory and the purpose for which the test is designed" (p. 17). Bachman and Palmer (1996) include definition of the construct in the fourth step of language test development: ". . . a theoretical definition of the construct . . . provides the basis for considering and investigating the construct validity of the interpretations we make of test scores. This theoretical definition also provides a basis for the development . . . of test tasks" (p. 89). Weir (2005) ties the construct to validation: "The more fully we are able to describe the construct we are attempting to measure at the a priori stage the more meaningful might be the statistical procedures contributing to construct validation that can subsequently be applied to the results of the test . . . . We can never escape from the need to define what is being measured, just as we are obliged to investigate a test in operation" (p. 18).

Despite agreement on the need to define the construct as a basis for test development, no agreement exists concerning a single best way to define constructs of language proficiency to serve as a defensible basis for score interpretation (e.g., Bachman, 1990; Bachman & Palmer, 1996; Chapelle, 1998; Chalhoub-Deville, 1997, 2001; Oller, 1979; McNamara, 1996). Nevertheless, most would agree that limiting a construct of language proficiency to a trait such as knowledge of vocabulary or listening is too narrow for the interpretations that test users want to make for university admissions decisions. Instead, test users are typically interested in examinees' ability to use a complex of knowledge and processes to achieve particular goals. Therefore, strategies or processes of language use have been included in constructs of language proficiency, called communicative competence (Canale & Swain, 1980) or communicative language ability (Bachman, 1990). Applied linguists would also agree that language proficiency needs to be defined in terms of contexts of performance because second language learners can be proficient in some contexts but lack proficiency in others (Cummins, 1983). A conceptualization of language proficiency that recognizes one trait (or even a complex of abilities) as responsible for performance across all contexts fails to account for the variation in performance observed across these different contexts of language use (Bachman & Palmer, 1996; Chalhoub-Deville, 1997; Chapelle, 1998; McNamara, 1996; Norris, Brown, Hudson, & Bonk, 2002; Skehan, 1998). As a result, language proficiency constructs of interest are difficult to define in a precise way (e.g., as construct representation; Embretson, 1983). Bachman (2007) provides an analysis of the issues and the various perspectives on construct definition that have been presented over the past fifty years in applied linguistics.

This historical perspective reveals the challenge the TOEFL project faced in attempting to use a construct definition developed by applied linguists as a basis of the validity argument. But this challenge is not unique to constructs in applied linguistics. Borsboom (2006) points out that "psychological theories are often simply too vague to motivate psychometric models" (p. 437), and without a specified psychometric model underlying score interpretation, what kind of evidence should be sought for the validity argument? When the construct underlying test score interpretation is so complex, how can a validity argument be formulated? After struggling with these questions, we welcomed the different perspective offered by Kane for approaching the problem of score interpretation.

An Interpretive Argument

Kane's approach does not require a construct per se but rather an explicitly stated interpretation called an interpretive argument (Kane, 1992, p. 527; 2001; Kane et al., 1999). He describes the interpretive argument as consistent with the general principles accepted for construct validity that appear in the Standards: "Validation requires a clear statement of the proposed interpretations and uses" (Kane, 2006, p. 23).



Rather than relying on formal theories, however, the interpretive argument "specifies the proposed interpretations and uses of test results by laying out the network of inferences and assumptions leading from the observed performances to the conclusions and decisions based on the performances" (Kane, 2006, p. 23). "The validity argument provides an evaluation of the interpretive argument" (Kane, 2006, p. 23). In the simplest terms, then, a validity argument is an interpretive argument in which backing has been provided for the assumptions. In view of the complexity of the construct theory underlying the TOEFL interpretation, we were open to exploring a means of establishing a basis for score interpretation that did not rely solely on the concepts of tasks and abilities from psychology or on applied linguistics' discussion of language ability constructs. Instead, the building blocks of interpretive arguments are the types of inferences identified by Kane. Test developers and researchers wishing to support score interpretation need to identify the inferences upon which score interpretation is to be based.

To illustrate the meaning of inference, a simplified interpretive argument for speaking performance in an English language classroom is shown in Figure 1. It begins with grounds such as the observation that a student's presentation to a class on an assigned topic was characterized by hesitations and mispronunciations. "Grounds" is the term used by Toulmin, Rieke and Janik (1984) to denote the basis for making a claim; "data" was used by Toulmin (2003) and has been used by others to refer to the same functional unit of the argument. The claim one might make on the basis of that performance is that the student's speaking abilities are inadequate for study in an English-medium university. The arrow extending from the grounds to the claim represents an inference. The inference allows for a conclusion, which, in the example, is the claim. The point is that the observation itself cannot mean that the student is unprepared. Instead, an interpretive argument specifies the interpretation to be drawn from the grounds to a claim by an inference. Such inferences play a critical role in current approaches toward developing interpretive and validity arguments (Kane, 1992, 2001; Mislevy, Steinberg, and Almond, 2003), which are based on Toulmin's (2003) description of informal or practical arguments, such as those used in nonmathematical fields like law.

FIGURE 1. Illustration of an interpretive argument comprising an inference about ability based upon grounds.
GROUNDS: A student's presentation to the class on an assigned topic was characterized by hesitations and mispronunciations.
SO (inference)
CLAIM: The student's English speaking abilities are inadequate for study in an English-medium university.

Interpretive arguments for test scores are constructed through the use of particular types of inferences such as the evaluation inference shown in Figure 2. In this example, the grounds consist of the observations made of a student's speaking performance on a task in class. The evaluation inference is made in awarding a score of "2" to the student for that performance. The interpretive argument states explicitly that such an inference is the basis for awarding a score of "2." This statement about the inference provides the basis for planning and interpreting validity evidence. Validity evidence is needed to support such inferences, which, in this example, could come from evidence showing the rationales for developing the scoring rubric and the consistency with which it was applied.

This simple, hypothetical example illustrates the basic approach to an interpretive argument that states the basis of an interpretation without defining a construct. Rather than terms referring to examinee characteristics, the tools of the interpretive argument are inferences that are typically made in the process of measurement. When we pursued this approach in the TOEFL project, we did not entirely eliminate the construct but rather it became one part of an overall chain of inferences. The set of inferences, which we were able to specify in terms that had direct implications for research, became the organizing units underlying score interpretation. If test developers can work with a set of such inferences rather than solely with the complex constructs that can be defined in many different ways, the basis for score interpretation becomes more manageable.

Outlining the Essential Research

In a general sense, the construct or interpretive argument provides a starting point for planning research to be used in a validation argument, but more specifically, within the Standards framework, how does one move from a construct definition of something like "reading comprehension" or "speaking ability" to the design of a study whose results will show that the scores should be interpreted as intended? The Standards advise researchers to begin to consider validation research by generating propositions that would be expected to be true if test scores did in fact reflect the intended construct. In the TOEFL project we attempted to follow this advice, ending up with a list of propositions. Taking Kane's approach, in contrast, propositions, called warrants, were also generated but the important difference lies in how those propositions are generated and what they are likely to consist of.

Propositions Consistent with the Intended Interpretation

The Standards direct test developers and researchers to gather "kinds of evidence" that are needed to evaluate the "intended interpretation" of test scores. This guidance, of course, is recognizable as a way of summarizing the dominant views on test validation throughout the 1990s that assumed that multiple types of evidence should support score interpretation (e.g., Cronbach, 1988; Messick, 1989).

FIGURE 2. Example grounds and conclusion for an evaluation inference. (Adapted from Chapelle, Enright, and Jamieson, 2008, p. 11. Reprinted courtesy of Taylor and Francis.)
Observations: When asked to discuss the relationship between information presented in a brief lecture and a short reading passage a student replies: "Oh, xxx he xxx the author xxx that alter energy xxx needed in United States and the woman on the film he spoke about new kind of automobile. xxx It use fewer gasoline." Note: xxx represents unintelligible speech.
Evaluation (inference)
Observed Score: The student's spoken response received a score of 2.

The Standards point out "many lines of evidence can contribute to an understanding of the construct meaning of test scores" (p. 5) and suggests that those lines of evidence can consist of the familiar categories of evidence based on test content, response processes, internal structure, relations to other variables, and consequences, as outlined by Messick (1989). Each of these lines of evidence suggests methodologies, and there are plenty of examples of these in language testing and elsewhere. However, as Shepard (1993) pointed out, the idea that "many lines of evidence can contribute" offers a large set of options rather than guidance to an efficient and effective path for validation. According to the Standards, "The decision about what types of evidence are important for validation in each instance can be clarified by developing a set of propositions that support the proposed interpretation for a particular purpose of testing" (AERA/APA/NCME, 1999, p. 9). For example, if one wishes to make the proposition that the test score distinguishes among examinees at different English ability levels, the validation research must provide data indicating that this is actually the case. The proposition guides the researcher to produce supporting evidence. These propositions are to serve as hypotheses about score interpretations, which would provide guidance about the types of validity evidence required. In the Standards, six examples of propositions are given for a mathematics achievement test used to test readiness for an advanced course. They include statements such as "that certain skills are prerequisite for the advanced course" and "that test scores are not unduly influenced by ancillary variables" (p. 9).

Drawing on these examples, we developed the following propositions:

1. Certain language skills defined as listening, reading, speaking and writing both independently and in combination are necessary (but not sufficient) for students to succeed in advanced academic settings.
2. The content domain of the tasks on the TOEFL requires the English language skills students need to succeed in English-speaking North American university settings.
3. Each of the skills—listening, reading, speaking, and writing—is composed of a set of subskills.
4. Test tasks comprising each skill score exhibit internal consistency.
5. Each of the four skills is distinct enough from each other to be measured independently, but the skills are related by some core competencies.
6. Test performance is not affected by test-taking processes irrelevant to the constructs of interest.
7. Test scores are arrived at through judgments of appropriate aspects of learners' performance.
8. Test performance is not affected by examinees' familiarity with computer use.
9. Test performance is not affected inappropriately by background knowledge of the topics represented on the test.
10. The test assesses second language abilities independent of general cognitive abilities.
11. Criterion measures can validly assess the linguistic aspects of academic success.
12. Test scores are positively related to criterion measures of success.
13. Use of the test will result in positive washback in ESL/EFL instruction, such as increased emphasis on speaking and writing and focus on academic language.

We stopped at number 13, recognizing that the list could go on and on unless we gained a better sense of what a proposition should consist of and how many one would like to have for a good validation argument. Moreover, in the absence of such guidelines we found that the propositions we were generating were influenced by the validation research that had been completed, and were therefore unlikely to help identify areas where more research was needed. On the one hand, contextualizing research was precisely what needed to be done, but on the other hand this process seemed to start from the perspective of completed research rather than from the perspective of score meaning. In short, our examination of the Standards and the materials that had led up to them demonstrated the need for more explicit guidance on how to formulate an intended interpretation and the propositions that are supposed to point to the types of evidence that would ultimately contribute to the TOEFL validity argument.

Inferences and Their Assumptions

Kane's approach to identifying framing statements is to connect them to the inferences in the interpretive argument through the use of two types of statements: warrants and assumptions. Taking the example in Figure 2, the evaluation inference would be supported by a warrant such as "Observations of performance on the speaking task are evaluated to provide a score reflective of the relevant language abilities." Here, the warrant is the generally held principle that hesitations and mispronunciations are characteristics of students with low levels of speaking ability who would have trouble at an American university. Such a warrant is a statement which rests on assumptions that need to be supported in order for the inference to be made.



Table 2. Summary of the Inferences, Warrants in the TOEFL Validity Argument with Their Underlying Assumptions

Domain description
Warrant: Observations of performance on the TOEFL reveal relevant knowledge, skills, and abilities in situations representative of those in the target domain of language use in the English-medium institutions of higher education.
Assumptions: 1. Critical English language skills, knowledge, and processes needed for study in English-medium colleges and universities can be identified. 2. Assessment tasks that require important skills and are representative of the academic domain can be simulated.

Evaluation
Warrant: Observations of performance on TOEFL tasks are evaluated to provide observed scores reflective of targeted language abilities.
Assumptions: 1. Rubrics for scoring responses are appropriate for providing evidence of targeted language abilities. 2. Task administration conditions are appropriate for providing evidence of targeted language abilities. 3. The statistical characteristics of items, measures, and test forms are appropriate for norm-referenced decisions.

Generalization
Warrant: Observed scores are estimates of expected scores over the relevant parallel versions of tasks and test forms and across raters.
Assumptions: 1. A sufficient number of tasks are included on the test to provide stable estimates of test takers' performances. 2. Configuration of tasks on measures is appropriate for intended interpretation. 3. Appropriate scaling and equating procedures for test scores are used. 4. Task and test specifications are well defined so that parallel tasks and test forms are created.

Explanation
Warrant: Expected scores are attributed to a construct of academic language proficiency.
Assumptions: 1. The linguistic knowledge, processes, and strategies required to successfully complete tasks vary across tasks in keeping with theoretical expectations. 2. Task difficulty is systematically influenced by task characteristics. 3. Performance on new test measures relates to performance on other test-based measures of language proficiency as expected theoretically. 4. The internal structure of the test scores is consistent with a theoretical view of language proficiency as a number of highly interrelated components. 5. Test performance varies according to amount and quality of experience in learning English.

Extrapolation
Warrant: The construct of academic language proficiency as assessed by TOEFL accounts for the quality of linguistic performance in English-medium institutions of higher education.
Assumptions: Performance on the test is related to other criteria of language proficiency in the academic context.

Utilization
Warrant: Estimates of the quality of performance in the English-medium institutions of higher education obtained from the TOEFL are useful for making decisions about admissions and appropriate curricula for test takers.
Assumptions: 1. The meaning of test scores is clearly interpretable by admissions officers, test takers, and teachers. 2. The test will have a positive influence on how English is taught.

Adapted from Chapelle, Enright, and Jamieson, 2008, pp. 19–21. Reprinted courtesy of Taylor and Francis.

A warrant is a law, a generally held principle, rule of thumb, or established procedure. Assumptions would be, for example, that the rubric for scoring the responses was appropriate for providing the relevant evidence of ability. Assumptions prompt research that focuses on particular issues. In this case, the research would need to provide evidence for the accuracy and relevance of the rating of the student's performance.

As shown in Table 2, we identified six inferences, each with a warrant and assumptions, that form the basis for the TOEFL interpretive argument. Each of the inferences is used to move from grounds to a claim; each claim becomes grounds for a subsequent claim. For example, a generalization inference connects the grounds of an observed score which reflects the relevant aspects of performance with a claim that the observed score reflects the expected score across tasks, occasions and raters. Rather than state all of these grounds and claims which are linked in a formulaic way to types of inferences, Table 2 focuses on the warrants and assumptions which need to be generated by the researcher to guide the validity research. Discussion of the inferences as the central building blocks for the interpretive argument appears in Kane et al. (1999), and the specific statements used as grounds and claims in the TOEFL validity argument appear in Chapelle (2008).

The intended score interpretation is based on a domain description inference, which has a warrant that observations of performance on the TOEFL reveal relevant knowledge, skills, and abilities in situations representative of those in the target domain of language use in the English-medium institutions of higher education. Kane does not define domain description as an inference, but he points out that ". . . if the test is intended to be interpreted as a measure of competence in some domain, then efforts to describe the domain carefully and to develop items that reflect the domain (in terms of content, cognitive level, and freedom from potential sources of systematic errors) tend to support the intended interpretation" (Kane, 2004, p. 141). Because this inference was critical to score interpretations for the TOEFL, we included it in the TOEFL interpretive argument. The validity of that inference rests on the assumptions that assessment tasks that are representative of the academic domain can be identified, that critical English language skills, knowledge, and processes needed for study in English-medium colleges and universities can be identified, and that assessment tasks that require important skills and are representative of the academic domain can be developed.

The second inference in the argument is evaluation, which has the warrant that observations of performance on TOEFL tasks are evaluated to provide observed scores reflective of targeted language abilities. The quality of evaluation inferences rests on the assumption that "the criteria used to score the performance are appropriate and have been applied as intended and second, that the performance occurred under conditions compatible with the intended score interpretation" (Kane et al., 1999, p. 9). In other words, if the test administration conditions are not favorable or if they vary for test takers, then the intended interpretation of a test taker's score may not be supported. The assumptions underlying this inference are that rubrics for scoring responses are appropriate for providing evidence of targeted language abilities, that task administration conditions are appropriate for providing evidence of targeted language abilities, and that the statistical characteristics of items, measures, and test forms are appropriate for norm-referenced decisions.

Third, the interpretive argument includes an inference of generalization with the warrant that observed scores are estimates of expected scores over the relevant parallel versions of tasks and test forms and across raters, as presented by Kane et al. (1999, p. 10). In the example shown in Figure 2, the expected score on similar test tasks would be a "2." The "2" across all tasks was not observed, but instead was inferred on the basis of the single observed "2." Of course, in most test settings, one would not attempt to generalize on the basis of one task. In the TOEFL interpretive argument, generalization rests on the assumptions that (1) a sufficient number of tasks are included on the test to provide stable estimates of test takers' performances, (2) the configuration of the measures in terms of the tasks included is appropriate for the intended interpretation, (3) appropriate scaling and equating procedures for test scores are used, and (4) task and test specifications are well defined so that parallel tasks and test forms are created.

Fourth, the explanation inference in the TOEFL interpretive argument has a warrant that the expected scores are attributed to a construct of academic language proficiency. Kane (2001) maintains that an interpretive argument can be developed without including explanation, but suggests that in cases where a construct is used in score interpretation, as is often the case in language assessment (e.g., Bachman, 2002), it should be specified as an explanation inference. An explanation inference is included in the interpretive argument so that assumptions associated with it can be identified and investigated. The assumptions identified were that (1) the linguistic knowledge, processes, and strategies required to successfully complete tasks vary across tasks in keeping with theoretical expectations; (2) task difficulty is systematically influenced by task characteristics; (3) performance on new test measures relates to performance on other test-based measures of language proficiency as expected theoretically; (4) the internal structure of the test scores is consistent with a theoretical view of language proficiency as a number of highly interrelated components; and (5) test performance varies according to the amount and quality of experience in learning English.

Fifth, the interpretive argument includes an extrapolation inference with the warrant that the construct of academic language proficiency as assessed by TOEFL accounts for the quality of linguistic performance in English-medium institutions of higher education. This is consistent with the Kane et al. (1999) description of extrapolation: the inference that is made when the test takers' expected scores are interpreted as indicative of performance and scores that they would receive in the target setting. The assumption underlying this inference in the TOEFL interpretive argument is that performance on the test is related to other criteria of language proficiency in the academic context.

Finally, utilization links the target score to the decisions about examinees for which the score is used. This "decision-making" is viewed by Kane as an inference, which underscores the need for evidence to support it in an overall validity argument. Kane distinguishes the "descriptive interpretations" (Kane, 2002, p. 32) of the other inferences from the policy-based interpretations inherent in test score use. The latter "involve assumptions supporting the decision procedure's suitability as a policy [e.g., requiring a TOEFL score as one part of the application of international students at American universities], and policies are typically justified by claims about their consequences" (Kane, 2002, p. 32). Bachman (2005) referred to decision-making as "utilization," which is the term we adopted in view of its grammatical parallelism with the terms used for the other inferences. The utilization inference in the TOEFL interpretive argument is made on the basis of assumptions that the meaning of the test scores is clearly interpretable by admissions officers, test takers, and teachers, and that the test will have a positive influence on how English is taught.

This interpretive argument with six inferences helped to surmount the challenges presented by our attempt to generate relevant propositions as advised by the Standards. Based on the six inferences shown in Table 2, we were able to identify the assumptions in need of support in the TOEFL validity argument. The assumptions in the third column of Table 2 are similar in type and form to the propositions we had brainstormed under the guidance of the Standards, but the important difference is that each assumption is tied to a particular inference. In other words, assumptions are developed in a more principled way as a piece of an interpretive argument. In the initial brainstorming session of propositions, we tended to look to the existing evidence, and work backwards to formulate a proposition. Working backwards from the existing evidence is not a good way of ending up with a fair and balanced validity argument which can identify areas where additional research is needed. The number of possible assumptions associated with each inference can vary, but the issues they should focus on are constrained more productively than propositions guided by construct theory alone. Moreover, a finite number of inferences are typically made in measurement, and therefore the inferences and warrants that a researcher has to work with have bounds.
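The chain structure described above lends itself to an explicit, machine-readable representation. The sketch below is our own illustration in Python, not part of the TOEFL documentation or of Kane's published framework; the class and field names are hypothetical, and the warrants and assumptions are abbreviated from Table 2.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Inference:
    """One link in an interpretive argument: an inference licensed by a warrant,
    which in turn rests on assumptions that validation research must support."""
    name: str
    warrant: str
    assumptions: List[str] = field(default_factory=list)

# The first three TOEFL inferences, abbreviated from Table 2; the claim produced
# by each inference serves as the grounds for the next one in the chain.
toefl_chain = [
    Inference(
        name="Domain description",
        warrant=("Observations of TOEFL performance reveal knowledge, skills, and "
                 "abilities representative of the English-medium academic domain."),
        assumptions=[
            "Critical academic English skills, knowledge, and processes can be identified.",
            "Tasks representative of the academic domain can be simulated.",
        ],
    ),
    Inference(
        name="Evaluation",
        warrant="Performances are scored to reflect targeted language abilities.",
        assumptions=[
            "Scoring rubrics are appropriate for the targeted abilities.",
            "Task administration conditions are appropriate.",
            "Statistical characteristics of items and forms suit norm-referenced decisions.",
        ],
    ),
    Inference(
        name="Generalization",
        warrant=("Observed scores estimate expected scores over parallel tasks, "
                 "test forms, and raters."),
        assumptions=[
            "Enough tasks are included to provide stable estimates.",
            "Scaling, equating, and task specifications support parallel forms.",
        ],
    ),
    # ... followed by Explanation, Extrapolation, and Utilization, as in Table 2.
]

for inference in toefl_chain:
    print(f"{inference.name}: {len(inference.assumptions)} assumptions to support")
```

Writing the argument out this way makes visible what the text emphasizes: each assumption is tied to a particular inference, so a research agenda can be read off the structure rather than brainstormed from completed studies.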


Structuring Research Results into a Validity Argument

"I invite you to think of 'validity argument' rather than 'validation research'" (p. 4). These important framing words of Cronbach (1988) over 20 years ago were used to highlight the importance of the many audiences for which a validity argument might be made. Audiences interested in, for example, the functional, political, and explanatory dimensions of tests place demands on some of what should be included in a validity argument or how it might best be presented. However, none of these audiences explicitly informs the internal logic of the argument itself. In the 1999 Standards, researchers are advised to develop a "sound validity argument," but the technical issue of how one actually puts together a validity argument with all of the relevant pieces of research was not developed, leaving researchers to produce lists rather than arguments.

Listing Types of Evidence

The validity chapter in the Standards enumerates various sources of validity evidence including those based on content, response processes, internal structure, relations to other variables, and consequences of testing. The Standards specify that "a sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses" (p. 17). However, the source of the coherence is not spelled out. Thus, this approach identifies the types of research that would typically be done, but the validity evidence remains a listing of research results, which researchers need to integrate in a narrative, and which others need to evaluate in view of their own experience.

Working within these perspectives, the TOEFL program maintains a taxonomy consisting of types of research that have been conducted (Educational Testing Service, 2007). One could look at such a taxonomy and argue that strong evidence supported various propositions that one might make about the test. Such a list could also be used to identify areas where more research is needed. However, a taxonomy is not an argument, and in working with a taxonomy one is not prompted to look at the strength of the evidence or to organize it in a way that presents a validity argument.

Series of Inferences Linking Grounds with Conclusions

Kane's approach depicts the validity argument as a step-by-step movement across "bridges" (the metaphor referring to inferences) to a conclusion about test score use. To extend this bridge metaphor, one can think of a valid token being required to cross each bridge, and tokens being obtained when adequate backing is provided for the assumptions associated with each inference. The idea of a chain of inferences and implications is consistent with Toulmin, Rieke, and Janik's (1984) conception of an argument, which is "liable to become the starting point for a further argument; this second argument tends to become the starting point for a third argument, and so on. In this way, arguments become connected together in chains" (p. 73).

Figure 3 shows the structure of the validity argument for the TOEFL. It consists of the inferences summarized in Table 2 as well as the backing for their assumptions. In the diagram, the inferences are used as short-hand for referring to the inference and its warrant. For example, the evaluation inference is made on the basis of the warrant that observations of performance on TOEFL tasks are evaluated to provide observed scores reflective of targeted language abilities. Such a warrant is made on the basis of assumptions, including that rubrics for scoring responses are appropriate for providing evidence of targeted language abilities. These warrants and assumptions outlined in Table 2 prompt the selection of backing that is shown in Figure 3. In Kane's terminology, the interpretive argument lays out the intended inferences and warrants. The process of building a validity argument requires the researcher to examine the assumptions underlying each of the warrants in order to focus on obtaining the relevant backing. Only inferences and backing are noted in the summary of the validity argument appearing in Figure 3, but these are the result of the process of laying out inferences, warrants and assumptions, as well as gathering the appropriate backing.

The diagram summarizing the TOEFL validity argument is read starting at the bottom of the page with the grounds—in this case the target language use domain, the academic contexts, where the examinees will have to use English in order to succeed. The backing required to move across the domain description inference comes from a combination of systematic domain analysis conducted by applied linguists, empirical studies of stakeholders' views of important tasks and academic language, and task modeling. These are summarized in Chapelle et al. (2008), which refers to the documents describing how applied linguists identified academic domain tasks and language abilities required to complete them, research that showed teachers and learners thought these tasks were important, as well as the systematic process of task design and modeling engaged by experts. The conclusion, that a relevant observation of examinee performance could be obtained from the test tasks, serves as the grounds for the next inference, evaluation.

The backing for the evaluation inference was obtained from systematic rubric development, prototyping studies, and item and test analysis, as described by Chapelle et al. (2008). Rubrics and multiple task administration conditions were developed, trialed, and revised based on expert consensus. Statistical characteristics of tasks and measures were monitored throughout the test development and modifications in tasks and measures were made as needed. This process provided the backing needed to move to the conclusion that the observed score is appropriate.

The observed score is the grounds for the generalization inference, which was supported by backing consisting of generalizability and reliability studies as well as scaling and equating studies. Because generalization is a way of conceptualizing reliability, in some cases assumptions underlying generalization inferences can be supported through reliability estimates. Other support of generalization comes from standardization of task characteristics and test administration conditions (Kane et al., 1999) and from score equating (Kane, 2004). In the TOEFL validity argument, results from reliability and generalizability studies were used to determine the number of tasks required. A variety of task configurations was tried to find a stable configuration. Various rating scenarios were examined to maximize efficiency.

FIGURE 3. Structure of the validity argument for the TOEFL.

GROUNDS: The target language use domain.
Domain Description (backing): 1. Applied linguists identified academic domain tasks. Research showed teachers and learners thought these tasks were important. 2. Applied linguists identified language abilities required for academic tasks. 3. A systematic process of task design and modeling was engaged by experts.
Observation
Evaluation (backing): 1. Rubrics were developed, trialed, and revised based on expert consensus. 2. Multiple task administration conditions were developed, trialed, and revised based on expert consensus. 3. Statistical characteristics of tasks and measures were monitored throughout the test development and modifications in tasks and measures were made as needed.
Observed Score
Generalization (backing): 1. Results from reliability and generalizability studies indicated the number of tasks required. 2. A variety of task configurations was tried to find a stable configuration. 3. Various rating scenarios were examined to maximize efficiency. 4. An equating method was identified for the listening and the reading measures. 5. An ECD process yielded task shells for producing parallel tasks.
Expected Score
Explanation (backing): 1. Examination of task completion processes and discourse supported the development of and justification for specific tasks. 2. Expected correlations were found among TOEFL measures and other tests. 3. Correlations were found among measures within the test and expected factor structure. 4. Results showed expected relationships with English learning.
Construct
Extrapolation (backing): Results indicate positive relationships between test performance and students' academic placement, test takers' self-assessments of their own language proficiency, and instructors' judgments of students' English language proficiency.
Target Score
Utilization (backing): 1. Educational Testing Service has produced materials and held test user information sessions. 2. Educational Testing Service has produced materials and held information sessions to help test users set cut scores. 3. The first phases of a washback study have been completed.
CONCLUSION: The test score reflects the ability of the test taker to use and understand English as it is spoken, written, and heard in college and university settings. The score is useful for aiding in admissions and placement decisions and for guiding English-language instruction.
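As a companion to Figure 3, the following is a hypothetical sketch (our own names and simplifications, not an ETS procedure) of how backing can be attached to each inference and the chain checked from the grounds upward; the rule that one recorded piece of backing "supports" a link is a deliberate oversimplification of the evaluative work the text describes.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SupportedInference:
    """An inference in the validity argument together with the backing gathered
    for it, abbreviated from Figure 3."""
    name: str
    backing: List[str] = field(default_factory=list)

    def is_supported(self) -> bool:
        # Crude stand-in for the real evaluative work: a link counts as
        # supported as soon as any backing has been recorded for it.
        return bool(self.backing)

def conclusion_reached(chain: List[SupportedInference]) -> bool:
    """Read the argument from the grounds upward, as Figure 3 is read: the
    conclusion is reached only if every inference in the chain has backing."""
    return all(link.is_supported() for link in chain)

toefl_validity_argument = [
    SupportedInference("Domain description",
                       ["Domain analysis by applied linguists",
                        "Studies of teachers' and learners' views of important tasks",
                        "Systematic task design and modeling"]),
    SupportedInference("Evaluation",
                       ["Rubric development, trialing, and revision",
                        "Monitoring of statistical characteristics of tasks and measures"]),
    SupportedInference("Generalization",
                       ["Reliability and generalizability studies",
                        "Equating method for listening and reading",
                        "ECD task shells for parallel tasks"]),
    SupportedInference("Explanation",
                       ["Task completion process and discourse studies",
                        "Expected correlations with other tests and expected factor structure"]),
    SupportedInference("Extrapolation",
                       ["Criterion-related studies: placement, self-assessment, instructor judgments"]),
    SupportedInference("Utilization",
                       ["Score interpretation materials and user sessions",
                        "First phases of a washback study"]),
]

print(conclusion_reached(toefl_validity_argument))  # True once every link has backing
```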

An equating method was identified for the listening and the reading measures. The speaking and writing measures are not equated statistically due to technical and test security constraints. Other methods of monitoring comparability of these test sections across forms are used (Educational Testing Service, 2008). In addition, a specific procedure was developed and implemented for carrying out evidence-centered design (Mislevy et al., 2003) in a manner intended to produce parallel tasks. These types of support led to the conclusion that the observed scores reflect the expected scores across tasks, forms, and raters.

Moving from this conclusion, the explanation inference, which links the expected score to the construct of academic English proficiency, is supported through several types of studies typically associated with construct validation: studies analyzing discourse and cognitive processes, task characteristic and item difficulty studies, concurrent correlational studies, analysis of reliability and factor analysis, and comparison studies of group differences. Examination of task completion processes and discourse supported the development of and justification for specific tasks. The correlations among TOEFL measures and other tests found expected relationships with the other measures. Correlations among measures within the test and expected factor structure confirmed expectations as well. Results showing expected relationships with English learning were also found. In the TOEFL validity argument, these results linked to the test construct were not considered the whole validity argument, but rather were the backing for one inference.

The extrapolation inference links the construct to the target score which signifies examinee's performance in the academic contexts of interest.



In the TOEFL validity argument this inference is supported by criterion-related validity studies. Results indicated positive relationships between test performance and students' academic placement, test takers' self-assessments of their own language proficiency, and instructors' judgments of students' English language proficiency.

Finally, the utilization inference, which links the target score to the conclusion that the score is useful for aiding in admissions and placement decisions and for guiding English-language instruction, is supported by backing that comes from standard-setting studies, score interpretation materials, the availability of instructional materials, and washback studies. As a step toward the needed research, Educational Testing Service has produced materials and held test user information sessions about the test and about setting cut scores. The first phase of a washback study (Wall & Horák, 2006) has been completed to examine test effects of the former version of the TOEFL as a point of comparison. The second phase (Wall & Horák, 2008) monitored six teachers in five countries as they learned about changes in the TOEFL and began to think about how these might impact their teaching in the future. In a report on Phases 3 and 4, Wall and Horák (submitted for publication) analyze the impact of the new TOEFL on test preparation materials and describe how three of the teachers modified their teaching in response to changes in the TOEFL test. More research is needed to provide additional backing for this inference.

The structure and content of the TOEFL validity argument is based on the interpretive argument that specifies the intended test score interpretation and use. It includes both the theoretical and empirical backing for the intended interpretation and use. This presentation provides a framework for synthesizing existing research conducted throughout the revision and development and points to areas where additional research efforts should be directed.

Challenging the Validity Argument

In developing the validity argument for the TOEFL, we focused on the backing required to support intended interpretations. However, part of the impetus for constructing such a validity argument is that it provides sufficient clarity to create a space for productive discussion of weakness in the argument. In the Standards, the issue of challenging a validity argument is described in terms of counterevidence for propositions, whereas Kane's approach refers to refuting the argument.

Counterevidence for Propositions

The Standards advise those who are conducting validation research as follows: "Identifying the propositions implied by a proposed test interpretation can be facilitated by considering rival hypotheses that may challenge the proposed interpretation" (p. 10). The Standards advise that a good way to look for rival hypotheses is by considering the possibilities of construct underrepresentation and construct irrelevant variance. Thus, the development of counterevidence is also tied to the construct from this perspective. Moreover, it is not clear how counterevidence should be interpreted relative to positive evidence, or how one might systematically seek counterevidence pertaining to particular aspects of the construct definition.

Refuting the Argument

Kane's approach to the validity argument provides a place for counterevidence. Figure 4 illustrates the counterevidence as a rebuttal which is added to the argument shown in Figure 1. In the example, the warrant that allows the inference is that hesitations and mispronunciations are characteristic of students with low levels of English speaking abilities, and it is supported by the experience of an ESL teacher with such students. The ESL teacher's experience is used in the interpretive argument as backing for the assumption. The argument can also include the rebuttal, which appears in the example as the fact that the topic required highly technical and unfamiliar vocabulary. Such a fact could weaken the argument by providing additional information that would suggest that an inference about the student's lack of preparedness should not be made on the basis of a classroom presentation on a particular topic.

Even when the warrant is supported with backing, exceptions may be relevant or other circumstances may undermine the inference, thereby rebutting the force of the interpretive argument. For example, the rebuttal in the example is that the assigned topic for the oral presentation required the student to use highly technical and unfamiliar vocabulary. This rebuttal weakens the inferential link between the grounds—that the oral presentation contained many hesitations and mispronunciations—and the claim that the student's speaking ability was at a level that would not allow him to succeed at an English-medium university. Such a rebuttal would pertain to an important link in the argument for everyone performing the task. Rebuttals can also pertain to particular cases, thereby delimiting the conditions under which the argument would hold.

Development of the TOEFL validity argument thus far has not yet reached the stage at which we can evaluate its utility for discussion of either type of rebuttal. The validity argument to date represents the first stage, what Briggs (2004) called "design validity," which necessarily reflects a confirmationist approach. Our aim was to express the intended interpretation and use of the TOEFL scores, as well as the relevant backing as clearly as possible. This presentation is open to challenges to "the appropriateness of the proposed goals and uses of the testing program, the adequacy of the interpretive argument, or the plausibility of the interpretive argument" (Kane, 2004, p. 166). Time will tell whether future researchers will be able to pick up the validity narrative and add to it with additional backing or challenge it with rebuttals.

Did the Argument-Based Approach Make a Difference?

The timing and mission of the TOEFL project offered a unique opportunity to compare the approach to validation argumentation presented in the Standards with the more recent presentation in the fourth edition of Educational Measurement, which had unfolded in papers by Kane over the past decade. As a high-stakes test, the TOEFL required the most serious attention in constructing a validity argument and therefore we attempted to follow the guidelines provided in the Standards. An understanding of those guidelines had proven useful in codifying a shared understanding of what validity evidence could consist of throughout test development. However, when it came to putting that evidence together, the approach suggested in the Standards lacked the structure required to move forward with decisiveness.

FIGURE 4. The structure for a validity argument including a rebuttal. (Adapted from Chapelle, Enright, and Jamieson, 2008, p. 7. Reprinted courtesy of Taylor and Francis.)
GROUNDS: A student's presentation to the class on an assigned topic was characterized by hesitations and mispronunciations.
SO (inference)
CLAIM: The student's English speaking abilities are inadequate for study in an English-medium university.
UNLESS (rebuttal): The topic required highly technical and unfamiliar vocabulary.
SINCE (warrant): Hesitations and mispronunciations are characteristic of students with low levels of English speaking abilities.
BECAUSE OF (backing): This is based on the teacher's training and previous experience.
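To make the role of the rebuttal in Figure 4 concrete, here is a small illustrative sketch under our own, hypothetical naming; it is not drawn from the paper beyond the example's wording, and the rule that any rebuttal defeats the claim is a deliberate simplification of Toulmin's treatment.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ToulminArgument:
    """Toulmin-style practical argument as in Figure 4: grounds lead to a claim
    via a warrant (itself supported by backing), unless a rebuttal applies."""
    grounds: str
    warrant: str
    claim: str
    backing: Optional[str] = None
    rebuttals: List[str] = field(default_factory=list)

    def claim_stands(self) -> bool:
        # Simplification: the claim stands only if the warrant has backing and
        # no rebuttal has been raised against the inference.
        return self.backing is not None and not self.rebuttals

speaking_example = ToulminArgument(
    grounds=("A student's presentation to the class on an assigned topic was "
             "characterized by hesitations and mispronunciations."),
    warrant=("Hesitations and mispronunciations are characteristic of students "
             "with low levels of English speaking abilities."),
    claim=("The student's English speaking abilities are inadequate for study "
           "in an English-medium university."),
    backing="The teacher's training and previous experience.",
)
print(speaking_example.claim_stands())  # True: warrant backed, no rebuttal yet

# Adding the rebuttal from Figure 4 undermines the inference.
speaking_example.rebuttals.append(
    "The assigned topic required highly technical and unfamiliar vocabulary.")
print(speaking_example.claim_stands())  # False: the rebuttal weakens the link
```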

In particular, the reliance on the test construct as a starting point in the Standards was problematic in the TOEFL project, not because the test was solely focused on performance without any interest in a construct, but because the construct was extremely complex. The argument-based approach to validity provided a place for the construct, without relying on it as a starting point or the central piece. The interpretive argument provides a means of specifying the multiple inferences underlying score interpretation and use, thereby removing the enormous burden that might otherwise be placed on an imprecise theoretical construct.

The discussion of theory and practice of validation in educational measurement since at least the 1950s (Cronbach & Meehl, 1955) has been concerned with establishing principled procedures for evaluating the test-based inferences and uses. Discussions about how much and what kind of validity evidence is needed to support inferences and uses of test scores have been ongoing since that time, as has validation research in practice. Our attempt to put the discussion and guidelines into practice revealed areas of tension between the guidelines in the Standards and our specific case in practice which was better addressed by the approach offered in Kane's article in the fourth edition of Educational Measurement. In the case of the TOEFL, the argument-based approach to validation represented a difference from that presented in the Standards and a clear improvement as well. We hope that this and other examples of validity arguments will be considered by the committee charged with the next revision of the Standards.

References

AERA, APA, & NCME (1999). Standards for educational and psychological testing. Washington, DC: AERA.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language testing construction and evaluation. Cambridge: Cambridge University Press.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. (2002). Alternative interpretations of alternative assessments: Some validity issues in educational performance assessments. Educational Measurement: Issues and Practice, 21(3), 5–18.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34.
Bachman, L. F. (2007). What is the construct? The dialectic of abilities and contexts in defining constructs in language assessment. In J. Fox, M. Wesche, & D. Bayliss (Eds.), What are we measuring? Language testing reconsidered (pp. 41–71). Ottawa: University of Ottawa Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425–440.
Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 1–16). Westport, CT: Greenwood Publishing.


Briggs, D. C. (2004). Comment: Making an argument for design validity before interpretive validity. Measurement: Interdisciplinary Research and Perspectives, 2(3), 171–174.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1, 1–47.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 32–70). Cambridge, UK: Cambridge University Press.
Chapelle, C. A. (2008). The TOEFL validity argument. In C. A. Chapelle, M. K. Enright, & J. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 319–352). London: Routledge.
Chapelle, C. A., Enright, M. K., & Jamieson, J. (Eds.) (2008). Building a validity argument for the Test of English as a Foreign Language. London: Routledge.
Chalhoub-Deville, M. (1997). Theoretical models, assessment frameworks, and test construction. Language Testing, 14, 3–22.
Chalhoub-Deville, M. (2001). Task-based assessments: Characteristics and validity evidence. In M. Bygate, P. Skehan, & M. Swain (Eds.), Researching pedagogical tasks: Second language learning, teaching and testing (pp. 210–228). Harlow, UK: Longman.
Cronbach, L. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Cummins, J. P. (1983). Language proficiency and academic achievement. In J. Oller (Ed.), Issues in language testing research (pp. 108–126). Rowley, MA: Newbury House.
Educational Testing Service (2007). TOEFL research: Ensuring test quality. Retrieved August 13, 2008, from http://www.ets.org/Media/Research/pdf/Framework Recent TOEFL Research.pdf
Educational Testing Service (2008). Reliability and comparability of TOEFL® iBT scores. Retrieved October 2, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_iBT_Reliability.pdf
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319–342.
Kane, M. T. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practice, 21, 31–41.
Kane, M. (2004). Certification testing as an illustration of argument-based validation. Measurement, 2(3), 135–170.
Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: Greenwood Publishing.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17.
Linn, R. L. (Ed.) (1989). Educational measurement (3rd ed.). New York: Macmillan.
McNamara, T. (1996). Measuring second language performance. London: Longman.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–62.
Norris, J. M., Brown, J. D., Hudson, T. D., & Bonk, W. (2002). Examinee abilities and task difficulty in task-based second language performance assessment. Language Testing, 19, 395–418.
Oller, J. (1979). Language tests at school. London: Longman.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405–450.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C. W. (Ed.) (1986). Toward communicative competence testing: Proceedings of the second TOEFL invitational conference. TOEFL Research Reports, No. 21. Princeton, NJ: Educational Testing Service.
Toulmin, S. E. (2003). The uses of argument (updated ed.). Cambridge, UK: Cambridge University Press.
Toulmin, S. E., Rieke, R., & Janik, A. (1984). An introduction to reasoning. New York: Macmillan.
Wall, D., & Horák, T. (2006). The impact of changes in the TOEFL examination on teaching: Phase I, The baseline study. TOEFL Monograph Series Report No. 34. Princeton, NJ: Educational Testing Service.
Wall, D., & Horák, T. (2008). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 2, Coping with change. TOEFL iBT Report No. TOEFLiBT-05. Princeton, NJ: Educational Testing Service.
Wall, D., & Horák, T. (submitted for publication). The impact of changes in the TOEFL® examination on teaching and learning in a sample of countries in Europe: Phase 3, The role of the course book, and Phase 4, Describing change. Princeton, NJ: Educational Testing Service.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Basingstoke, UK: Palgrave Macmillan.

