Construct-Irrelevant Variance in High-Stakes Testing
Thomas M. Haladyna, Arizona State University West
Steven M. Downing, University of Illinois at Chicago
There are many threats to validity in high-stakes achievement testing. One major threat is construct-irrelevant variance (CIV). This article defines CIV in the context of the contemporary, unitary view of validity and presents logical arguments, hypotheses, and documentation for a variety of CIV sources that commonly threaten interpretations of test scores. A more thorough study of CIV is recommended.

Keywords: construct-irrelevant variance, high-stakes testing, validity

Thomas M. Haladyna is Professor of Educational Psychology, College of Teacher Education and Leadership, Arizona State University West, FAB 240 South, 4701 West Thunderbird Road, PO Box 37100, Phoenix, AZ 85069-7100; thomas.haladyna@asu.edu. He specializes in developing and evaluating testing programs and in item development and validation.

Steven M. Downing is Associate Professor of Medical Education, University of Illinois at Chicago, College of Medicine, Department of Medical Education (MC 591), 808 S. Wood Street, Chicago, IL 60612; sdowning@uic.edu. He specializes in test development and psychometric issues for achievement and credentialing examinations in the professions.

Currently, achievement testing can be characterized as driven by content standards that affect the planning and delivery of instruction and the design of student assessments. The alignment of content standards, instruction, and the tests used to evaluate student learning is a commonly held paradigm in education. Most states either have in place or are developing comprehensive assessment systems with these features.

In many states, test scores can have several high-stakes uses. For instance, students must pass tests to graduate from high school or to be promoted to the next grade. Schools are evaluated based on test scores and annual progress, and low-performing schools may be subject to intervention. Teachers and school leaders may be evaluated based on student test performance, and their employment and pay may be affected by this evaluation.

This article addresses a serious problem in high-stakes testing: construct-irrelevant variance (CIV). Part 1 reviews the contemporary view of validity, with its emphasis on construct validity. In Part 2, CIV is defined. In Part 3, a taxonomy is presented that organizes sources of CIV, and evidence is presented of CIV's extensiveness in high-stakes testing. Some of these sources of CIV have received more research attention than others.

Part 1: Validity

The most fundamental step in validation is defining the construct. Cronbach and Meehl (1955) called this construct formulation. Two kinds of achievement constructs seem represented in state and national content standards.

The first kind of achievement construct can be envisioned as a large domain of knowledge and skills, sometimes called declarative and procedural knowledge. Any achievement test is intended to be a representative sample from that domain (Haertel & Calfee, 1983). Although test specifications guide the design of these tests, the sample of this domain is usually small. Each student's test score is intended to show status in this domain. This type of achievement construct can be viewed as traditional. Tests of domains of knowledge and skills are consistent with the criterion-referenced movement of the 1970s and 1980s. Multiple-choice (MC) formats work well to measure this kind of achievement construct.

A second kind of achievement construct focuses on a cognitive ability, such as reading, writing, or mathematical problem solving. We can conceive of this ability as consisting of a domain of complex tasks. The theoretical rationale for any cognitive ability comes from cognitive psychology. Other terms used to signify a cognitive ability include fluid abilities (Lohman, 1993), developing abilities (Messick, 1984), and learned abilities (Sternberg, 1998). Each cognitive ability involves contextualized mental models, schemas, or frames, and complex performance that may have multiple correct pathways that depend on knowledge and skills. These abilities are slow growing and difficult to teach and learn. The item format with the greatest fidelity for measuring this kind of construct is performance. Messick (1984) referred to this measurement approach as construct-referenced because the performance test focuses on the ability itself and not on the domain of knowledge and skills that supports it. Mislevy (1996) traced the history of achievement testing and forecasted that cognitive psychology will lead us in a direction that merges learning theory and test theory. For our purposes, both kinds of achievement constructs are susceptible to CIV, but, as we will see, in different ways.
According to the Standards for Educational and Psychological Testing (Standards) (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999): "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (p. 9). Validation is an investigative process by which we (a) create a plausible argument regarding a desired interpretation or use of test scores, (b) collect and organize validity evidence bearing on this argument, and (c) evaluate the argument and the evidence concerning the validity of the interpretation. Kane (1992) described the process for establishing a plausible argument and criteria we might use to evaluate this argument. Kane (2002) also described some nuances of descriptive and policy-based interpretations in high-stakes student achievement testing. He argued that current validation does not go far enough to justify the full array of interpretations and uses.

As the stakes for testing increase, the need for validity evidence also increases (Linn, 2002). The quest for validity evidence can be very complex. This evidence will likely consist of documentation of procedures in test development, administration, scoring, and reporting, and of empirical studies (Downing & Haladyna, 1996; Haladyna, 2002a). The Standards (AERA, APA, & NCME, 1999) present categories of validity evidence that include content, cognitive processes, internal structure of item responses or ratings of performance, reliability, relations of test scores to other variables, and consequences. Essays by Messick (1995a, 1995b) also provide suggestions for types of validity evidence and their importance.

Although we may view validation as a process for strengthening an argument about the validity of a particular interpretation or test score use, Cronbach (1988) noted that validation should also include the testing of alternate hypotheses concerning the validity of an interpretation. Crooks, Kane, and Cohen (1996) provided a comprehensive model for the study of threats to validity. They identified eight linked inferences and argued that a weakness in any link weakens the entire chain. As we acquire and evaluate validity evidence, we may conclude that some evidence is weak or negative. By eliminating or reducing these threats to validity, we can increase our confidence that a desired test score interpretation or use is more valid.

Several writers have proposed ways of organizing and describing threats to validity (e.g., Crooks et al., 1996; Messick, 1984, 1989). At least five major threats to validity stand out: construct underrepresentation arising from poorly conceptualized or inadequately operationalized constructs, faulty logic of the causal inference regarding test scores, negative consequences of test score interpretations and uses, lack of reproducibility of test scores, and CIV. Although all of these threats deserve attention in validation research, this article concentrates on CIV.

Part 2: Construct-Irrelevant Variance

The major point here is that educational achievement tests, at best,

    reflect not only the psychological constructs of knowledge and skills that are intended to be measured, but invariably a number of contaminants. These adulterating influences include a variety of other psychological and situational factors that technically constitute either construct-irrelevant test difficulty or construct-irrelevant contamination in score interpretation. (Messick, 1984, p. 216)

CIV is error variance that arises from systematic error. A good way to think about systematic error is to compare it with random error. If we were to write the linear model representing what we know about random and systematic errors, the model would be

    y = t + e_r + e_s,

where y is the observed test score for any student, t is the true score, e_r is random error, and e_s is systematic error due to CIV. Lord and Novick (1968, pp. 43-44) developed this idea and presented systematic error as a redefined true score that is essentially biased.

Random error is the difference between any observed score and the corresponding true score for each examinee. Both classical and generalizability theory (Brennan, 2001) present methods to study random error. Random error can be large or small, positive or negative. We never know the size of random error for any examinee. Reliability is the ratio of true-score variance to observed-score variance for a set of test scores. Random error and the observed scores are random variables, whereas the true score is a constant (Lord & Novick, 1968, p. 35). Random error is uncorrelated with true and observed scores. The expected value of random error across a set of test scores is zero.
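In classical test theory notation (Lord & Novick, 1968), these statements fit together as follows; the display is our restatement of the text, under the usual assumption that random error is uncorrelated with true score:

```latex
\begin{align*}
y &= t + e_r + e_s, & \mathbb{E}(e_r) &= 0,\\
\sigma_y^2 &= \sigma_t^2 + \sigma_{e_r}^2 && \text{(when } e_s = 0\text{)},\\
\rho_{yy'} &= \sigma_t^2 / \sigma_y^2 && \text{(reliability)},\\
t^{*} &= t + e_s && \text{(the biased, ``redefined'' true score under CIV)}.
\end{align*}
```

Because e_s does not average out across examinees the way e_r does, it shifts the true score itself rather than adding noise around it.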
Systematic error is not random, but group- or person-specific. Construct-irrelevant easiness refers to a contaminating influence that tends to systematically increase test scores for a specific examinee or a group of examinees; construct-irrelevant difficulty does the opposite: it systematically decreases test scores for a specific examinee or a group of examinees. Lord and Novick (1968, p. 43) discussed systematic error as an undesirable change in true score. The change is caused by a variable that is unrelated to the construct being measured. Thus, the change in test score is construct irrelevant. Although random error varies from examinee to examinee, systematic error does not. It is predictable. It also manifests in two types.

The first type of systematic error is constant error for all members of a particular group. This kind of error is characterized by members of a specific group having systematic over- or underestimation of their true scores. A good example of this type of constant error is rater severity in a performance test of a cognitive ability. Suppose two raters are consistently harsh. Student papers evaluated by these two raters will likely contain systematic error that lowers the students' test scores: the group scored by these two harsh raters gets lower scores than it deserves. Test form difficulty is another example of group-specific CIV. Those taking the more difficult test form will have underestimated scores unless test score equating is carried out.
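A small simulation makes the group-specific character of this error concrete. Everything here is hypothetical (two classrooms, a 6-point rubric, a half-point severity); an operational program would instead estimate and adjust for severity directly, for example with many-facet Rasch software such as FACETS (Linacre & Wright, 2004):

```python
import numpy as np

rng = np.random.default_rng(7)

# Two classrooms with identical true writing quality on a 6-point rubric
# (kept continuous here for simplicity).
true_quality = rng.normal(4.0, 0.8, size=(2, 30))

# Hypothetical severities: classroom 0's papers are read by raters who are
# about half a point harsh; classroom 1's raters are typical. The defining
# property is that the error is constant for every member of the group.
severity = np.array([0.5, 0.0])
noise = rng.normal(0.0, 0.3, size=(2, 30))  # ordinary, random rating error
observed = np.clip(true_quality - severity[:, None] + noise, 1.0, 6.0)

# Equal true achievement, yet a constant group-specific error separates
# the classroom means by roughly the severity value.
print(observed.mean(axis=1))
```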
The second type of systematic error is the over- or underestimation of individual examinee scores due to a CIV source that affects examinees differentially. Messick (1989) cited reading ability on subject-matter tests as an example of this kind of CIV. If performance on a test of a construct that is not reading comprehension is strongly dependent on one's level of reading comprehension, then reading comprehension is construct irrelevant, because the definition of the achievement does not include reading comprehension. For instance, two students having equal science achievement may differ in their test performance because one is a better reader than the other. It is reading comprehension that differentiates these two students, not science achievement. By increasing the reading comprehension
Table 1. A Taxonomy for the Study of Systematic Errors Associated with CIV

construct associated with the source of CIV (either domain, ability, or both), give a subjective appraisal of the adequacy of research (abundant, adequate, or needed), and identify the type of CIV error (individual or group specific).

Uniformity and Types of Test Preparation

As noted previously, AERA (2000) has provided a useful set of guidelines regarding high-stakes testing that includes advice about alignment of content and cognitive processes, instruction, and assessment. These guidelines also address opportunity to learn and the provision of remedial opportunities. After assuring that these guidelines have been met, we should also consider the issue of uniform and ethical test preparation. Most testing specialists recommend test preparation (e.g., Nitko, 2001, chapter 14). Two guiding principles in test preparation are (a) no test preparation should violate the ethical standards of our profession, and (b) increases in test scores should be correlated with a corresponding increase in student learning (Popham, 1991).

There are many aspects to test preparation, including (a) giving advice to parents, (b) instructing students based on the curriculum represented by the test, (c) providing examples of different test item formats, (d) motivating students to do their best, and (e) teaching testwiseness: test-taking strategies that include efficient time use, error avoidance, informed guessing, and deductive reasoning.

Whether or not students received test preparation can be a source of CIV. If some students in a reportable unit of analysis, such as a school or school district, have received test preparation and another group of these students has not, how does this difference in test preparation affect the validity of test score interpretations? Differences in performance might not be attributable to sound curriculum design and appropriate and effective instruction, but to the fact that some students received test preparation and others did not.

A second type of CIV associated with test preparation is its extensiveness. There should be some evidence that all students received uniform test preparation. For instance, Nolen, Haladyna, and Haas (1992) reported considerable variation in the amount of test preparation by teachers in one state. Lomax, West, Harmon, Viator, and Madaus (1995) provided evidence of excessive test preparation with educationally disadvantaged students.

A third type of CIV involves unethical types of test preparation. In an article on preparing for a performance test, Mehrens, Popham, and Ryan (1998) offered a set of guidelines that seems applicable to all high-stakes tests. Their first guideline has to do with criterion performance being task- or domain-specific. Their second guideline is that if the criterion performance is domain-specific, we should not teach to the ex-
scores and by so doing misrepresent the achievement of a class, school, or even a school district. Thus, differences in group performance may not be based on actual achievement differences but on who was sampled and excluded.

The problem is more serious when one considers the recent policy, called for in the new Elementary and Secondary Education Act (No Child Left Behind; http://www.ed.gov/nclb/landing.jhtml), whereby schools are labeled as failing. A report from the Nation's Report Card (NCES, 2002) for NAEP Science shows that participation rates by state range considerably from national averages. A National Assessment of Educational Progress (NAEP) report showed that participation rates for students with disabilities can vary by state from 2.6% to 6.7%. Given that these students tend to be low scoring, greater fluctuations in participation can contribute sizably to CIV (Grissmer, Flanagan, Kawata, & Williamson, 2000). Large disparities in participation rates for students with disabilities have also been observed (Erickson, Ysseldyke, & Thurlow, 1996). They stated that such variability in participation rates may be due to the need for accountability and achieving high test scores. Erickson et al. concluded:

    Such variability prohibits valid comparisons between states, and prevents policy-relevant findings to be drawn about how students with disabilities are benefiting from their educational experiences.

Without a doubt, there is an urgent need to ensure through policies and procedures that standardization exists in test participation and exclusion. Variations in these rates directly contribute to CIV when comparisons are made within any unit of analysis. Policies that provide clear guidelines regarding participation and exclusion, coupled with research and documentation of uniform practices, would help alleviate this threat to valid interpretations of achievement scores for schools, school districts, and states.
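The arithmetic of sampling and exclusion is easy to see in a small simulation; the score distribution and the 5% exclusion rate below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical scale scores for one school (mean 200, SD 20). Suppose the
# students a school may exclude sit at the low end of the distribution.
scores = rng.normal(200, 20, size=500)

mean_all = scores.mean()

# Excluding only the lowest-scoring 5% raises the reported school mean by
# roughly two points, with no change in anyone's actual achievement.
cutoff = np.quantile(scores, 0.05)
mean_after_exclusion = scores[scores > cutoff].mean()

print(round(mean_all, 1), round(mean_after_exclusion, 1))
```

Two schools with identical achievement but different exclusion rates would therefore appear to differ, which is exactly the group-specific CIV at issue.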
Computer-based testing. We would not offer computer-based testing (CBT) to any student if we thought the results would be lower than those obtained by paper-and-pencil administration. There is increasing use of CBT, but less frequently do we see documented evidence of the equivalence of CBT and paper-and-pencil testing. Huff and Sireci (2001) raised several important CIV issues related to CBT, such as student proficiency in taking a computerized test, computer platform familiarity, user interface, speededness, and test anxiety. They also noted the potential for incorrect estimates of student scores due to problems with scoring algorithms. Another potential problem with computerized adaptive testing is the heavy demand on mid-difficulty items that provide maximum information. Because these items are the most frequently used, they quickly become overexposed, which is another source of CIV. The potential threats of CIV in the CBT environment have only begun to be explored at this time. Indeed, Standard 12.19 of the Standards (AERA, APA, & NCME, 1999) provides a specific warning about the dangers of CIV related to computerized testing. Besides research reports addressing these problems, technical reports on such testing programs offer an opportunity to document that CBT does not contribute CIV.
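The pull toward mid-difficulty items follows from the item information function. As a sketch, under the common two-parameter logistic model (our illustration; the sources cited here do not single out a particular model):

```latex
P_j(\theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}},
\qquad
I_j(\theta) = a_j^{2}\, P_j(\theta)\,\bigl(1 - P_j(\theta)\bigr).
```

Information peaks where P_j(theta) = .5, that is, at theta = b_j, so an adaptive algorithm keeps selecting items whose difficulty lies near an examinee's provisional ability estimate. Because most examinees cluster near the middle of the score scale, mid-difficulty items are drawn most often and become overexposed unless exposure controls are imposed.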
Calculators in testing. The role of calculators in testing has been an active research topic in item development and test design (Haladyna, 2004). The plausible hypothesis is that students who have calculators have an added advantage over those without calculators in mathematics tests and in other content that may require calculation. A recent report in The Nation's Report Card (NCES, 2002) presented results from the 2000 NAEP showing an interaction of grade level with calculator usage. More frequent use of calculators was correlated with lower scores in grade 4, but the opposite was true at grades 8 and 12. Also, some item types seem more susceptible to better performance with calculators. Thus, calculator usage seems associated with CIV and the type of item being offered. The use of calculators would seem to enhance testing of many types of achievement by providing a higher-fidelity experience. At the same time, the use of calculators must not be permitted to increase CIV. Thus, research is constantly needed to address each new application involving calculators or any other technological innovation.

Test Scoring

Scoring errors. Standard 11.10 reads, "Test users should be alert to the possibility of scoring errors; they should arrange for rescoring if individual scores or aggregated data suggest the need for it" (AERA, APA, & NCME, 1999, p. 115). Indeed, an epidemic of scoring errors has arisen throughout the United States. For example, in Minnesota, 47,000 students received incorrect scores, leading to serious negative consequences for these students and to subsequent lawsuits (Henriques & Steinberg, 2001). More than 20 states have been affected by scoring errors. In Arizona, 12,000 students received incorrect scores due to an error in the scoring key (Bowman, 2001). In Washington, 204,000 essays had to be rescored. Scoring errors or delays also occurred in California, Florida, Georgia, Indiana, Mississippi, New York City, Nevada, North Carolina, South Carolina, Tennessee, and Wisconsin. In the Education Week on the Web Archives (2004), there are 8 listings for scoring-error incidents. In high-stakes testing, especially where critical pass-fail decisions are made, we need stronger, more independent assurance of score accuracy and additional documentation of extra scrutiny in scoring.

Sanitizing answer sheets. "Cleaning up answer sheets" is a recommended practice. For instance, the National Association of Test Directors (2004) provides specific examples of how answer sheets should be sanitized: "Erase all stray marks, darken light marks, and clean up incomplete erasures." Volunteer parents may be asked to "clean up" answer sheets before scoring. For example, they might make incomplete erasures more thorough, since double-marked items are scanned as incorrect. That some schools and districts may sanitize answer sheets while other schools and school districts do not introduces potential CIV. The solution to this validity threat is to have all answer sheets sanitized, as is recommended by nearly all test-scoring services. This threat to validity is primed for studies that explore the frequency of sanitizing and its consequences on test scores.

Test form comparability. The equating of test forms is a standard practice in testing programs. There are many methods for adjusting test scores so that one test form is no more difficult or easy than any other test form. However, it is possible that errors can occur in equating studies. Although research on equating methods is active and important, we have few mechanisms and little documentation for ensuring that equating is
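To make such an adjustment concrete, the sketch below applies linear equating under a randomly-equivalent-groups design; the form statistics are hypothetical, and operational programs more often use equipercentile or IRT methods with anchor items:

```python
import numpy as np

def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Map a Form X raw score onto the Form Y scale so that the two
    forms have matching means and standard deviations."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)

# Hypothetical summary statistics: Form X turned out harder (lower mean),
# so Form X scores are adjusted upward onto the Form Y scale.
form_x = {"mean": 30.0, "sd": 8.0}
form_y = {"mean": 34.0, "sd": 7.0}

raw_x = np.array([22, 30, 38])
print(linear_equate(raw_x, form_x["mean"], form_x["sd"],
                    form_y["mean"], form_y["sd"]))   # [27. 34. 41.]
```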
With so many students being deficient in these verbal abilities, the threat of CIV in these challenging performance tests suggests that research on this problem is very much needed.

Test Anxiety, Motivation, and Fatigue

We know that test anxiety can increase test performance but more generally lowers it. In a meta-analysis of 562 studies, the pattern of student performance in relation to test anxiety is unmistakable and conclusive (Hembree, 1988). Test anxiety can be pernicious in three ways. First, it afflicts many test takers: test anxiety is estimated to affect about 25% of the general population (Hill & Wigfield, 1984). Second, test anxiety can be exacerbated or reduced by imposing certain conditions on the examinees. Hancock (2001) provided experimental evidence in a study with college students that an evaluative threat can increase test anxiety. Zohar (1998) also provided complex experimental evidence that disposition to anxiety and the high-stakes situation contribute to test anxiety. Third, test anxiety can have consequences. For example, Thornton (2001) reported that teachers in training in Great Britain have been so intimidated by teacher testing that they are dropping out of their teacher education programs and making alternative plans. As we can see, not only is test anxiety a powerful source of CIV, it also affects students and preservice teachers.

The motivational level of students may affect test score performance, no matter the achievement level of the student. The manifestation of low motivation may be noncompliance with the test-taking protocol. Students may seriously underperform, make random marks on the answer sheets, omit answers, or not finish the test. The frequency of omitted responses and items not reached are signals of low motivation and noncompliance. Paris, Lawton, Turner, and Roth (1991) found that younger students take large-scale tests more seriously than older students. Schools and school districts take very different approaches to motivating students to perform on these tests. Tactics include threats, parties, prizes, awards, and pep rallies. Whether the tactic is positive or negative, knowing the extensiveness of these practices and the degree of the influence of each of these motivational tactics is important. If the motivational strategies work, then test scores contain CIV, because not all students or schools receive uniform motivational stimulation from school leaders. What may be accounting for differences among schools or school districts might not be real learning, but more effective motivational techniques. Although these motivational techniques are desirable, they should be uniformly applied to ensure that motivation does not become a source of CIV.

While there is no research to report about fatigue in testing, we hypothesize that young students may be more susceptible to fatigue in long testing situations than older students, and the conditions for test administration may interact with different types of students. The effects of fatigue are not well understood or studied, but should we be concerned with the energy level of students as they take long, high-stakes tests? Or is fatigue not a factor in test performance? A related area of concern is the extent to which students eat before testing and are allowed breaks and snacks during a long testing day.

Although we have a promising emerging science of person-fit analysis (Meijer & Sijtsma, 1995), we do not routinely study the item response patterns of students to find out if those patterns suggest anxiety, poor motivation, or fatigue. Some students are plodders who work slowly and correctly but do not finish tests in the allotted time. Studies of examinee fit ought to be routine in large-scale, high-stakes assessments, and evidence supporting any of these student sources of CIV should invalidate the scores or cause us to look for reasons for underperformance other than inadequate learning. Another aspect of this problem is nonresponse: items omitted or not reached (Koretz, Lewis, Skewes-Cox, & Burstein, 1993). The frequency of omitted and not-reached items should signal potential problems with test anxiety, motivation, fatigue, or timing. Yet there is surprisingly little research on these threats to validity.
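One routine check would be a person-fit index. Below is a minimal sketch of the standardized log-likelihood statistic commonly called lz, computed under a Rasch model; the ability and difficulty values are hypothetical and assumed to have been estimated elsewhere:

```python
import numpy as np

def lz_person_fit(u, theta, b):
    """Standardized log-likelihood person-fit index (lz) under a Rasch
    model: 0/1 responses u, ability theta, item difficulties b. Values
    near zero indicate a typical pattern; large negative values flag
    aberrant patterns (possible anxiety, low motivation, fatigue, or
    timing problems) worth investigating before scores are interpreted."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))            # P(correct) per item
    loglik = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (loglik - expected) / np.sqrt(variance)

# Hypothetical 10-item test ordered easy to hard; examinee ability 0.5.
b = np.linspace(-2.0, 2.0, 10)
consistent = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])  # expected pattern
aberrant = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])    # easy missed, hard right
print(lz_person_fit(consistent, 0.5, b))   # positive: fits the model
print(lz_person_fit(aberrant, 0.5, b))     # strongly negative: misfit
```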
Unique Problems of Special Populations

Keeping in mind the admonitions of Messick (1984) that CIV can contaminate both interpretations of test scores and implications we make from knowledge of test scores, we confront the unique problems associated with four often co-mingled populations: students with disabilities, LEP students, students living in poverty, and students living in cultural isolation.

Students with Disabilities or LEP

Federal guidelines and the new Standards (AERA, APA, & NCME, 1999) give considerable attention to the necessity of altering the administration conditions or the test itself to eliminate a disability as a source of CIV. As discussed previously, reading comprehension may be a serious source of CIV. With LEP students, this type of CIV is likely to occur. Chapters 9 and 10 of the Standards provide considerable discussion and offer many standards bearing on what is needed to eliminate CIV when testing students with disabilities and students with LEP. Policies of excluding these students from assessments vary not only within classrooms and schools, but also across school districts and states. Federal law requires that students with disabilities be included in assessments, but the law does not explain which accommodations are acceptable or specify the criteria for accommodation. If such accommodations are carried out uniformly in all school districts and states, then differences in performance will not be due to this source of CIV. Until we have full participation and more uniformity in the way accommodations are offered, comparisons of the performance of students with disabilities and LEP students among units of analysis such as classrooms, schools, and states cannot be considered reasonable or validly interpreted.

Students Living in Cultural Isolation

The measurement of the achievement of students living in culturally homogeneous, isolated communities can be affected in many ways. For instance, students living on Native American reservations have a variety of characteristics that work against effective test performance (Haladyna, 2002b). These students, too, need accommodations in testing and, in some circumstances, alternative assessments. The same case might be made for racial or ethnic communities that live in isolation from the rest of society. While test scores may traditionally be low for these groups, the lack or failure of accommodations and modifications in assessments might account for some of this low performance. The threat of CIV for these populations is similar to that of students with disabil-
Chronicle of Higher Education. (2002, November 21). Two students arrested for alleged high-tech cheating on the GRE. Retrieved March 17, 2004, from http://chronicle.com/free/2002/11/2002112102t.htm

Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Erlbaum.

Cole, N. S., & Moss, P. A. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 201-220). New York: American Council on Education and Macmillan.

Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Erlbaum.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

Crooks, T. J., Kane, M. T., & Cohen, A. S. (1996). Threats to valid use of assessment. Assessment in Education, 3(3), 265-285.

DeMars, C. E. (1998). Gender differences in mathematics and science on a high school proficiency exam: The role of response format. Applied Measurement in Education, 11(3), 279-299.

Downing, S. M. (2002a). Construct-irrelevant variance and flawed test questions: Do multiple-choice item-writing princi-

processes: A review of research in the United States. Review of Educational Research, 65, 145-190.

Garcia, G. E. (1991). Factors influencing the English reading test performance of Spanish-speaking Hispanic children. Reading Research Quarterly, 26, 371-391.

Grissmer, D. W., Flanagan, A. E., Kawata, J. H., & Williamson, S. (2000). Improving student achievement: What state NAEP test scores tell us. Santa Monica, CA: RAND Corporation.

Haertel, E., & Calfee, R. (1983). School achievement: Thinking about what to test. Journal of Educational Measurement, 20(2), 119-131.

Haladyna, T. M. (2002a). Supporting documentation: Assuring more valid test score interpretations and uses. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment for all students: Validity, technical adequacy, and implementation (pp. 89-108). Mahwah, NJ: Erlbaum.

Haladyna, T. M. (2002b). Standardized achievement testing: Validity and accountability. Boston: Allyn & Bacon.

Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Erlbaum.

Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom

Kane, M. T. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practice, 21(1), 31-41.

Koretz, D., Lewis, E., Skewes-Cox, T., & Burstein, L. (1993). Omitted and not-reached items in mathematics in the 1990 National Assessment of Educational Progress (CSE Technical Report 347). Los Angeles, CA: Center for Research on Evaluation, Standards, and Student Testing.

Li, K. (2003, March 5). Fraudulent TOEFL takers face possible deportation. Daily Princetonian. Retrieved March 17, 2004, from http://www.dailyprincetonian.com/archives/2003/03/05/news/7516.shtml

Linacre, J. M., & Wright, B. D. (2004). FACETS: Computer program for many-faceted Rasch measurement [Computer software]. Chicago: MESA Press.

Linn, R. L. (2002). Validation of the uses and interpretations of results of state assessment and accountability systems. In G. Tindal & T. Haladyna (Eds.), Large-scale assessment programs for all students: Development, implementation, and analysis. Mahwah, NJ: Erlbaum.

Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex performance assessment: Expectations and validation criteria. Educational Researcher, 20(8),