Construct-Irrelevant Variance in High-Stakes Testing
Thomas M. Haladyna, Arizona State University West
Steven M. Downing, University of Illinois at Chicago
There are many threats to validity in high-stakes achievement testing. One major threat is construct-irrelevant variance (CIV). This article defines CIV in the context of the contemporary, unitary view of validity and presents logical arguments, hypotheses, and documentation for a variety of CIV sources that commonly threaten interpretations of test scores. A more thorough study of CIV is recommended.

Keywords: construct-irrelevant variance, high-stakes testing, validity

Thomas M. Haladyna is Professor of Educational Psychology, College of Teacher Education and Leadership, Arizona State University West, FAB 240 South, 4701 West Thunderbird Road, PO Box 37100, Phoenix, AZ 85069-7100; thomas.haladyna@asu.edu. He specializes in developing and evaluating testing programs and in item development and validation.

Steven M. Downing is Associate Professor of Medical Education, University of Illinois at Chicago, College of Medicine, Department of Medical Education (MC 591), 808 S. Wood Street, Chicago, IL 60612; sdowning@uic.edu. He specializes in test development and psychometric issues for achievement and credentialing examinations in the professions.

Currently, achievement testing can be characterized as driven by content standards that affect the planning and delivery of instruction and the design of student assessments. The alignment of content standards, instruction, and the tests used to evaluate student learning is a commonly held paradigm in education. Most states either have in place or are developing comprehensive assessment systems with these features.

In many states, test scores can have several high-stakes uses. For instance, students must pass tests to graduate from high school or to be promoted to the next grade. Schools are evaluated based on test scores and annual progress, and low-performing schools may be subject to intervention. Teachers and school leaders may be evaluated based on student test performance, and their employment and pay may be affected by this evaluation.

This article addresses a serious problem in high-stakes testing: construct-irrelevant variance (CIV). Part 1 reviews the contemporary view of validity, with its emphasis on construct validity. In Part 2, CIV is defined. In Part 3, a taxonomy is presented that organizes sources of CIV, and evidence is presented of CIV's extensiveness in high-stakes testing. Some of these sources of CIV have received more research attention than others.

Part 1: Validity

The most fundamental step in validation is defining the construct. Cronbach and Meehl (1955) called this construct formulation. Two kinds of achievement constructs seem represented in state and national content standards.

The first kind of achievement construct can be envisioned as a large domain of knowledge and skills, sometimes called declarative and procedural knowledge. Any achievement test is intended to be a representative sample from that domain (Haertel & Calfee, 1983). Although test specifications guide the design of these tests, the sample of this domain is usually small. Each student's test score is intended to show status in this domain. This type of achievement construct can be viewed as traditional. Tests of domains of knowledge and skills are consistent with the criterion-referenced movement of the 1970s and 1980s. Multiple-choice (MC) formats work well to measure this kind of achievement construct.

A second kind of achievement construct focuses on a cognitive ability, such as reading, writing, or mathematical problem solving. We can conceive of this ability as consisting of a domain of complex tasks. The theoretical rationale for any cognitive ability comes from cognitive psychology. Other terms used to signify a cognitive ability include fluid abilities (Lohman, 1993), developing abilities (Messick, 1984), and learned abilities (Sternberg, 1998). Each cognitive ability involves contextualized mental models, schemas, or frames, and complex performance that may have multiple correct pathways that depend on knowledge and skills. These abilities are slow growing and difficult to teach and learn. The item format with the greatest fidelity for measuring this kind of construct is performance. Messick (1984) referred to this measurement approach as construct-referenced because the performance test focuses on the ability itself and not on the domain of knowledge and skills that supports it. Mislevy (1996) traced the history of achievement testing and forecasted that cognitive psychology will lead us in a direction that merges learning theory and test theory. For our purposes, both kinds of achievement constructs are susceptible to CIV, but, as we will see, in different ways.
According to the Standards for Educational and Psychological Testing (Standards) (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999): "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (p. 9). Validation is an investigative process by which we (a) create a plausible argument regarding a desired interpretation or use of test scores, (b) collect and organize validity evidence bearing on this argument, and (c) evaluate the argument and the evidence concerning the validity of the interpretation. Kane (1992) described the process for establishing a plausible argument and criteria we might use to evaluate this argument. Kane (2002) also described some nuances of descriptive and policy-based interpretations in high-stakes student achievement testing. He argued that current validation does not go far enough to justify the full array of interpretations and uses.

As the stakes for testing increase, the need for validity evidence also increases (Linn, 2002). The quest for validity evidence can be very complex. This evidence will likely consist of documentation of procedures in test development, administration, scoring, and reporting, and of empirical studies (Downing & Haladyna, 1996; Haladyna, 2002a). The Standards (AERA, APA, & NCME, 1999) present categories of validity evidence that include content, cognitive processes, internal structure of item responses or ratings of performance, reliability, relations of test scores to other variables, and consequences. Essays by Messick (1995a, 1995b) also provide suggestions for types of validity evidence and their importance.

Although we may view validation as a process for strengthening an argument about the validity of a particular interpretation or test score use, Cronbach (1988) noted that validation should also include the testing of alternate hypotheses concerning the validity of an interpretation. Crooks, Kane, and Cohen (1996) provided a comprehensive model for the study of threats to validity. They identified eight linked inferences and argued that a weakness in any link weakens the entire chain. As we acquire and evaluate validity evidence, we may conclude that some evidence is weak or negative. By eliminating or reducing these threats to validity, we can increase our confidence that a desired test score interpretation or use is more valid.

Several writers have proposed ways of organizing and describing threats to validity (e.g., Crooks et al., 1996; Messick, 1984, 1989). At least five major threats to validity stand out: construct underrepresentation arising from poorly conceptualized or inadequately operationalized constructs, faulty logic of the causal inference regarding test scores, negative consequences of test score interpretations and uses, lack of reproducibility of test scores, and CIV. Although all of these threats deserve attention in validation research, this article concentrates on CIV.

Part 2: Construct-Irrelevant Variance

The major point here is that educational achievement tests, at best,

    reflect not only the psychological constructs of knowledge and skills that are intended to be measured, but invariably a number of contaminants. These adulterating influences include a variety of other psychological and situational factors that technically constitute either construct-irrelevant test difficulty or construct-irrelevant contamination in score interpretation. (Messick, 1984, p. 216)

CIV is error variance that arises from systematic error. A good way to think about systematic error is to compare it with random error. If we were to write the linear model representing what we know about random and systematic errors, the model would be

    y = t + e_r + e_s,

where y is the observed test score for any student, t is the true score, e_r is random error, and e_s is systematic error due to CIV. Lord and Novick (1968, pp. 43-44) developed this idea and presented systematic error as a redefined true score that is essentially biased.

Random error is the difference between any observed score and the corresponding true score for each examinee. Both classical and generalizability theory (Brennan, 2001) present methods to study random error. Random error can be large or small, positive or negative. We never know the size of random error for any examinee. Reliability is the ratio of true-score variance to observed-score variance for a set of test scores. Random error and the observed scores are random variables, whereas the true score is a constant (Lord & Novick, 1968, p. 35). Random error is uncorrelated with true and observed scores. The expected value of random error across a set of test scores is zero.
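In classical test theory notation (Lord & Novick, 1968), these statements fit together as follows; the display is our restatement of the text, under the usual assumption that random error is uncorrelated with true score:

```latex
\begin{align*}
y &= t + e_r + e_s, & \mathbb{E}(e_r) &= 0,\\
\sigma_y^2 &= \sigma_t^2 + \sigma_{e_r}^2 && \text{(when } e_s = 0\text{)},\\
\rho_{yy'} &= \sigma_t^2 / \sigma_y^2 && \text{(reliability)},\\
t^{*} &= t + e_s && \text{(the biased, ``redefined'' true score under CIV)}.
\end{align*}
```

Because e_s does not average out across examinees the way e_r does, it shifts the true score itself rather than adding noise around it.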
Systematic error is not random, but group- or person-specific. Construct-irrelevant easiness refers to a contaminating influence that tends to systematically increase test scores for a specific examinee or a group of examinees; construct-irrelevant difficulty does the opposite: it systematically decreases test scores for a specific examinee or a group of examinees. Lord and Novick (1968, p. 43) discussed systematic error as an undesirable change in true score. The change is caused by a variable that is unrelated to the construct being measured. Thus, the change in test score is construct irrelevant. Although random error varies from examinee to examinee, systematic error does not. It is predictable. It also manifests in two types.

The first type of systematic error is constant error for all members of a particular group. This kind of error is characterized by members of a specific group having systematic over- or underestimation of their true scores. A good example of this type of constant error is rater severity in a performance test of a cognitive ability. Suppose two raters are consistently harsh. Student papers evaluated by these two raters will likely contain systematic error that lowers the students' test scores: the group scored by these two harsh raters gets lower scores than it deserves. Test form difficulty is another example of group-specific CIV. Those taking the more difficult test form will have underestimated scores unless test score equating is carried out.
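A small simulation makes the group-specific character of this error concrete. Everything here is hypothetical (two classrooms, a 6-point rubric, a half-point severity); an operational program would instead estimate and adjust for severity directly, for example with many-facet Rasch software such as FACETS (Linacre & Wright, 2004):

```python
import numpy as np

rng = np.random.default_rng(7)

# Two classrooms with identical true writing quality on a 6-point rubric
# (kept continuous here for simplicity).
true_quality = rng.normal(4.0, 0.8, size=(2, 30))

# Hypothetical severities: classroom 0's papers are read by raters who are
# about half a point harsh; classroom 1's raters are typical. The defining
# property is that the error is constant for every member of the group.
severity = np.array([0.5, 0.0])
noise = rng.normal(0.0, 0.3, size=(2, 30))  # ordinary, random rating error
observed = np.clip(true_quality - severity[:, None] + noise, 1.0, 6.0)

# Equal true achievement, yet a constant group-specific error separates
# the classroom means by roughly the severity value.
print(observed.mean(axis=1))
```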
The second type of systematic error is the over- or underestimation of individual examinee scores due to a CIV source that affects examinees differentially. Messick (1989) cited reading ability on subject-matter tests as an example of this kind of CIV. If performance on a test of a construct that is not reading comprehension is strongly dependent on one's level of reading comprehension, then reading comprehension is construct irrelevant, because the definition of the achievement does not include reading comprehension. For instance, two students having equal science achievement may differ in their test performance because one is a better reader than the other. It is reading comprehension that differentiates these two students, not science achievement. By increasing the reading comprehension
Table 1. A Taxonomy for the Study of Systematic Errors Associated with CIV

construct associated with the source of CIV (either domain, ability, or both), give a subjective appraisal of the adequacy of research (abundant, adequate, or needed), and identify the type of CIV error (individual or group specific).

Uniformity and Types of Test Preparation

As noted previously, AERA (2000) has provided a useful set of guidelines regarding high-stakes testing that includes advice about alignment of content and cognitive processes, instruction, and assessment. These guidelines also address opportunity to learn and the provision of remedial opportunities. After assuring that these guidelines have been met, we should also consider the issue of uniform and ethical test preparation. Most testing specialists recommend test preparation (e.g., Nitko, 2001, chapter 14). Two guiding principles in test preparation are (a) no test preparation should violate the ethical standards of our profession, and (b) increases in test scores should be correlated with a corresponding increase in student learning (Popham, 1991).

There are many aspects to test preparation, including (a) giving advice to parents, (b) instructing students based on the curriculum represented by the test, (c) providing examples of different test item formats, (d) motivating students to do their best, and (e) teaching testwiseness: test-taking strategies that include efficient time use, error avoidance, informed guessing, and deductive reasoning.

Whether or not students received test preparation can be a source of CIV. If some students in a reportable unit of analysis, such as a school or school district, have received test preparation and another group of these students has not, how does this difference in test preparation affect the validity of test score interpretations? Differences in performance might not be attributable to sound curriculum design and appropriate and effective instruction, but to the fact that some students received test preparation and others did not.

A second type of CIV associated with test preparation is its extensiveness. There should be some evidence that all students received uniform test preparation. For instance, Nolen, Haladyna, and Haas (1992) reported considerable variation in the amount of test preparation by teachers in one state. Lomax, West, Harmon, Viator, and Madaus (1995) provided evidence of excessive test preparation with educationally disadvantaged students.

A third type of CIV involves unethical types of test preparation. In an article on preparing for a performance test, Mehrens, Popham, and Ryan (1998) offered a set of guidelines that seems applicable to all high-stakes tests. Their first guideline has to do with criterion performance being task- or domain-specific. Their second guideline is that if the criterion performance is domain-specific, we should not teach to the ex-
scores and by so doing misrepresent the achievement of a class, school, or even a school district. Thus, differences in group performance may not be based on actual achievement differences but on who was sampled and excluded.

The problem is more serious when one considers the recent policy, called for in the new Elementary and Secondary Education Act (No Child Left Behind; http://www.ed.gov/nclb/landing.jhtml), whereby schools are labeled as failing. A report from the Nation's Report Card (NCES, 2002) for NAEP Science shows that participation rates by state range considerably from national averages. A National Assessment of Educational Progress (NAEP) report showed that participation rates for students with disabilities can vary by state from 2.6% to 6.7%. Given that these students tend to be low scoring, greater fluctuations in participation can contribute sizably to CIV (Grissmer, Flanagan, Kawata, & Williamson, 2000). Large disparities in participation rates for students with disabilities have also been observed (Erickson, Ysseldyke, & Thurlow, 1996). They stated that such variability in participation rates may be due to the need for accountability and achieving high test scores. Erickson et al. concluded:

    Such variability prohibits valid comparisons between states, and prevents policy-relevant findings to be drawn about how students with disabilities are benefiting from their educational experiences.

Without a doubt, there is an urgent need to ensure through policies and procedures that standardization exists in test participation and exclusion. Variations in these rates directly contribute to CIV when comparisons are made within any unit of analysis. Policies that provide clear guidelines regarding participation and exclusion, coupled with research and documentation of uniform practices, would help alleviate this threat to valid interpretations of achievement scores for schools, school districts, and states.
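The arithmetic of sampling and exclusion is easy to see in a small simulation; the score distribution and the 5% exclusion rate below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical scale scores for one school (mean 200, SD 20). Suppose the
# students a school may exclude sit at the low end of the distribution.
scores = rng.normal(200, 20, size=500)

mean_all = scores.mean()

# Excluding only the lowest-scoring 5% raises the reported school mean by
# roughly two points, with no change in anyone's actual achievement.
cutoff = np.quantile(scores, 0.05)
mean_after_exclusion = scores[scores > cutoff].mean()

print(round(mean_all, 1), round(mean_after_exclusion, 1))
```

Two schools with identical achievement but different exclusion rates would therefore appear to differ, which is exactly the group-specific CIV at issue.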
Computer-based testing. We would not offer computer-based testing (CBT) to any student if we thought the results would be lower than those obtained by paper-and-pencil administration. There is increasing use of CBT, but less frequently do we see documented evidence of the equivalence of CBT and paper-and-pencil testing. Huff and Sireci (2001) raised several important CIV issues related to CBT, such as student proficiency in taking a computerized test, computer platform familiarity, user interface, speededness, and test anxiety. They also noted the potential for incorrect estimates of student scores due to problems with scoring algorithms. Another potential problem with computerized adaptive testing is the heavy demand on mid-difficulty items that provide maximum information. Because these items are the most frequently used, they quickly become overexposed, which is another source of CIV. The potential threats of CIV in the CBT environment have only begun to be explored at this time. Indeed, Standard 12.19 of the Standards (AERA, APA, & NCME, 1999) provides a specific warning about the dangers of CIV related to computerized testing. Besides research reports addressing these problems, technical reports on such testing programs offer an opportunity to document that CBT does not contribute CIV.
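The pull toward mid-difficulty items follows from the item information function. As a sketch, under the common two-parameter logistic model (our illustration; the sources cited here do not single out a particular model):

```latex
P_j(\theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}},
\qquad
I_j(\theta) = a_j^{2}\, P_j(\theta)\,\bigl(1 - P_j(\theta)\bigr).
```

Information peaks where P_j(theta) = .5, that is, at theta = b_j, so an adaptive algorithm keeps selecting items whose difficulty lies near an examinee's provisional ability estimate. Because most examinees cluster near the middle of the score scale, mid-difficulty items are drawn most often and become overexposed unless exposure controls are imposed.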
Calculators in testing. The role of calculators in testing has been an active research topic in item development and test design (Haladyna, 2004). The plausible hypothesis is that students who have calculators have an added advantage over those without calculators in mathematics tests and in other content that may require calculation. A recent report in The Nation's Report Card (NCES, 2002) presented results from the 2000 NAEP showing an interaction of grade level with calculator usage. More frequent use of calculators was correlated with lower scores in grade 4, but the opposite was true at grades 8 and 12. Also, some item types seem more susceptible to better performance with calculators. Thus, calculator usage seems associated with CIV and the type of item being offered. The use of calculators would seem to enhance testing of many types of achievement by providing a higher-fidelity experience. At the same time, the use of calculators must not be permitted to increase CIV. Thus, research is constantly needed to address each new application involving calculators or any other technological innovation.

Test Scoring

Scoring errors. Standard 11.10 reads, "Test users should be alert to the possibility of scoring errors; they should arrange for rescoring if individual scores or aggregated data suggest the need for it" (AERA, APA, & NCME, 1999, p. 115). Indeed, an epidemic of scoring errors has arisen throughout the United States. For example, in Minnesota, 47,000 students received incorrect scores, leading to serious negative consequences for these students and to subsequent lawsuits (Henriques & Steinberg, 2001). More than 20 states have been affected by scoring errors. In Arizona, 12,000 students received incorrect scores due to an error in the scoring key (Bowman, 2001). In Washington, 204,000 essays had to be rescored. Scoring errors or delays also occurred in California, Florida, Georgia, Indiana, Mississippi, New York City, Nevada, North Carolina, South Carolina, Tennessee, and Wisconsin. In the Education Week on the Web Archives (2004), there are 8 listings for scoring-error incidents. In high-stakes testing, especially where critical pass-fail decisions are made, we need stronger, more independent assurance of score accuracy and additional documentation of extra scrutiny in scoring.

Sanitizing answer sheets. "Cleaning up answer sheets" is a recommended practice. For instance, the National Association of Test Directors (2004) provides specific examples of how answer sheets should be sanitized: "Erase all stray marks, darken light marks, and clean up incomplete erasures." Volunteer parents may be asked to "clean up" answer sheets before scoring. For example, they might make incomplete erasures more thorough, since double-marked items are scanned as incorrect. That some schools and districts may sanitize answer sheets while other schools and school districts do not introduces potential CIV. The solution to this validity threat is to have all answer sheets sanitized, as is recommended by nearly all test-scoring services. This threat to validity is primed for studies that explore the frequency of sanitizing and its consequences on test scores.

Test form comparability. The equating of test forms is a standard practice in testing programs. There are many methods for adjusting test scores so that one test form is no more difficult or easy than any other test form. However, it is possible that errors can occur in equating studies. Although research on equating methods is active and important, we have few mechanisms and little documentation for ensuring that equating is
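To make such an adjustment concrete, the sketch below applies linear equating under a randomly-equivalent-groups design; the form statistics are hypothetical, and operational programs more often use equipercentile or IRT methods with anchor items:

```python
import numpy as np

def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Map a Form X raw score onto the Form Y scale so that the two
    forms have matching means and standard deviations."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)

# Hypothetical summary statistics: Form X turned out harder (lower mean),
# so Form X scores are adjusted upward onto the Form Y scale.
form_x = {"mean": 30.0, "sd": 8.0}
form_y = {"mean": 34.0, "sd": 7.0}

raw_x = np.array([22, 30, 38])
print(linear_equate(raw_x, form_x["mean"], form_x["sd"],
                    form_y["mean"], form_y["sd"]))   # [27. 34. 41.]
```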
With so many students being deficient in these verbal abilities, the threat of CIV in these challenging performance tests suggests that research on this problem is very much needed.

Test Anxiety, Motivation, and Fatigue

We know that test anxiety can increase test performance but more generally lowers it. In a meta-analysis of 562 studies, the pattern of student performance in relation to test anxiety is unmistakable and conclusive (Hembree, 1988). Test anxiety can be pernicious in three ways. First, it afflicts many test takers: test anxiety is estimated to affect about 25% of the general population (Hill & Wigfield, 1984). Second, test anxiety can be exacerbated or reduced by imposing certain conditions on the examinees. Hancock (2001) provided experimental evidence in a study with college students that an evaluative threat can increase test anxiety. Zohar (1998) also provided complex experimental evidence that disposition to anxiety and the high-stakes situation contribute to test anxiety. Third, test anxiety can have consequences. For example, Thornton (2001) reported that teachers in training in Great Britain have been so intimidated by teacher testing that they are dropping out of their teacher education programs and making alternative plans. As we can see, not only is test anxiety a powerful source of CIV, it also affects students and preservice teachers.

The motivational level of students may affect test score performance, no matter the achievement level of the student. The manifestation of low motivation may be noncompliance with the test-taking protocol. Students may seriously underperform, make random marks on the answer sheets, omit answers, or not finish the test. The frequency of omitted responses and items not reached are signals of low motivation and noncompliance. Paris, Lawton, Turner, and Roth (1991) found that younger students take large-scale tests more seriously than older students. Schools and school districts take very different approaches to motivating students to perform on these tests. Tactics include threats, parties, prizes, awards, and pep rallies. Whether the tactic is positive or negative, knowing the extensiveness of these practices and the degree of the influence of each of these motivational tactics is important. If the motivational strategies work, then test scores contain CIV, because not all students or schools receive uniform motivational stimulation from school leaders. What may be accounting for differences among schools or school districts might not be real learning, but more effective motivational techniques. Although these motivational techniques are desirable, they should be uniformly applied to ensure that motivation does not become a source of CIV.

While there is no research to report about fatigue in testing, we hypothesize that young students may be more susceptible to fatigue in long testing situations than older students, and the conditions for test administration may interact with different types of students. The effects of fatigue are not well understood or studied, but should we be concerned with the energy level of students as they take long, high-stakes tests? Or is fatigue not a factor in test performance? A related area of concern is the extent to which students eat before testing and are allowed breaks and snacks during a long testing day.

Although we have a promising emerging science of person-fit analysis (Meijer & Sijtsma, 1995), we do not routinely study the item response patterns of students to find out if those patterns suggest anxiety, poor motivation, or fatigue. Some students are plodders who work slowly and correctly but do not finish tests in the allotted time. Studies of examinee fit ought to be routine in large-scale, high-stakes assessments, and evidence supporting any of these student sources of CIV should invalidate the scores or cause us to look for reasons for underperformance other than inadequate learning. Another aspect of this problem is nonresponse: items omitted or not reached (Koretz, Lewis, Skewes-Cox, & Burstein, 1993). The frequency of omitted and not-reached items should signal potential problems with test anxiety, motivation, fatigue, or timing. Yet there is surprisingly little research on these threats to validity.
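One routine check would be a person-fit index. Below is a minimal sketch of the standardized log-likelihood statistic commonly called lz, computed under a Rasch model; the ability and difficulty values are hypothetical and assumed to have been estimated elsewhere:

```python
import numpy as np

def lz_person_fit(u, theta, b):
    """Standardized log-likelihood person-fit index (lz) under a Rasch
    model: 0/1 responses u, ability theta, item difficulties b. Values
    near zero indicate a typical pattern; large negative values flag
    aberrant patterns (possible anxiety, low motivation, fatigue, or
    timing problems) worth investigating before scores are interpreted."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))            # P(correct) per item
    loglik = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (loglik - expected) / np.sqrt(variance)

# Hypothetical 10-item test ordered easy to hard; examinee ability 0.5.
b = np.linspace(-2.0, 2.0, 10)
consistent = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])  # expected pattern
aberrant = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])    # easy missed, hard right
print(lz_person_fit(consistent, 0.5, b))   # positive: fits the model
print(lz_person_fit(aberrant, 0.5, b))     # strongly negative: misfit
```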
Unique Problems of Special Populations

Keeping in mind the admonitions of Messick (1984) that CIV can contaminate both interpretations of test scores and implications we make from knowledge of test scores, we confront the unique problems associated with four often co-mingled populations: students with disabilities, LEP students, students living in poverty, and students living in cultural isolation.

Students with Disabilities or LEP

Federal guidelines and the new Standards (AERA, APA, & NCME, 1999) give considerable attention to the necessity of altering the administration conditions or the test itself to eliminate a disability as a source of CIV. As discussed previously, reading comprehension may be a serious source of CIV. With LEP students, this type of CIV is likely to occur. Chapters 9 and 10 of the Standards provide considerable discussion and offer many standards bearing on what is needed to eliminate CIV when testing students with disabilities and students with LEP. Policies of excluding these students from assessments vary not only within classrooms and schools, but also across school districts and states. Federal law requires that students with disabilities be included in assessments, but the law does not explain which accommodations are acceptable or specify the criteria for accommodation. If such accommodations are carried out uniformly in all school districts and states, then differences in performance will not be due to this source of CIV. Until we have full participation and more uniformity in the way accommodations are offered, comparisons of the performance of students with disabilities and LEP students among units of analysis such as classrooms, schools, and states cannot be considered reasonable or validly interpreted.

Students Living in Cultural Isolation

The measurement of the achievement of students living in culturally homogeneous, isolated communities can be affected in many ways. For instance, students living on Native American reservations have a variety of characteristics that work against effective test performance (Haladyna, 2002b). These students, too, need accommodations in testing and, in some circumstances, alternative assessments. The same case might be made for racial or ethnic communities that live in isolation from the rest of society. While test scores may traditionally be low for these groups, the lack or failure of accommodations and modifications in assessments might account for some of this low performance. The threat of CIV for these populations is similar to that of students with disabil-
Chronicle of Higher Education. (2002, November 21). Two students arrested for alleged high-tech cheating on the GRE. Retrieved March 17, 2004, from http://chronicle.com/free/2002/11/2002112102t.htm

Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Erlbaum.

Cole, N. S., & Moss, P. A. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 201-220). New York: American Council on Education and Macmillan.

Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Erlbaum.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

Crooks, T. J., Kane, M. T., & Cohen, A. S. (1996). Threats to valid use of assessment. Assessment in Education, 3(3), 265-285.

DeMars, C. E. (1998). Gender differences in mathematics and science on a high school proficiency exam: The role of response format. Applied Measurement in Education, 11(3), 279-299.

Downing, S. M. (2002a). Construct-irrelevant variance and flawed test questions: Do multiple-choice item-writing princi-

processes: A review of research in the United States. Review of Educational Research, 65, 145-190.

Garcia, G. E. (1991). Factors influencing the English reading test performance of Spanish-speaking Hispanic children. Reading Research Quarterly, 26, 371-391.

Grissmer, D. W., Flanagan, A. E., Kawata, J. H., & Williamson, S. (2000). Improving student achievement: What state NAEP test scores tell us. Santa Monica, CA: RAND Corporation.

Haertel, E., & Calfee, R. (1983). School achievement: Thinking about what to test. Journal of Educational Measurement, 20(2), 119-131.

Haladyna, T. M. (2002a). Supporting documentation: Assuring more valid test score interpretations and uses. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment for all students: Validity, technical adequacy, and implementation (pp. 89-108). Mahwah, NJ: Erlbaum.

Haladyna, T. M. (2002b). Standardized achievement testing: Validity and accountability. Boston: Allyn & Bacon.

Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Erlbaum.

Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom

Kane, M. T. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practice, 21(1), 31-41.

Koretz, D., Lewis, E., Skewes-Cox, T., & Burstein, L. (1993). Omitted and not-reached items in mathematics in the 1990 National Assessment of Educational Progress (CSE Technical Report 347). Los Angeles, CA: Center for Research on Evaluation, Standards, and Student Testing.

Li, K. (2003, March 5). Fraudulent TOEFL takers face possible deportation. Daily Princetonian. Retrieved March 17, 2004, from http://www.dailyprincetonian.com/archives/2003/03/05/news/7516.shtml

Linacre, J. M., & Wright, B. D. (2004). FACETS: Computer program for many-faceted Rasch measurement [Computer software]. Chicago: MESA Press.

Linn, R. L. (2002). Validation of the uses and interpretations of results of state assessment and accountability systems. In G. Tindal & T. Haladyna (Eds.), Large-scale assessment programs for all students: Development, implementation, and analysis. Mahwah, NJ: Erlbaum.

Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex performance assessment: Expectations and validation criteria. Educational Researcher, 20(8),