Pergamon
Int. J. Educ. Res., Vol. 27, No. 5, pp. 355-445, 1997
© 1997 Elsevier Science Ltd. All rights reserved
Printed in Great Britain
0883-0355/97 $32.00

PII: S0883-0355(97)00039-6

EDUCATIONAL TESTING AND ASSESSMENT:


LESSONS FROM THE PAST,
DIRECTIONS FOR THE FUTURE

LORIN W. ANDERSON
(GUEST EDITOR)

University of South Carolina

CONTENTS
CHAPTER 1 EDITOR'S INTRODUCTION - - Lorin W. Anderson 357

CHAPTER 2 LESSONS FROM THE HISTORY OF INTELLIGENCE TESTING - -


David F. Lohman 359
Theme I. Study the Classics 360
Theme II. The Role of Personal Beliefs 361
Theme III. Psychometrics Versus Psychology 371
Conclusions 375
References 375
Biography 378

CHAPTER 3 FUTURE DIRECTIONS FOR NORM-REFERENCED AND CRITERION-


REFERENCED TESTING - - Ronald K. Hambleton and Stephen G. Sireci 379
New Directions in Achievement Testing: An Overview 379
The Shift Toward Constructed-Response Item Formats 380
Emphasis on Higher-Order Cognitive Skills 381
Demands for Criterion-Referenced Tests and Score Interpretations 383
Technological Advances in Measurement Models 385
Advances in Reliability and Validity Theory 387
The Influence of Computer Technology on Achievement Testing 389
Conclusions 390
References 391
Biographies 393

CHAPTER 4 LARGE-SCALE ASSESSMENT IN SUPPORT OF SCHOOL REFORM: LESSONS


IN THE SEARCH FOR ALTERNATIVE MEASURES - - Joan L. Herman 395
Alternative Assessment: A Rose by Many Names 396
Alternative Assessment as a Key to Reform 396
Issues and Status in Establishing Technical Quality of Alternative Assessments 399


Issues and Status in Fairness in Alternative Assessment 404


Consequences of Alternative Assessment 406
What Next? 409
References 410
Biography 413

CHAPTER 5 ASSESSMENT AS A MAJOR INFLUENCE ON LEARNING AND INSTRUCTION


- - Filip J. R. C. Dochy and George Moerkerke 415
The Need for Research on Performance Assessment in Small-Scale Instructional
Settings 415
Changing Education Today: Powerful Learning Environments and the Need for New
Assessment Instruments 422
The Future: Integrating Learning, Instruction, and Assessment Through Flexible and
Transformative Learning in Powerful Learning Environments 424
Conclusions 427
References 429
Biographies 432

CHAPTER 6 EMPIRICISM AND VALUES: TWO FACES OF EDUCATIONAL CHANGE - -


Peter W. Airasian 433
David Lohman 433
Joan Herman 435
Filip Dochy and George Moerkerke 437
Ronald Hambleton and Stephen Sireci 439
Integrating Themes 441
Conclusion 444
References 445
Biography 445
CHAPTER 1

EDITOR'S INTRODUCTION

LORIN W. ANDERSON

University of South Carolina, College of Education, Department of Educational Leadership &
Policies, Columbia, SC 29208, U.S.A.

The idea for this issue of IJER came to me shortly after a discussion I had with a doctoral student
in teacher education about two years ago. The student stopped me in the hall and said she wanted
to do her dissertation research in the area of authentic assessment. She then asked if I could give
her a reading list on the topic that she could use as the starting point for her literature review. I
invited her into my office, sifted through a few recent issues of journals, and gave her the names
and addresses of people who I knew were engaged in research and writing in the field. As she
was leaving, I stopped her and said, "You also might want to do some reading in the areas of
intelligence testing and standardized achievement tests." "Why?" she responded. "Everyone knows
that intelligence tests are culturally biased and standardized achievement tests are invalid." As
she walked away, I found myself pondering two questions. How many other advanced doctoral
students would have made the same comment? What does her comment say about the knowledge
educators have about testing and assessment?
A few months later I was engaged in a discussion with a colleague about the increasing
fragmentation of the educational research community. He gave as an example the ever-widening
rift that he saw developing between those operating from an interpretative perspective and those
operating from a logical-positivist one. (See Smith, 1997, for more recent discussion of this rift,
complete with examples.) My response was that there was a great deal of fragmentation even
among those operating from the same methodological perspective within a single field. The dif-
ference was that this latter type of fragmentation did not seem to generate as much hostility;
rather, one group was simply unaware of what other groups were doing. In defense of my conten-
tion, I used the field of testing and assessment. I said, "If you look at the reference lists of those
working in intelligence testing, standardized achievement testing, and alternative assessment,
you will find they do not overlap."* I then asked, "Don't you think someone working in the field
of educational testing and assessment should have a breadth of understanding that comes from
reading across these areas?" As soon as the question left my lips, I thought back to the graduate
student. If "experts" in our field are so fragmented, how could we expect students to be otherwise?
The next four chapters are intended to provide the needed breadth. In Chapter 2, David Lohman

*The interested reader can examine the validity of my contention by perusing the numerous references included in this
issue.

presents an intriguing story of intelligence testing and those who were pioneers in the field. In
Chapter 3, Ronald Hambleton and Stephen Sireci look into their crystal ball and describe what
they see as the future of standardized achievement testing. In Chapter 4, Joan Herman shares the
lessons she has learned from work on alternative assessments and their implications for large-
scale assessment. In Chapter 5, Filip Dochy and George Moerkerke discuss assessment in the
service of learning and instruction, in terms of both lessons learned and future directions. In the
final chapter, Peter Airasian provides a cross-chapter analysis and arrives at two integrating themes:
the role of social context in testing and assessment, and the issue of how much we really know
after decades of thought, research, and practice.
It is my hope that if I gave this issue to the graduate student, she would at least question the
perceptions of intelligence testing and standardized achievement testing she held at the time of
our conversation. She also may be somewhat less naive concerning the flawless nature and
unlimited potential of alternative assessments. My hope for all readers of this issue is the same.

Reference

Smith, J. K. (1997). The stories educational researchers tell about themselves. Educational Researcher, 26(5), 4-11.


CHAPTER 2

LESSONS FROM THE HISTORY OF INTELLIGENCE TESTING

DAVID F. LOHMAN

The University of Iowa, College of Education, 361 Lindquist Ctr. N, Iowa City, IA 52242-1529,
U.S.A.

What can we learn from the history of intelligence testing that might improve future assess-
ments? Much. But how to tell it? Anyone who has read both old and new could tell tales of
insights overlooked or repeatedly rediscovered, of long and carefully conducted research programs
that were lost entirely or reduced to brief, often inaccurate caricatures in modern summaries, of
roads not taken that would have changed the whole. Such a litany will not be attempted here.
Instead, for no better reason than that three seems to be the magic number in the literature on
human intelligence, three general themes are identified and then elaborated in turn. The first and
last points require relatively brief elaboration. The second is more difficult to document and so it
comprises the bulk of this chapter.
The general themes are:
Theme 1. The developers of intelligence tests were not as narrow minded as they are often
made out to be; and, as a necessary corollary, nor are we as clever as some would have it. Those
who have not themselves read widely from the books and articles of luminaries such as Binet,
Spearman, Thorndike, and Stern are not so much condemned to repeat history (as Santayana
claimed) as they are to say and write silly things. Nevertheless, partly because of our ignorance
of the past, but mainly because there are larger tides in the affairs of humankind, controversies
about intelligence repeat themselves. Transitions that look new in the short run often look familiar
in the long run. We may never step into the same river twice, but we do have a habit of rediscover-
ing the same stepping stones and potholes therein.
Theme 2. Then and now, theories of intelligence are the product not only of data and argument,
but also of the personal proclivities and professional experiences of theorists, of their beliefs
about what science is and how it should be conducted, and of the larger social, political, and
religious themes that form the fabric of the cultures in which they live.
Theme 3. Some of the most important changes in the format of intelligence tests were dictated
more by the demands for efficiency and reliability than by psychological theory. Particularly
noteworthy were (a) the shift from individually administered tests to large batteries of group-
administered, paper-and-pencil tests, and (b) the shift from tests that required the examiner to
make judgments about how items or tasks were solved to tests in which examiners merely tabulated
the number of items solved correctly. The underlying tension here was (and continues to be)
between those who place higher priority on the statistical properties of test scores - - particularly
their factor structure and reliability - - than on the psychology of the tasks.

Theme I: Study the Classics

The life work of a great and productive scholar cannot be compressed into a few sentences.
Those who take the time to read (or re-read) Spearman or Binet or Thorndike will be struck by
how much larger their views were than even the best summary suggests. The surprise is greatest
if one's reading has been limited to accounts in textbooks and other secondary sources. For
example, I have been concerned for several years now with the effects of practice on mental
tests, particularly tests of spatial abilities. With the notable exception of Ackerman's studies of
individual differences in skill acquisition (see, e.g., Ackerman, 1987), nowhere in that literature
have I seen references to the extensive review of practice and transfer effects that Thorndike
reports in his 1913 Educational Psychology text. If Thorndike is mentioned at all, it is in a fairly
standard comment (often dismissive and usually inaccurate) about the theory of identical ele-
ments. Those who take time to read the original will find much more.
More importantly, Thorndike's studies of practice and transfer were pivotal in his attempts to
develop a better intelligence test than Binet had assembled. In the end, though, he used well-
practiced school tasks in his intelligence tests because his studies showed large practice effects
when students attempted novel problems. So called "performance tests" were particularly labile
to practice. A generation of psychologists trained after the development of the Wechsler scales
seems not to have worried much about such problems, even though practice effects average about
0.6 SD on the performance scales of the WISC-R (Cronbach, 1990, p. 277).
Nor do modern advocates of Galton's methods of measuring Spearman's "g" seem to have
read the extensive prologue to Spearman's (1904a) paper in which he discusses in considerable
detail the effects of differential practice on such measures, such as Binet's studies with reaction
time (RT) and other tasks, or Seashore's studies with pitch and loudness discrimination. Binet
concluded that the relationship between such measures and intelligence is strongest on the first
trial, and diminishes with practice. Many years later, Fleishman and Hempel (1954) and Acker-
man (1988) offered elaborations of this theme. In a related vein, Carroll (1987) argues that the
correlation between the variability of RT's and g in Jensen's (1982) studies reflects not noise in
neural conductivity, but variations in attention. Binet (1903) offered the same hypothesis to explain
why children's best RTs compare favorably with those of adults, even though their mean RT is
much longer.
The important point, however, is not that hypotheses recently advanced have been advanced
before. Rather, it is that the corpus of good writing and research by early developers of intel-
ligence tests is not only much larger, but much more variegated than our simple summaries sug-
gest. Binet was maddeningly eclectic, especially to those like Spearman who were trained to
build coherent psychological theories: "It would seem as if, in thus inconstantly flitting hither
and thither, Binet can nowhere find a theoretical perch satisfactory for a moment even to himself"
(Spearman, 1923, p. 10). Spearman also had wider views than the standard account of the two-
factor theory of intelligence suggests. His classic 1904 paper "'General intelligence,' objectively
determined and measured" alone runs 92 journal pages! And Thorndike was even more prolific.
Even his grandson, Robert M. Thorndike, was surprised by the breadth of his grandfather's work
on human intelligence after he had time to immerse himself in it (see Thorndike & Lohman,
1990, p. v).
Thus, it is not only our individual memories that distort, simplify, and forget. Our collective
memory - - as captured in the brief summaries of past errors and accomplishments litanized in
the introductions of countless articles and chapters - - reinforces key-hole views of a past that is
much more complex and untidy than we would have it. More is at stake here than constructing a
better historical legacy. As he completed his massive review of factor-analytic studies of human
abilities, Carroll (1989) was struck by the non-cumulativeness of the enterprise. Important les-
sons learned at one time seemed to be lost on a new generation. Why? Perhaps we have been
lulled into believing in the metaphor of scientific progress. If the march is ever upward, then the
value of research varies inversely with its recency. Another possibility is the increasing dominance
of statistical methods over psychological investigation. But that is another part of the story. Here
I can do no more than to encourage readers to read - - and if they have already read - - to re-read
classic and obscure papers. If nothing else, the papers are generally easy - - even fun to read.
Argument and data took precedence over statistics, which were often crude or absent entirely.
Indeed, with a modern personal computer and a good statistics package, one can profitably rework
old data, as Carroll (1993) has shown us.
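
As a concrete illustration of what such a reworking can look like today, the sketch below runs a principal-axis-style summary of a small correlation matrix. The matrix, the test labels, and the use of Python with NumPy are hypothetical choices made only for this illustration; they are not drawn from Carroll's (1993) survey or from any historical dataset.

```python
# A minimal, hypothetical sketch of reworking an old correlation matrix on a
# personal computer, in the spirit of Carroll's (1993) reanalyses.
import numpy as np

# Hypothetical correlations among four mental tests
# (verbal, numerical, spatial, memory), invented for illustration.
R = np.array([
    [1.00, 0.55, 0.40, 0.35],
    [0.55, 1.00, 0.45, 0.30],
    [0.40, 0.45, 1.00, 0.25],
    [0.35, 0.30, 0.25, 1.00],
])

# Principal-axis-style summary: eigendecomposition of the correlation matrix.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]        # largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Loadings on the first principal factor. The eigenvector's sign is arbitrary,
# so flip it if needed to keep loadings positive for a positive manifold.
loadings = eigenvectors[:, 0] * np.sqrt(eigenvalues[0])
if loadings.sum() < 0:
    loadings = -loadings

print("Eigenvalues:           ", np.round(eigenvalues, 2))
print("First-factor loadings: ", np.round(loadings, 2))
```

A first eigenvalue that dwarfs the rest is the familiar statistical footprint of a general factor, though, as the remaining themes argue, such patterns by themselves say little about the psychology of the tasks.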
Finally, although brilliant and creative and sometimes maddeningly inconsistent, the develop-
ers of intelligence tests were firmly rooted in the academic, social, and political climate of their
time. Because of this, and because of the enormous energy most expended on their psychologi-
cal research, non-specialists who took a broader view of society often saw the dangers and hid-
den assumptions of tests more clearly than did their developers.

[I am] impressed by the discovery that more than once [in the history of intelligence testing] nonspecialists,
like Walter Lippman ... seem to have had a better grasp of the real issues involved, in spite of
misunderstandings of technical details, than the scientists themselves. War may be too important to be left
to the generals. (Samelson, 1979, p. 141)

Which brings us to the next theme.

Theme II: The Role of Personal Beliefs

There are those who believe that the history of science - - particularly of the social sciences - -
cannot be understood without some insight into the social and psychological factors that affect
scientists. For example, in his controversial critique of intelligence testing, Gould (1981) claims:

My message is not that biological determinists were bad scientists or even that they were always wrong.
Rather, I believe that science must be understood as a social phenomenon, a gutsy human enterprise, not
the work of robots programmed to collect pure information.* (p. 21)

In the same way, but with a less muckraking tone, Fancher (1985) claims that the story of intel-
ligence testing comes into focus only when one begins to understand the impact of personal
experience and belief on the theories publicly advocated. He begins with the disparate childhood
experiences of John Stuart Mill, who was educated by his father and shielded from knowledge of
his precocity, and Francis Galton, who early in life learned that his achievements were unusual
and who in his public education constantly compared himself to others. Mill later championed
the environmentalist perspective and argued that one should resort to biological explanations of
individual and group differences only after all reasonable environmental explanations had been

*James Watson (1980) makes a similar point in his account of the discovery of the double helix: "Science seldom
proceeds in the straightforward manner imagined by outsiders. Instead, its steps forward (and sometimes backward) are
often very human events in which personalities and cultural traditions play major roles" (p. xi).

explored and refuted. Galton, on the other hand, saw individual differences as largely genetic in
origin and eugenics as the path to improvement. These two perspectives on the origin of individual
differences in intelligence can be seen to one extent or another in the personal beliefs of many of
the protagonists in the story of intelligence testing, such as Alfred Binet and Charles Spearman.
Both Binet and Spearman came to psychology after they had finished their university work.
Spearman was an officer in the Royal Engineers; Binet a lawyer who had never practiced law
and who had dropped out of medical school. Spearman later called the decision to join the army
"the mistake of my life" (Spearman, 1930, p. 300). Binet was even more sarcastic about his early
decision to study law, calling it "the career of men who have not yet chosen a vocation" (quoted
in Wolf, 1973, p. 3). Yet their reactions to Mill's associationistic psychology were quite opposite.
Binet called Mill his only master in psychology (Wolf, 1973). Not so for Spearman. He would
later recall that "my initial reaction to [the argument of Mill and other associationists that experi-
ence was the foundation of all knowledge] was intensely negative . . . . My conviction was
accompanied by an emotional heat which cannot . . . be explained on purely intellectual grounds"
(Spearman, 1930, p. 301). Spearman's reaction is not uncommon, although his report is. Scientists
routinely harbor the belief that somehow their views should be justified on a purely rational basis.
Affect is considered a corrupter of cognition; theorists should strive to rise above such distrac-
tors.
Damasio (1994) takes a different view. His investigations of patients with damage to the ven-
tromedial sector of the prefrontal cortex show the positive contributions of affect to rational cogni-
tion, particularly the ability to solve ill-structured problems.
This is not a claim that rational processes are not important; feelings do not solve a problem.

At their best, feelings point us in the proper direction, take us to the appropriate place in a decision-
making space, where we may put the instruments of logic to good use. We are faced by uncertainty when
we have to make a moral judgment, decide on the course of a personal relationship, choose some means
to prevent our being penniless in old age, or plan for the life that lies ahead. (Damasio, 1994, p. xiii)

A good theory invariably reaches beyond the information given; through it the theorist attempts
to impose a new order on what frequently is an ill-structured problem. At the very least, then, the
creation (and, for later readers, acceptance) of a theory is influenced by affect. In the extreme, the
solution may merely provide rational justification for an emotional reaction to an event. More
commonly, affective reactions and the beliefs they entail color the way we interpret ambiguous
data.
Mill, Galton, Binet, and Spearman are by no means the only examples of this phenomenon.
Although exceptions are noteworthy, the rule is that a theorist's earliest pronouncement about the
relative importance of nature or nurture in intelligence differs little if at all from the one made at
the end of a career. For example, in his first published article "Experimental tests of general
intelligence" Burt (1909) concluded that because the thirteen upperclass boys in his study
outperformed the thirty lowerclass boys on tests he thought unaffected by practice, intelligence
must be inherited "to a degree which few psychologists have hitherto legitimately ventured to
maintain" (p. 176). By 1911, Burt had defined intelligence as "allround innate mental efficiency"
(quoted in Hearnshaw, 1979, p. 49), a view to which he adhered throughout his career. "It was for
[Burt] almost an article of faith, which he was prepared to defend against all opposition, rather
than a tentative hypothesis to be refuted, if possible, by empirical tests" (Hearnshaw, 1979, p. 49).
Terman (1906) showed his hand in his dissertation "Genius and stupidity: A study of some of
the intellectual processes of seven 'bright' and seven 'stupid' boys." Near the end of the disserta-
tion, he speculated: "While offering little positive data on the subject, the study has strengthened
my impression of the relatively greater importance of endowment over training, as a determinant
of an individual's intellectual rank among his fellows" (p. 372, italics original). Once again, experience
seemed not to alter these early beliefs, as later clashes with Lippmann and the Iowa group showed.
Exceptions are noteworthy. Brigham (1930) publicly retracted his early, hereditarian interpreta-
tion of ethnic differences in intelligence based on the U.S. Army data from World War I. However,
the retraction came because the data would not support the conclusions, not because the
conclusions themselves had changed (Cronbach, 1975).

Political and Social Climate

Psychologists (including this one!) are not historians. The tales we tell each other about the
origins and development of mental testing are often remarkable for their failure to consider larger
political and social influences. With increasing regularity, we acknowledge the impact of the
broader culture on cognition and cognitive development. For example, Bronfenbrenner (1979)
argues that abilities develop through a child's interactions not only with her immediate social
environment but also with the attitudes and ideologies of the broader culture in which she lives.
Cultural relativism, long relegated to the extreme left-wing of psychometrics, is now afforded a
respected place at the table (e.g., Irvine & Berry, 1988; Laboratory of Comparative Human Cogni-
tion, 1982) and a central role in at least one major theory of intelligence (e.g., Sternberg, 1985).
Yet the stories told about the development of theories within the discipline are often remarkably
devoid of such influences. In the preface of their account of intelligence testing in Britain, Evans
(a psychologist) and Waites (a historian) note:

Most histories of psychology have been written by professional psychologists with a strong commitment
to the body of belief accepted as achieved knowledge within their profession. . . . Such histories are not
necessarily uncritical, but the critical point of view is very restricted. Past error appears merely as a series
of hurdles successfully overcome on the road to current theory. (Evans & Waites, 1981, p. vii)

Modern conceptions of intelligence were birthed in the second half of the 19th century. It is
therefore impossible to understand either the theories that describe the construct or the tests
developed to measure it without some understanding of the political and social ideology of the
time. Identifying the starting point, however, is much like stepping into a stream at a particular
point - - perhaps where it rounds a bend - - and declaring "We will take this as the beginning."
Such fictions clearly mislead, but one must begin somewhere. Herbert Spencer provides such a
convenient starting point.

Social Darwinism

Spencer advocated a theory of evolution before Darwin (1859) published his Origin of Spe-
cies. Spencer's argument, based as it was on philosophical, anthropological, and geological
speculation, was largely ignored; Darwin's biologically-based argument was not. Spencer soon
allied himself with Darwin's theory and sought to apply the theory to the full range of human
knowledge and endeavor. The effort resulted in the ten volumes of his Synthetic Philosophy (which
included The Principles of Psychology).

Spencer saw evolution as the key to all science. Evolution, he said, proceeds from incoherent
homogeneity, as is found in the lowly protozoa, to coherent heterogeneity, as is found in humans
and the higher animals. More importantly, heterogeneity increased unidimensionally. In this belief
both Spencer and Darwin followed the lead of Locke and Leibniz in viewing all life as falling
along a continuous, unbroken scale. Leibniz said it most succinctly: Natura non facit saltum
("Nature does not make jumps"). For Spencer (1897), the tic marks on this scale marked increases
in intelligence: "Gradually differentiated from the lower order of changes constituting bodily
life, this higher order of changes constituting mental life assumes a decidedly serial arrangement
in proportion as intelligence advances" (p. 406).
The idea of a serial order - - not only between species but within humankind - - was brought
about by an unfettered competition among individuals. "Survival of the fittest" is his phrase, not
Darwin's. If species evolved by becoming more intelligent, then to help humankind become more
intelligent was to further the work of evolution. Yet Spencer held firmly to a Lamarckian view of
the heritability of acquired characteristics. His theory was thus optimistic about the value of
education. If parents transmitted to their children through their genes some remnant of their own
education, then problems of social degeneracy could be solved in a few generations. Those who
dismissed Lamarck, however, saw the answer in eugenics rather than in education.
Galton was the chief exponent of this more structured view of human inequality. Like other
liberals of his day, Galton advocated a meritocracy in which income and position within the
social hierarchy would be based on innate ability rather than on parental social status. In Hereditary
Genius (1869), he argued that mental characteristics were inherited in the same manner as physi-
cal characteristics. What he lacked, though, was a way to measure innate ability so that individu-
als could be properly placed in the hierarchy. Tests could not fulfill this function unless they truly
measured innate ability, and on a single dimension. Those convinced of the need for eugenics
were thus more willing than they otherwise might have been to believe assertions that the new
intelligence tests (which, parenthetically, confirmed their natural superiority) measured innate,
general intelligence.
Advancing the cause of the able was only a part, and for many, a lesser part of the problem.
The specter that most haunted American and European intellectuals during this period was the
prospect of the degeneration of humankind and the subsequent collapse of society. "Indeed, 'degenera-
tion' was arguably the most potent concept in the medical and biological sciences of the period"
(Wooldridge, 1994, p. 20). In Britain, politicians and military leaders blamed defeat in the Boer
War on the physical unfitness of the masses. In Italy, Lombroso warned about the return of the
primitive, especially "natural criminals" (cf. recent discussions of a permanent "underclass" in
the U.S. in Herrnstein & Murray, 1994). In Germany, fear of racial degeneration and the quest for
racial purity found their way into medicine, the social sciences, and political thought. In France,
the medical/psychiatric concept of dégénérescence pervaded political debates about national
decline.

From Jaurès to Maurras, political discourse was obsessed by the question of national defeat and the ensu-
ing chaos; ... alcoholism, habitual crime, and depravity were cast through the image of a social organ-
ism whose capacity for regeneration was in question. National defeat, degeneration and social pathology
appeared to be caught up in an endless reciprocal exchange (Pick, 1989, p. 98).

A declining birthrate in France and stagnant birthrates in Germany and Great Britain seemed to
confirm fears that degeneration had set in. Governments took note because military hegemony
was in peril for a nation that could not raise armies larger than its rivals, especially from a popula-
tion of "stunted, anemic, demoralized slum dwellers" (Wooldridge, 1994, p. 22).

The novels of Zola and the plays of Eugene Brieux, whose Damaged Goods inspired English-
speaking authors such as Bernard Shaw and Upton Sinclair, kept these issues squarely before the
public. The proximal causes of degeneration were (a) the higher birth rate among the lower classes
and "races", especially the teeming masses of urban poor, and (b) the movement of"races" outside
of their natural ecologies. Thus, an African

placed outside of his 'proper' place in nature - - too stimulating an intellectual or social environment, or
in a climate unsuited to his 'tropical' nature - - could undergo a further 'degeneration,' causing the appear-
ance of atavistic or evolutionarily even more primitive behaviors and physical structures (Stephan, 1985,
p. 98)

Similarly, those of the white race who lived in warmer climates risked becoming diseased and
anemic, sexually promiscuous, and culturally backward. Colonizers needed to make all efforts to
maintain the home environment, to avoid unnecessary contact with the native populations, and to
return home whenever possible to restore their "type" and repair degeneracy acquired abroad.
In America, emancipation of the slaves gave new urgency to the question of the boundaries
between whites and blacks. Blacks who attempted to move away from the warmer southern states
or to advance beyond their natural condition of servitude would degenerate and eventually die
out. The statistician and economist Hoffman argued that blacks were "a race on the road to extinc-
tion," and thus whites - - who thrived in the political and geographical climate of America - -
need have no fears of a newly freed black race (see Stephan, 1985).
Racial mixing also resulted in degeneration. If nature separated humans into ranks, then
mixing them was not only unnatural, but an invitation to atavism. "Unnatural unions" produced
a hybrid that was inferior to either parent. In America, and later in Germany, the doctrine of
Rassenhygiene (race hygiene) became the watchword for a group that looked back to a braver,
purer Teutonic past as much as it looked forward to a eugenically improved race.
Thus, the specter of a complete collapse of society haunted many American and European
intellectuals during this period. Evidence of the social, moral, and physical decay of humanity
seemed irrefutable. In the U.S., many feared that the massive immigration of Europe's poor and
degenerate was nothing short of complete folly. Nevertheless, the American and European agendas
differed. Whereas European intellectuals (particularly in England) continued to struggle against
inherited privilege, Americans in- and outside of academia were more concerned with the social
upheavals threatened by immigration and the explosive growth of mass education. The first led
to ready acceptance of eugenic proposals, and the second to a demand for more efficient ways to conduct the
business of education. Both of these agendas conflicted with the principle of equality etched in
Jefferson's Declaration of Independence, and thus set the stage for later debate about whether
tests that showed differences between ethnic and social classes were, ipso facto, biased. Even
more important, Americans then and now seemed less uniformly committed to the notion of a
single rank order than their British cousins.* Thus, E. L. Thorndike found no conflict in, on the one
hand, advocating a genetic basis for human intelligence while, on the other hand, arguing for the
existence of several intellectual abilities rather than a single, general factor.

*French observers of 19th century English society hypothesized that, from birth, the English seemed afflicted with la
mentalité hiérarchique (Tawney, 1952, p. 23).

Educational reforms

Intelligence testing became a part of the British educational system because it advanced a
meritocratic agenda; it became part of the American system because it helped solve practical
problems in an educational system overrun with pupils. Schooling was expanding exponentially.
From 1890 to 1918 the population of the U.S. increased 68%, while high school attendance dur-
ing this period increased 711%. On average, more than one new high school was built every day
during this period (Tyack, 1974, p. 183). However, school curricula still defined a Procrustean
bed in which "the wits of the slow student were unduly stretched and . . . of the quick pupils
amputated" (Tyack, p. 202). Many started, but few finished. Ayres (1909) showed that the number
of students in each grade dropped precipitously between first and eighth grade. "The general
tendency of American cities is to carry all of their children through the fifth grade, to take one
half of them to the eighth grade, and one in ten through high school" (Ayres, 1909, p. 4). And of
those who remained in school, many in the early grades were "laggards" who had been held
back. The culprit was thought to be the failure of the system to adapt itself to the intellectual
abilities of its students.
A variety of methods for classifying students or adapting the pace of instruction had been used
in American schools for many years (Chapman, 1988). But the intelligence test was heralded as
a more scientific and efficient method of performing the task. Tests then as now provided a means
for sorting and classifying people that was ostensibly objective and fair. And what could be fairer
than an educational system that was adapted to the natural ability levels of its students? Intel-
ligence tests helped administrators in a school system newly infatuated with a corporate model
of centralization, bureaucratization, and efficiency, perform and defend the sorting functions they
were asked to perform. Test publishers thus found a ready market for their products. Terman
estimated that probably a million children were given group intelligence tests in 1919-1920, and
two million the next. Even Walter Lippmann (1922, in Block & Dworkin, 1976), who is now
remembered by psychologists chiefly for his debates with Terman about intelligence testing,
applauded the goal.

Nativism

The practical use of tests, however, has always followed a path largely independent of theoreti-
cal debates about the nature of intelligence. In America (as in England) social darwinism was the
dominant view. In fact, "Spencer's writings were so dominant in discussions of society and politics
that they virtually sank into the unconscious of American political deliberation, ceased to be an
argument, became obvious, and became common sense" (White, 1977, pp. 36-37). Mass immigra-
tion of peoples from southern and eastern Europe (on the Atlantic coast), and from China, Japan,
and the Philippines (on the Pacific coast) led American intellectuals to emphasize the racial or
ethnic differences in ability more than Europeans, who were more concerned with social stratifica-
tion within their societies.
The new immigrants were, on the East Coast, poorer and more Roman Catholic than before.
More important, they did not as readily assimilate into the existing order, but instead insisted on
keeping their own language and customs. Roman Catholics set up their own schools and achieved
political control in some cities of the Northeast, thereby reviving Protestant fears of Papal influ-
ence.
These fears were given wide currency in Madison Grant's (1916) The Passing of the Great
Race. One obvious implication was to exclude the inferior races, an end finally achieved in the
restrictive immigration laws of the 1920s, but only after nativist sentiments were augmented by
a post-war isolationism.
But what to do about those already admitted? Ellwood Cubberley, Terman's dean at Stanford,
echoed the feeling of many:

These southern and eastern Europeans are of a very different type from the north Europeans who preceded
them. Illiterate, docile, lacking in self-reliance and initiative, and not possessing the Anglo-Teutonic concep-
tions of law, order, and government, their coming has served to dilute tremendously our national stock,
and to corrupt our civic life. . . . Everywhere these people tend to settle in groups and settlements, and to
set up here their national manners, customs, and observances. Our task is to break up these groups or set-
tlements, to assimilate and amalgamate these people as a part of our American race, and to implant in their
children, so far as can be done, the Anglo-Saxon conception of righteousness, law and order, and popular
government, and to awaken in them a reverence for our democratic institutions and for those things in our
national life which we as a people hold to be of abiding worth. (Cubberley, 1909, pp. 15-16)*

Eugenic proposals

If the survival of a democracy depends on the ability of its citizens to make intelligent deci-
sions, and if intelligence is innate and individuals and groups differ in intellectual competence,
then it was the moral duty of those who would improve (or at least not openly contribute to the
degeneration of) humankind to restrict immigration of such peoples into the country and to stem
their proliferation within society. Alarmist (and later, sober) reports of the dysgenic effects of the
higher birthrate among the poor and the less intelligent led to calls for sterilization of the retarded
and restrictions on immigration (see, e.g., Cattell, 1940).
Although psychologists certainly contributed to the discussion of eugenic proposals, they were
not the only or even the most important voices. Contrary to popular opinion, neither psycholo-
gists nor their army testing data exerted much influence on the restrictive U.S. Immigration Law
of 1924.

[The immigration law of 1924] was the culmination of efforts begun in the 1890s and supported by a far-
flung coalition of forces from the Immigration Restriction League all the way to the American Federation
of Labor. Much of the power of this movement was based on economic issues. . . . It was predominantly
the biological argument of the eugenicists and racists under the leadership of Madison Grant and C. B.
Davenport that produced the scientific ... legitimation [for this legislation]. (Samelson, 1979, pp. 135-
136)

Eugenics was part of a larger Zeitgeist that had at its core a belief in the improvement of humankind
through the application of the scientific method to the study of people and their institutions.
Interventions ranged from the child study movement and the enactment of child labor laws, to
the application of corporate methods to education, to time-and-motion studies of industrial produc-
tion, to the enactment of eugenic proposals into sterilization laws. It was at root, though, a reac-
tion to a widespread conviction that evolution was going in reverse, that all that was good and
noble and worth saving was slowly, inexorably sinking back into violence and degradation of an
earlier level in the evolution of humankind. It was more reactionary than visionary, more backward
than forward looking, more a product of fear than of hope.

*The trail does not stop here. Stanford's President David Starr Jordan was a well-known biologist and leader in the
eugenics movement.

The religious context

The ground had been prepared for the seed of intelligence testing by an even larger and earlier
cultural movement: the Reformation. Salvation, Luther said, was not to be achieved through good
works but through grace. Those thus saved were the new chosen people, an analogy taken quite
literally in Calvinist Holland (Schama, 1987). If some were elected for salvation, then others
were predestined for damnation - - at least in Calvinist and Puritanical writings.* It is now
acknowledged that these beliefs influenced an astonishing array of other - - often distant - - aspects
of the social, political, and economic activity of these peoples and the cultures they influenced.
For example, Weber (1904/1958) argued that, paradoxically, the belief in predestination fueled
the economic enterprise of ascetic Protestant sects (such as the Puritans). Schama (1987) claimed
that the sense of self-legitimation as a "chosen" people of the sort that pervaded Dutch culture
during the seventeenth and eighteenth centuries also "helps account for the nationalist intransigence
of . . . the Boer trekkers of the South African Veldt, the godly settlers of the early American
frontier, even the agrarian pioneers of Zionist Palestine" (p. 35). If so, then in America, nativism
and manifest destiny were not far behind.
More important, the belief that some are predestined for salvation and others for damnation is
not only compatible with the belief that only some are chosen intellectually, but more significantly
that such gifts might properly be used as an arbiter of individual merit and worth. "We are comfort-
able with the idea that some things are better than others," proclaimed Hernstein and Murray
(1994, p. 534). But they are also comfortable with the fact that some people are better than oth-
ers, that the best measure of better is IQ, and that a meritocracy based on intelligence is "what
America is all about" (p. 512). Perhaps it is, and perhaps that is why the study of individual dif-
ferences in general and of intelligence in particular has been more popular a topic in countries
where Calvinism and Puritanism once flourished than in countries where such beliefs never
attained a significant following.

The meritocracy

In education (Tyack, 1974), mental testing (Cronbach, 1975), and politics (Gardner, 1961) it is
one of the strange ironies of history that reformers often misjudge the consequences of their
reforms. Those who sought to replace a society artificially stratified by hereditary privilege
with one based on merit did not foresee that a meritocracy would create new problems, or that
judgments of merit would lead to new inequalities. Who would have guessed that the tests which
were heralded in one generation for opening the doors of higher education to all would, in the
next, be attacked by some as artificial gatekeepers?
Tests do predict outcomes, and they are surely a fairer way of allocating scarce resources than
any other alternative yet devised. But they see through the glass only dimly. Popular and profes-
sional misconceptions that tests measure (or presently might measure) ability uncontaminated by
culture or social class or motivation continue to plague interpretations of test scores, as the recent

*It is useful to distinguish the simple predestination of Paul (some are predestined for salvation) from the
double predestination of Augustine, Luther, and Calvin (some are predestined for salvation, others for damnation). The
latter view is more congenial with eugenic proposals to eliminate the unfit. However, Luther's emphasis on the equality
of all men before God had a more profound impact on the societies that followed his lead than did Calvin's.

debate on affirmative action in the U.S. shows. Few understand complexities such as the social
class bias in schooling itself (Davis, 1949), the verbal-educational bias in many criterion measures
of job performance, particularly test scores or supervisor ratings gathered immediately after train-
ing (Frederiksen, 1981), or how the generally weak relationships between predictor and criterion
in personnel selection mitigate arguments against proportional within-group hiring (Cronbach
& Schaeffer, 1981). Selecting those most likely to succeed in the current system helps perpetu-
ate that system. This may be good, but it is no small irony that the same tests which help liberate
talent also help conserve the system. Renewed arguments for the use of measures of general
ability in personnel selection (Schmidt, Hunter, & Pearlman, 1981) are particularly important
since a "single-rank-order selection is only a shade less conservative than the aristocratic selec-
tion it replaced, since to a significant degree it also perpetuates advantage of birth" (Cronbach &
Snow, 1977, p. 8).
The yardstick by which merit is measured reflects the needs and demands of society at a
particular point in time. If exquisite penmanship is your forte, then you have missed your century.
But even a valued competency may go undeveloped in those who are unwilling or unable to
compete. Competitive social structures deemed most fair by those who promote meritocratic selec-
tion do not develop valued excellencies in those who prefer more cooperative structures.
The meritocratic reformers saw less cooperation and more competition in nature and in society.
Early Calvinist dichotomies of elect and damned gave way to the unbroken evolutionary scale of
Spencer. This simple reading of evolutionary theory seemed to offer an absolute definition of
potential for merit. Those low on the scale of intelligence were no longer indispensable parts of
the body politic, but at best "democracy's ballast, not always useless but always a potential
liability" (Terman, 1922, p. 658).
Tests emphasize, even magnify human differences. Those who trade in the currency of
individual differences thus rarely consider questions of human equality. "Clever men," says Taw-
ney (1952) "are impressed by their difference from their fellows; wise men are conscious of their
resemblance to them" (p. 81). Equality takes several forms. Most commonly it refers to political
equality, which renders all equal before the law and guarantees all an equal right to participate in
and influence government, and to social equality, which negates social class distinctions. It was
precisely this sort of political and social equality that most impressed the young De Tocqueville
(1945) about the America he visited in the 1830s.
But such equalitarian attitudes, even if they were not as widespread as De Tocqueville believed,
have surely shrunk. Herrnstein and Murray (1994) claim that the culprit is the confluence of abil-
ity and wealth in U.S. society that has occurred in the past 30 years. Kaus (1992) claims that
economic disparity is but one symptom of the loss of a more fundamental equality of worth,
or what Gardner (1961) calls equality of respect. In its most exalted, usually religious form, this
meant that all humans were not simply equal but equally precious, and thus equally worthy of
respect no matter what their external circumstances. It was this type of equality - - based as it
was on a qualitative difference between man and other animals - - that Darwin's theory seemed
to dispel. If humans were not equally fit for survival, then differences in accomplishment and
thus in social status need not be as capricious as egalitarian reformers believed.

Philosophy of Science

The construct of intelligence as innate ability was firmly rooted in the Zeitgeist of the period
during which the first tests were developed. But scientists also had beliefs about the scientific
enterprise. Indeed, beliefs about how knowledge is acquired and how conflicts among competing
explanations are resolved form core assumptions of methods of inquiry of a discipline at a
particular point in time. Collectively, these methods define (or assume) a particular philosophy
of science. The question of how competing claims are arbitrated is a somewhat narrower issue of
epistemological values, and will be discussed as such.

Positivism versus realism

Logical positivism was the dominant philosophy of science during the late 19th century when
the foundations of modern psychology were laid (Koch, 1959). Positivism is often distinguished
from an older view of science, realism. Proponents of realism hold that the methods of science
allow direct access to reality. Scientific explanations, in this view, describe the world as it really
is. Positivism is a more moderate position. According to this view, scientists form models or
theories of the world based on observed regularities. Although constructs and the laws that relate
them usefully explain these regularities, they are not necessarily real. However, as Slife and Wil-
liams (1995) note, it is difficult for positivists not to take the next conceptual step and begin to
believe that the constructs formed to explain regularities in the world (e.g., gravity or intel-
ligence) are in some sense real.
From Spearman to the present, those who report factor analyses of correlations among tests
have routinely slipped from careful statements about factors representing convenient "patterns of
covariation" or "functional unities" to entities that exist in some concrete fashion in the brains of
those who responded to the tests. Cognitive psychologists are even less careful about reifying
their constructs. Those steeped in the information-processing metaphor who have enjoined the
debate about the meaning of intelligence have invoked limitations in the capacity of working
memory, or speed of information processing to explain observed differences in performance. But
"working memory" is a construct, not a thing; and information-free mental processes are no more
than convenient fictions.
Like the earlier realists, positivists put great emphasis on observation. It is hard to read Wolf's
(1973) biography of Binet or Joncich's (1968) biography of E. L. Thorndike (appropriately titled
The Sane Positivist) without feeling some of the enthusiasm both felt for observation and
experimentation. Application of the methods of science to human behavior promised to ameliorate
many social ills. There was no worry about the extent to which theory contaminated or shaped
their observations, since, in their worldview, facts existed independent of and prior to theory. Nor
was there a concern whether the methods of science, rather than providing an avenue for the
discovery of truth, might actually presume a certain set of beliefs about the world.
The logical positivism of the turn of the century has been replaced by a less comforting set of
philosophies. One of the more important contributions was made by Popper (1963) when he
pointed out the logical asymmetry of proof and disproof. A thousand confirming instances do
not prove the statement "All swans are white," but one instance of a black swan disconfirms the
statement. The implication is that theories cannot be proven correct, only disproven. But Pop-
per's science was still evolutionary; progress came in a thousand small attempted refutations.
Kuhn (1970), on the other hand, argued that progress in science was more often discontinuous.
A new paradigm would revolutionize thinking in a domain. An even more extreme view is taken
by social constructivists (scientific "constructs" are simply shared understandings within a
particular community in a particular culture at a particular point in time) and their postmodernist
allies (there is no way to secure knowledge of a universal and objective reality; rather, knowledge
is contextual and constructed through social and linguistic convention).
Social scientists trained in the first half of this century seemed often to believe their task was
to put forth a good theory and then defend it against all attacks. Those trained after Popper and
Kuhn are more likely to see their task differently. For example, Anderson (1983), in the preface
to a book describing a theory of cognition, announced his intention to "break" the theory. The
increasingly widespread acceptance of the notion of intelligence as a cultural construct is grounded
in an even more constructivist philosophy.

Epistemological issues

A central epistemological issue for scientists is how to choose among competing explanations
or theories: Should it be parsimony? utility? meaningfulness? perceived truth value? Test users
have generally opted for utility. In the U.S., psychological meaningfulness prevailed over
parsimony in the theoretical debate at least until the 1960s. Then parsimony reasserted its chal-
lenge. This epistemological struggle underpins the most enduring debate about intelligence: is it
one or many things? The controversy has a long history, and promises to have an even longer
one. In large measure this is because the debate is not only about evidence but also about value,
such as whether parsimony, utility, or psychological meaningfulness should be given priority.
Hierarchical theories offer a compromise, but, as Vernon (1973) pointed out, may better meet
statistical than psychological criteria.
By emphasizing parsimony over psychological meaningfulness, such theories have enhanced
the status of broad factors such as fluid intelligence (Gf), crystallized intelligence (Gc), and spatial
visualization (Gv), and diminished the status of narrower factors, such as most of Thurstone's
(1938) primary abilities, and all of Guilford's factors. This may or may not be a good thing.
Certainly there is less tendency to attribute effects to special ability constructs that could more
parsimoniously be attributed to general ability. However parsimony is only one of several criteria
that may be used to arbitrate such decisions. Psychological meaningfulness is perhaps equally
important, but has been given less weight of late.
Indeed, one could argue that psychological clarity declines as factor breadth increases. In other
words, the broadest individual difference dimension - - although practically the most useful - -
is also psychologically the most obscure. There has never been the sort of controversy over the
meaning of factors such as verbal fluency or spatial ability that routinely attends discussion of G
(see Lohman & Rocklin, 1995). On the other hand, tests of narrower abilities have never fared as
well as tests of broader abilities when utility was the criterion. It is unlikely that new tests will
fare better, in spite of the fact that they are more firmly grounded in theory than many of the
older classics. Nevertheless, newer tests (such as the Woodcock-Johnson - - Revised) are a boon
for researchers, and may someday show utility as aptitude variables that interact with instructional
or other treatment variables.

Theme III: Psychometrics Versus Psychology

Tests must meet multiple, often conflicting standards of excellence. Standards that are clear
and quantifiable tend to be enforced more rigorously than standards that are vague and not quantifi-
able. From a handful of simple assumptions about how latent scores and errors combine to produce
observed scores (Spearman, 1904b), an elegant, elaborate, and complex psychometric theory has
been developed over the years. Those who would master this theory - - and ancillary develop-
ments in multivariate statistics (particularly factor analysis) and scaling - - must devote many
years to its study. Few have time to develop the level of sophistication in psychological theory
that Spearman, Thorndike, or Thurstone also achieved. And even if psychometric and statistical
theory took no longer to master now than it did at the turn of the century, psychological research
and theory have grown exponentially in the interim. There is thus an increasing conflict between
the psychometrics of intelligence tests and the psychology of human intelligence.
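
To make the contrast concrete, those "simple assumptions" can be stated in the notation of modern classical test theory; the formulation below is the standard textbook one rather than Spearman's own 1904 symbols.

```latex
% Classical true-score model: observed score X is true score T plus error E,
% with the error centered and uncorrelated with the true score.
X = T + E, \qquad \mathbb{E}[E] = 0, \qquad \operatorname{Cov}(T, E) = 0
% The observed-score variance therefore decomposes, and reliability is the
% proportion of observed variance attributable to true scores:
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

Much of reliability theory, and a good deal of the machinery of factor analysis, follows from manipulating these few identities; nothing in them, however, refers to how an examinee actually solves an item, which is precisely the tension this theme describes.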
Psychometricians sometimes hold naive beliefs about learning and cognition (Shepard, 1991),
protestations to the contrary notwithstanding (Cizek, 1993). And experimental psychologists are
often even more uninformed about psychometrics.* Because of this, the topic of this section
could be included as a subheading under the general heading of beliefs about the nature of mental
tests, and thereby treated as a part of the previous section. However, the issues are sufficiently
distinct to warrant separate treatment.
At the simplest level, the conflict is between those who emphasize the statistical properties of
test scores and those who emphasize their psychological meaningfulness. The conflict has actu-
ally been more of a rout, since at every turn those who have advocated statistical theories of
mental tests have won the battle. Inevitably, though, psychological assumptions about the mean-
ing of test scores must be made and defended. As Cronbach (1990) put it, "Sooner or later every
tester has to go behind the experience table and behind the test content, to say what processes
seem to account for the responses observed" (p. 159). By that point, however, the test has been
so molded by psychometric principles that it is often difficult, and sometimes well-nigh impossible,
to wrest psychological meaning from the scores. Explaining why this is so requires a
brief reconsideration of the history of ability measurement.
In a poem that once was committed to memory by most American school children, Robert
Frost tells how, in looking back on his life, "two roads diverged in a wood" early in the journey.
That the poet took one road rather than the other "made all the difference" in the course of his
life. Early in this century, two roads - - two approaches - - for developing intelligence tests
diverged. Few seemed to notice at the time, but with hindsight it is clear that the path taken was
not the only option. One of the few who seemed to notice was E. L. Thorndike:

All scientific measurements of intelligence that we have at present are measures of some product produced
by the person or animal in question, or of the way in which some product is produced. A is rated more
intelligent than B because he produces a better product, essay written, answer found, choice made, comple-
tion supplied or the like, or produced an equally good product in a better way, more quickly or by infer-
ence rather than by rote memory, or by more ingenious use of the material at hand. (Thorndike, Bregman,
Cobb, & Woodyard, 1926, pp. 11-12, emphasis added)

The crucial difference here is between tasks that allow inferences about ability from how many
items are solved and tasks that allow inferences about ability from how items are solved.

*On several occasions when Snow and I were writing a long monograph on implications of cognitive psychology for
educational measurement, Snow remarked how it might be even more fruitful to turn the problem around and write about
the implications of educational measurement for cognitive psychology. It is unlikely, though, that very many cognitive
psychologists would attend to such a document. In attempting to explain why experimental psychologists have generally
ignored the link between psychophysics and mental tests, Guilford (1954) observed that "the experimental psychologist
has been very slow in realizing that he uses mental tests as measuring instruments. He has associated mental tests with
individual differences, failing to recognize that they also measure 'occasion differences' in the same individual" (p. 4).

The first approach is most informative if ability can be described with a trait model in which
individuals vary in their location along a common scale. Number of items solved, or some
transformation of this score, typically defines the scale. Growth - - if it occurs at all - - involves
quantitative rather than qualitative changes. The second approach is more informative if abilities
can be described by stage-like models. Growth involves qualitative changes in how tasks are
solved, not simply whether they can be solved. Although one can construct a continuous scale
from such data, interpretation is not the same at all points on the scale. While the stage approach
is now common among developmental psychologists, it was once a viable approach for the assess-
ment of intelligence.
In one of the better-known early summaries of intelligence testing, Freeman (1926) claimed
that many early attempts to measure intelligence (including the Binet scale of 1905) "did not
emphasize the objective score which the child made so much as his general behavior and the way
in which he went about the tasks which were set [before] him" (p. 108, italics added). Indeed, the
notion of mental age as a continuous variable (as Terman envisioned it) was foreign to Binet's
views. The term Binet used was "niveau intellectuel" or "intellectual level." Furthermore,
children at the same intellectual level might behave in quite different ways. His 1903 book, L'Etude
Experimentale de l'Intelligence (The Experimental Study of Intelligence), "primarily dealt with
qualitative differences in personality or mental functioning" (Fancher, 1985, p. 65), not incremental
improvement along a common scale. Even Wilhelm Stern, who translated Binet's niveau intellectuel
into Intelligenzalter or "mental age," was quick to argue that individuals who exhibited
the same mental age (or "teleological" intelligence) might approach situations quite differently,
thereby exhibiting "phenomenologically" distinct intelligences.
The stage-like view of intellectual development is also evident in the type of task Binet found
most congenial for the measurement of intelligence. "We place [this task] above all others, and
if we were obliged to retain only one, we should not hesitate to select this one" (Binet & Simon,
1908/1916, p. 189). What was the task? The child was shown three pictures, one at a time, and
simply asked "What is this?" or "Tell me what you see here." Three types of responses were
distinguished: (a) an enumeration response ("a man, a cart . . . . ") (b) a descriptive response ("There
is an old man and a little boy pulling a cart."), and (c) an interpretive response ("There is a poor
man moving his household goods.") Note that all children are shown the same stimuli; intel-
lectual level is inferred from the type of response given rather than from the number of pictures
correctly identified.
This was the sort of task favored by Piaget (who studied in the laboratory that Binet founded)
and other developmental psychologists. The theories of intelligence that emerged from reflec-
tions on children's responses to such tasks were rich in process and description, but generally
ignored the psychometric properties of tests most emphasized by those who used tests to rank order
individuals. Put another way, the concern of the developmentalist was for understanding what
intelligence is and how it develops rather than for identifying more and less intelligent individu-
als. Those concerned with the latter issue understandably preferred tests in which the examiner
(or scorer) merely had to judge whether examinees gave (or chose) a keyed response. Judgments
about process are much more difficult to make and to defend.
But this was not simply an issue of the reliability of examiner judgments. The two approaches
make quite different assumptions about the nature of the measurement scale. Those who believed
that mental measurements could be modeled after physical measurements adopted trait models
of ability. Because of this, they were quick to question the meaningfulness of scales that appeared
to be measuring something different at the high end of the scale than at the low end. Of the
several critics of the Binet scale who had raised the issue, it was Robert Yerkes who was the most
vocal. Yerkes and his associates (Yerkes & Anderson, 1915; Yerkes, Bridges, & Hardwick, 1915)
proposed an alternative they called the "point" scale. Although Binet sometimes used the same
task at more than one age, a variety of different tasks were presented at each level. Yerkes argued
that because of this variety there was no guarantee that the same intellectual functions were
required at every age. The Point Scale he proposed consisted of twenty subscales, each contain-
ing items of a particular type ordered by difficulty. This was the format used in the Army Alpha
and Beta (Yerkes chaired the committee which supervised their development) and their succes-
sors (i.e., the Wechsler scales and the homogeneous tests used in factor analytic investigations of
abilities).
The seemingly reasonable assumption that a test should measure the same thing at all levels
led to a rather dramatic shift in the type of test administered. Furthermore, because one item is
much like the next, such tests appear to be psychologically transparent. There is a tendency to
substitute a simple labeling of type of response required by the test for a theory of cognitive
processing. For example, E. L. Thorndike preferred to describe the intelligence measured by his
test by the names of the subtests themselves. The test had four subtasks - - Completion, Arithmetic,
Vocabulary, and Directions - - and Thorndike referred to the construct measured by the test as
"intellect CAVD." Those who followed the factor-analytic route were only slightly less behavior-
istic. Factors were generally labeled after an inspection of the content of tests that loaded on
them. Although examination of the content of tests is a useful first step in understanding process,
it is not a very good nth step.
Two further observations. First, the psychological transparency of ability tests composed of
similar items generally reveals very little about the construct itself even though total scores on
the test may be good measures of the construct. Consider, for example, the difference between
the cognitive psychologist's understanding of the construct "working memory" [as elaborated,
for example, in Baddeley's (1986) book-length monograph] and the differential psychologist's
understanding of the "memory span" factor (or factors). McNemar (1964) saw things clearly
when he observed that there seemed to be no way of even beginning to construct a model of the
former from scores on the latter (see Lohman & Ippel, 1993). Second, even seemingly homogene-
ous tests often elicit different response strategies from different subjects, or for the same subject
on different items (Kyllonen, Lohman, & Woltz, 1984). Such tasks, however, are rarely constructed
in ways that make such variation observable. Thus, important differences about processing strategy
routinely go undetected when using tasks not designed to reveal them.
Thus, the adoption of tests that estimated ability by number of items of a particular type cor-
rectly solved led to the gradual abandonment of tasks that elicited qualitatively different patterns
of responses associated with different levels of mental development. Increasingly sophisticated
statistical analyses were performed on the scores derived from these homogeneous tests.
Psychological theorizing was relegated to debates about the organization of factors in different
models rather than about the nature of intellectual functioning itself. Each new generation of dif-
ferential psychologists was required to spend an ever larger portion of its graduate training master-
ing an ever expanding catalog of statistical methods, under the tacit (and sometimes explicit)
assumption that new methods would provide insights old methods could not even approximate.
This changed the discipline not only by redirecting the efforts of those interested in solving ill-
structured psychological problems to well-structured problems in methodology, but also by attract-
ing to it those more interested in statistical methods than in psychology.
The dominance of methodology over psychology has a subtler aspect. Like a child looking for
things to hit with his new hammer, there is a tendency to find problems that can be explored with
the newly mastered methodology rather than to find or invent methodologies that best address a
psychological issue. Even those who know better find it difficult not to rely on ever refined mul-
tivariate methods to carry the burdens of careful experiment and clear thinking.

Conclusions

What, then, can we learn from the past that would help us in our quest to understand and
measure human intelligence? First and foremost, we can learn that we all see through lenses that
are formed by belief and affect. Those who are convinced that the "real" intelligence is innate, or
conversely, that all differences between us are caused by culture and experience, need to ask why
they find such beliefs congenial. "Follow to its source/Every event in action or in thought" advised
Yeats (1949). Those who do this well invariably find more than rational argument at the root of
their beliefs about intelligence. Second, we need routinely to attend to the larger social and politi-
cal issues raised by our attempts to define and measure intelligence. To label someone or something
as "intelligent" is to make a value judgment. Such judgments have important social and political
- as well as psychological - - consequences. My limited reading in the history of intelligence
-

testing confirms Samelson's (1979) observation that those most closely allied with intelligence
testing were often least able to see these larger issues with much clarity. Perhaps one literally
cannot see the forest for the trees. Thus, we need not only to listen more attentively to those who
have considered the broader currents in the history and sociology of ideas, but actively to seek
their input. Third, we can learn that much of what we are doing has been done before, which
hopefully will enable us to see that new measures of intelligence that are not redundant with the
old must either follow new theory (as in Sternberg's, 1990, attempts to assess "practical" intel-
ligence) or use new methods of assessment.
To come full circle, then, "What can we learn from the history of intelligence testing that
might inform future assessments?" Much, but only if we are willing to explore some unfamiliar
paths. The view from the well-trodden path comforts more and challenges less than it should.
The more difficult question, then, is not whether we can learn, but whether we will learn. Experi-
ence says fundamental change is unlikely; hope says it must be possible. Science is grounded not
only in belief, but in hope. Therefore, I will hope that, as we embark upon the second century of
the scientific study and measurement of human intelligence, we will explore our past more fully,
our motives and beliefs more honestly, our psychological theories and psychometric methods
more critically, and our options for new ways of conceptualizing, measuring, and developing
intelligence more creatively. That is my hope, at least.

References

Ackerman, P. L. (1987). Individual differences in skill learning: An integration of psychometric and information process-
ing perspectives. Psychological Bulletin, 102, 3-27.
Ackerman, P. L. (1988). Determinants of individual differences during skill acquisition: Cognitive abilities and informa-
tion processing. Journal of Experimental Psychology: General, 117, 288-318.
Anderson, J. R. (1983). The architecture of cognition. Cambridge, MA: Harvard University Press.
Ayres, L. P. (1909). Laggards in our schools: A study of retardation and elimination in city school systems. New York:
Charities Publication Committee.
Baddeley, A. D. (1986). Working memory. Oxford, England: Clarendon.
Binet, A. (1903). L'etude experimentale de l'intelligence (The experimental study of intelligence). Paris: Schleicher.
Binet, A., & Simon, T. ( 1916). The developmentof intelligence in the child. In H. Goddard (Ed.) (E. S. Kite, Trans.), The
development ofinteUigence in children (pp. 181-273). Baltimore: Williams & Wilkins. (reprinted from L'Anee Psy-
chologique, 1908, 14, 1-94).
Brigham, C. C. (1930). Intelligence tests of immigrant groups. Psychological Review, 37, 158-165.
Bronfenbrenner, U. (1979). The ecology of human development. Cambridge, MA: Harvard University Press.
Burt, C. (1909). Experimental tests of general intelligence. British Journal of Psychology, 3, 94-177.
Carroll, J. B. (1987). Jensen's mental chronometry: Some comments and questions. In S. Modgil & C. Modgil (Eds.),
Arthur Jensen: Consensus and controversy (pp. 297-307). New York: The Falmer Press.
Carroll, J. B. (1989). Factor analysis since Spearman: Where do we stand? What do we know? In R. Kanfer, P. L. Acker-
man, & R. Cudeck (Eds.), The Minnesota symposium on learning and individual differences: Abilities, motivation, and
methodology (pp. 43-57). Hillsdale, NJ: Lawrence Erlbaum.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge, England: Cambridge
University Press.
Cattell, R. B. (1940). Effects of human fertility trends upon the distribution of intelligence and culture. In G. M. Whipple
(Ed.), The thirty-ninth yearbook of the National Society for the Study of Education: Intelligence: Its nature and nurture
(pp. 221-234). Bloomington, IL: Public School Publishing Company.
Chapman, P. D. (1988). Schools as sorters: Lewis M. Terman, applied psychology, and the intelligence testing movement,
1890-1930. New York: New York University Press.
Cizek, G. J. (1993). Rethinking psychometricians' beliefs about learning. Educational Researcher, 22, 4-9.
Cronbach, L. J. (1975). Five decades of public controversy over mental testing. American Psychologist, 30, 1-14.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: Harper & Row.
Cronbach, L. J., & Schaeffer, G. A. ( 1981 ). Extensions of personnel selection theory to aspects of minority hiring (Report
8 l-A2). Stanford, CA: Stanford University, Institute for Educational Finance and Governance.
Cronbach, L. J., & Snow, R. E. (1977). Aptitudes and instructional methods: A handbook for research on interactions.
New York: Irvington.
Cubberley, E. P. (1909). Changing conceptions of education. Boston: Houghton Mifflin.
Damasio, A. R. (1994). Descartes' error: Emotion, reason, and the human brain. New York: Putnam.
Darwin, C. (1859). The origin of species by means of natural selection, or, the preservation of favored races in the strug-
gle for life. London: J. Murray.
Davis, A. (1949). Poor people have brains, too. Phi Delta Kappan, 30, 294-295.
De Tocqueville, A. (1945). Democracy in America: 1. New York: Vintage Books.
Evans, B., & Waites, B. (1981). IQ and mental testing: An unnatural science and its social history. Atlantic Highlands,
NJ: Humanities Press.
Fancher, R. E. (1985). The intelligence men: Makers of the IQ controversy. New York, NY: W.W. Norton.
Fleishman, E. A., & Hempel, W. E. (1954). Changes in the factor structure of a complex psychomotor test as a function
of practice. Psychometrika, 19, 239-252.
Frederiksen, N. R. (1981). The real test bias: Influences of testing on teaching and learning. American Psychologist, 39,
193-202.
Freeman, F. N. (1926). Mental tests: Their history, principles and application. Boston: Houghton Mifflin.
Galton, F. (1978). Hereditary genius: An inquiry into its laws and consequences. New York: St. Martin's Press (Orig.
work published 1869).
Gardner, J. W. (1961). Excellence: Can we be equal and excellent too? New York: Harper and Brothers.
Gould, S. J. (1981). The mismeasure of man. New York: W.W. Norton.
Grant, M. (1916). The passing of the great race; or, the racial basis of European history. New York: Scribner.
Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.
Hearnshaw, L. S. (1979). Cyril Burt, psychologist. Ithaca, NY: Cornell University Press.
Herrnstein, R. J., & Murray, C. (1994). The bell curve: Intelligence and class structure in American life. New York: Free
Press.
Irvine, S. H., & Berry, J. W. (1988). The abilities of mankind: A revaluation. In S. H. Irvine & J. W. Berry (Eds.), Human
abilities in cultural context (pp. 3-59). Cambridge, England: Cambridge University Press.
Jensen, A. (1982). Reaction time and psychometric g. In H.J. Eysenck (Ed.), A model for intelligence (pp. 93-132). New
York: Springer.
Joncich, G. M. (1968). The sane positivist: A biography of Edward L. Thorndike. Middletown, CT: Wesleyan University
Press.
Kaus, M. (1992). The end of equality. New York: BasicBooks.
Koch, S. (1959). Psychology: A study of a science. New York: McGraw-Hill.
Kuhn, T. S. (1970). The structure of scientific revolutions. Chicago: University of Chicago Press.
Kyllonen, P. C., Lohman, D. F., & Woltz, D. J. (1984). Componential modeling of alternative strategies for performing
spatial tasks. Journal of Educational Psychology, 76, 1325-1345.
Laboratory of Comparative Human Cognition (1982). Culture and intelligence. In R. J. Sternberg (Ed.), Handbook of
human intelligence (pp. 642-719). New York: Cambridge University Press.
Lippmann, W. (1976). The abuse of the tests. In N. J. Block & G. Dworkin (Eds.), The IQ controversy: Critical readings
(pp. 18-20). New York, NY: Pantheon Books.
Lohman, D. E & Ippel, M. J. (1993). Cognitive diagnosis: From statistically-based assessment toward theory-based
Educational Testing and Assessment 377

assessment. In N. Frederiksen, R. Mislevy, I. Bejar (Eds.), Test theory for a new generation oJ'tests (pp. 41-71 ). Hills-
dale, N J: Erlbaum.
Lohman, D. E, & Rocklin, T. (1995). Current and recurring issues in the assessment of intelligence and personality. In
D. H. Saklofske & M. Zeidner (Eds.), International handbook of personality and intelligence (pp. 447--474). New
York: Plenum.
McNemar, Q. (1964). Lost: Our intelligence? Why? American Psychologist, 19, 871-882.
Pick, D. (1989). Faces of degeneration: A European disorder, c. 1848-c. 1918. Cambridge, UK: Cambridge University
Press.
Popper, K. R. (1963). Conjectures and refutations: The growth of scientific knowledge. New York: Harper and Row.
Samelson, F. (1979). Putting psychology on the map: Ideology and intelligence testing. In A. R. Buss (Ed.), Psychology
in social context (pp. 103-168). New York, NY: Halsted Press.
Schama, S. (1987). The embarrassment of riches: An interpretation of Dutch culture in The Golden Age. London: Wil-
liam Collins Sons and Co. Ltd.
Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1981). Task differences and validity of aptitude tests in selection: A red
herring. Journal of Applied Psychology, 66, 166-185.
Shepard, L. A. (1991). Psychometricians' beliefs about learning. Educational Researcher, 20, 2-9.
Slife, B. D., & Williams, R. N. (1995). What's behind the research? Discovering hidden assumptions in the behavioral
sciences. Thousand Oaks, CA: Sage Publications.
Spearman, C. E. (1904). "General intelligence" objectively determined and measured. American Journal of Psychology,
15, 201-293.
Spearman, C. E. (1904b). The proof and measurement of association between two things. American Journal of Psychol-
ogy, 15, 72-101.
Spearman, C. E. (1923). The nature of intelligence and the principles of cognition. London: Macmillan.
Spearman, C. E. (1930). Autobiography. In C. Murchison (Ed.), A history of psychology in autobiography (Vol. 1, pp.
299-334). Worcester, MA: Clark University Press.
Spencer, H. (1897). The principles of psychology. New York: Appleton and Company.
Stepan, N. (1985). Biology and degeneration: Races and proper places. In J. E. Chamberlin & S. L. Gilman (Eds.),
Degeneration: The dark side of progress (pp. 97-120). New York: Columbia University Press.
Sternberg, R. J. (1985). Beyond IQ: A triarchic theory of human intelligence. Cambridge, England: Cambridge University
Press.
Sternberg, R. J. (1990). Metaphors of mind: Conceptions of the nature of intelligence. Cambridge, UK: Cambridge
University Press.
Tawney, R. H. (1952). Equality (4th edition). London: George Allen and Unwin, Ltd.
Terman, L. M. (1906). Genius and stupidity: A study of some of the intellectual processes of seven "bright" and seven
"'stupid" boys. Pedagogical Seminary, 13, 307-373.
Terman, L. M. (1922). Were we born that way? World's Work, 5, 655-660.
Thorndike, E. L., Bregman, E. O., Cobb, M. V., & Woodyard, E. (1926). The measurement of intelligence. New York:
Columbia University, Teachers College.
Thorndike, R. M., & Lohman, D. F. (1990). A century of ability testing. Chicago: Riverside Press.
Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs, No. 1.
Tyack, D. B. (1974). The one best system: A history of American urban education. Cambridge, MA: Harvard University
Press.
Vernon, P. E. (1973). Multivariate approaches to the study of cognitive styles. In J. R. Royce (Ed.), Multivariate analysis
and psychological theory (pp. 125-148). New York: Academic Press.
Watson, J. D. (1980). The Double Helix: A personal account of the discovery of DNA. New York:
Weber, M. (1958). The Protestant ethic and the spirit of capitalism. (T. Parsons, Trans.) New York: Scribner. (original
work published 1904)
White, S. H. (1977). Social implications of IQ. In P. L. Houts (Ed.), The myth of measurability (pp. 23-44). New York,
NY: Hart Publishing Company.
Wolf, T. (1973). Alfred Binet. Chicago: University of Chicago Press.
Wooldridge, A. (1994). Measuring the mind: Education and psychology in England, c. 1860-c. 1990. Cambridge, UK:
Cambridge University Press.
Yeats, W. B. (1949). "A dialogue of self and soul" in The collected poems of W. B. Yeats. New York: Macmillan.
Yerkes, R. M., & Anderson, H. M. (1915). The importance of social status as indicated by the results of the point-scale
method of measuring mental capacity. Journal of Educational Psychology, 6, 137-150.
Yerkes, R. M., Bridges, J. W., & Hardwick, R. S. (1915). A point scale for measuring ability. Baltimore: Warwick &
York.

Biography

David F. Lohman is Professor of Educational Psychology and Chair of the Division of Psychologi-
cal and Quantitative Foundations at the University of Iowa. His major research interest is in the
measurement of individual differences in abilities, particularly those abilities required by and
developed through formal schooling.
CHAPTER 3

FUTURE DIRECTIONS FOR NORM-REFERENCED AND
CRITERION-REFERENCED ACHIEVEMENT TESTING*

RONALD K. HAMBLETON and STEPHEN G. SIRECI

University of Massachusetts, Box 34140, 152 Hills House South, Amherst, MA 01003-4140,
U.S.A.

If there were any period one would desire to be born in, is it not the age of revolution; when the old and
the new stand side by side and admit of being compared; when the energies of all ... are searched by fear
and by hope; when the historic glories of the old can be compensated by the rich possibilities of the new
era? This time, like all times, is a very good one, if we but know what to do with it. (Ralph Waldo Emerson,
Phi Beta Kappa Address, 1837)

As we approach the next century, the field of achievement testing is experiencing rapid develop-
ment and changes. Contemporary psychometricians are confronted with the need to measure new
cognitive outcomes of schooling, especially higher-level cognitive skills such as problem-
solving, reasoning, and critical thinking. In addition, there is a great need for new measurement
models, new conceptualizations of test score reliability and validity, and new modes for assess-
ment. New computer technology, too, is going to have a profound impact on achievement testing
practices. Similar to the expansion of opportunities in higher education described above by Emer-
son, current directions and needs in education create new challenges and opportunities for future
research and practices in achievement testing.
The impetus for achievement testing changes and improvements is coming, in part, from policy-
makers who hold the view that schools need to prepare students better to help them function in
the workplace and to insure national economic success in the next century. For example, in the
United States, the relatively low performance of American students on recent international
comparative studies of achievement is used by policy-makers to support their pressure for improve-
ments in the organization of schools, school leadership, school curricula, and instructional practices
(Lapointe, Mead, & Askew, 1992; Lapointe, Mead, & Phillips, 1989).

New Directions in Achievement Testing: An Overview

The various trends and changes in achievement testing can be summarized, generally, into six
areas. These areas represent changes in the practice of achievement testing that have emerged

*Laboratory of Psychometric and Evaluative Research Report No. 296, University of Massachusetts Amherst, School
of Education.

either in response to (1) the current needs of educators and policy-makers, (2) emerging computer
technology, or (3) the availability of newer theories and models in cognitive psychology, educa-
tion, and psychometric methods. The general trends currently observable in the achievement test-
ing field are:
1. increased use of constructed response item formats as opposed to selected response or forced-
choice item formats;
2. increased emphasis on measuring higher-order cognitive skills as opposed to lower-level
cognitive skills;
3. a focus on criterion-referenced achievement tests and interpretations of test scores in addi-
tion to or instead of norm-referenced achievement tests and test score interpretations;
4. increased sophistication of measurement models, such as those which can (a) estimate
examinee ability or abilities underlying achievement test performance independent of the
particular choice of test items or assessment materials, (b) handle polytomously-scored data,
(c) place examinee abilities from non-parallel tests onto a common scale or scales for report-
ing and both norm-referenced and criterion-referenced score interpretations, and (d) provide
estimates of the error in test scores due to factors such as raters, tasks, and mode of testing;
5. increased rigor in evaluating the reliability and validity of the inferences derived from achieve-
ment test scores; and
6. incorporation of computer technology into the development, administration, scoring, and
reporting of achievement tests.

The purpose of the paper is to describe the changes taking place in these six areas. Each of these
emerging areas, which will impact substantially on achievement testing in the next century, will
be considered in separate sections which follow.

The Shift Toward Constructed-Response Item Formats

A difficult problem confronting psychometricians in the early twentieth century was finding
ways to economically score the increasingly large number of test batteries which were being
administered. In the 1950s, E. F. Lindquist at the University of Iowa invented an optical scanner
that was able to score a certain type of examinee response mechanically. Examinees were required
to darken ovals on an answer sheet that corresponded to their choice from a list of options, alterna-
tives, or choices associated with a test question. This particular item format became known as the
multiple-choice item.
The capability of the multiple-choice item to be scored objectively added to its already attrac-
tive feature of being able to assess a wide array of content in a relatively short amount of time.
In fact, in Bloom's taxonomy for cognitive outcomes of instruction (see, Bloom, 1956), multiple-
choice items were used to illustrate the assessment of cognitive objectives from the lowest level
(basic knowledge) to the highest level (evaluation). As the use of tests became increasingly popular
in educational, employment, and military settings, the multiple-choice item became the
predominant item format in the United States (Haladyna, 1994).
In recent years, however, the limitations of multiple-choice items have received more atten-
tion than their positive features. The multiple-choice item has been criticized as placing stringent
limits on the types of proficiencies that can be assessed, and consequently diminishing the valid-
ity and utility of test scores (e.g., Kentucky Department of Education, 1993). Critics of the
multiple-choice item format argue too that this format precludes "authentic" assessment of
important outcomes of education and is susceptible to construct irrelevant confounds such as test
wiseness and response bias (Resnick & Klopfer, 1989). Whatever the merits of the arguments for
and against multiple-choice items, new forms of achievement testing have become popular, and
considerable amounts of research have been conducted to investigate alternatives to the multiple-
choice item format and related objectively-scored item formats.
In response to these criticisms of multiple-choice items, and with an overall concern for valid
assessments, recently developed achievement tests in education in the United States often feature
one or more sections that include various types of constructed response item formats. Examples
of constructed response formats include essays, examinee-constructed solutions to problems, short-
answer responses, portfolios, and task completion exercises. These constructed response item
formats for the most part have been in use in instructional settings for a long time. It is only
recently that they are being used as part of large-scale assessments of students, schools, school
districts, states, and countries.
The inclusion of constructed response item formats is evident in several areas of achievement
testing. Many large-scale testing programs in the United States have recently incorporated this
item format. For example, the National Assessment of Educational Progress, the Graduate
Management Admission Test, Medical College Admissions Tests, Tests of General Educational
Development, and the Uniform Certified Public Accountants Examination, all include constructed
response items and this is only a small sample of testing programs incorporating constructed
response item formats. In addition, most state assessment programs in the United States contain
some portion of constructed response items and often that portion is substantial. In Kentucky, for
example, the state assessment consisted of nearly 100% constructed response items. In other
countries such as Spain, the multiple-choice item format is rarely used. Many language proficiency
tests, such as the Hebrew Proficiency Test (NITE, 1996) include an essay component in addition
to multiple-choice items. The Third International Mathematics and Science Study involving nearly
45 countries used a substantial amount of constructed response testing material.
The Scholastic Assessment Test, which is used in college admissions testing in the United States
and administered to two million students a year, introduced an objectively scorable item type
called the grid-in item. For example, candidates may be asked to solve a mathematical problem
and find a numerical solution. This item format requires examinees to darken a set of ovals that
correspond to the numerical solution to the problem. Because this answer is not chosen from a
list of potential answer choices, but rather is arrived at by the examinee, this format greatly reduces
the possibility of answering an item correctly by guessing. Newer forms of constructed response
item formats are also being developed that can be scored by a computer. These item types are
discussed below with respect to technological innovations in achievement testing.
Although constructed response item formats have been used for decades in employment and psychological
testing, they are currently considered to be important for improving the construct validity of infer-
ences derived from achievement test scores as well. That is, they are seen as facilitating a more
comprehensive and realistic assessment of knowledge and skills. This demand for more comprehensive
assessment of educational skills is closely related to another shift in achievement testing: the shift
from the measurement of basic skills to the measurement of higher-order cognitive skills.

Emphasis on Higher-Order Cognitive Skills

Innovations in item formats stem, in large part, from demands for the measurement of higher-
order knowledge and skills. In the United States in the 1970s, federal funding of elementary and
secondary school basic skills programs increased activity in the measurement of basic academic
skills. Criterion-referenced testing methods, though introduced by Glaser (1963) in the 1960s,
were developed in the 1970s as an important measurement system to support the assessment of
students in relation to well-defined outcomes of education (for an excellent summary of criterion-
referenced test models and methods, readers are referred to Berk, 1984). Contingent upon fund-
ing of many of these federal and state-supported educational programs were annual assessments.
A sharp increase in the development and administration of basic skills tests (i.e., criterion-
referenced achievement tests) throughout the 1970s and 1980s was evident. Simultaneously, the
amount of norm-referenced achievement testing in the country was on the decrease though the
actual amount of norm-referenced testing was high in the 1980s and remains high today.
By the middle of the 1980s, the basic skills testing movement was being sharply criticized for
its perceived negative influences on elementary and secondary curricula. Critics of the basic skills
movement claimed school teachers were "dumbing down" their instruction to "teach towards the
achievement test," and, consequently, higher-order knowledge and skills were not being taught
in the classroom. In the U.S. and other countries, parents and educational policy makers became
dissatisfied with the notion of students achieving basic skills, and called for assessment and attain-
ment of higher-order knowledge and skills.
Results from international comparative studies of educational achievement carried out by the
International Association for the Evaluation of Educational Achievement (IEA) in the areas of
reading, mathematics, and science have been particularly influential in reforming educational
curricula with an expanded emphasis on higher-level cognitive skills and encouraging research
into instructional practices. For one example of the impact of findings from these studies on
mathematics and science curricula in the United States, readers are referred to a book by Schmidt,
McKnight, and Raizen (1997). For a second example, Ethington (1990) writes, "The recently
completed Second International Mathematics Study, a comprehensive survey of the teaching and
learning of mathematics in the schools of some twenty countries, is giving impetus to research in
mathematics education world wide" (p. 103).
In response to these demands, achievement test developers now emphasize measurement of
both higher-order and basic skills. One quote from the latest technical manual for the Metropolitan
Achievement Tests (Seventh Edition) published by the Psychological Corporation in the United
States is reflective of the changes currently taking place in nationally administered achievement
tests:

Furthermore, there has sometimes been an underemphasis on including in these [standardized achieve-
ment test] batteries items that assessed the various high-order cognitive processes and instead including a
preponderance of knowledge or recall items. As a result [our new battery] was planned ... to include a
greater number of items assessing higher-order thinking skills than has ever been on this kind of test
before. (Psychological Corporation, 1993, p. 12)

Setting performance standards on these higher-level assessments has also become a major activ-
ity for measurement specialists. For example, the U.S. National Assessment of Educational
Progress (NAEP) switched from an anchor level reporting model, which was used in describing
what it was that students could do, to a standards-based reporting model, which was intended to
specify what students should be able to do and then reporting the percentages of students meet-
ing the various performance levels (Phillips et al., 1993). The standards-based reporting model
comprises three pre-specified levels of student achievement: basic, proficient, and advanced.
Other large-scale achievement tests have also focused on the assessment of higher-order knowledge
and skills. For example, the Graduate Records Exam, which has a long history of successfully
measuring higher-order skills via the multiple-choice item format and is used in the United States
as a selection test by graduate schools, is exploring the use of sorting tasks for assessing
representational components of quantitative proficiency (Bennett, Sebrechts, & Rock, 1995).
Similarly, the recent major revision of the SAT included longer reading passages on the verbal
section, the assessment of higher level thinking skills such as critical reasoning and problem-
solving, and grid-in items and calculator use on the quantitative section, to better measure higher-
order verbal and quantitative knowledge and skills.

Demands for Criterion-Referenced Tests and Score Interpretations

Another criticism of the basic skills testing movement from the 1970s and 1980s was that
students' test scores were often interpreted with respect to the performance of other students,
rather than with respect to their attainment of desired educational objectives. Evaluations of
students and educational programs tended to focus on students' national and local percentile rank
scores, stanines, grade equivalent scores, etc. rather than on scores that were linked to mastery of
important outcomes of instruction (often called "objectives"). Teachers and students often desired
diagnostic information for improving performance. Well-constructed criterion-referenced tests
could meet the need (see, for example, Gregoire, 1997, for an excellent review of diagnostic
testing based upon emerging cognitive models of learning).
Currently, the measurement of what students know and can do with respect to well-defined
areas of academic knowledge and skill is considered to be of paramount importance in education
(Hambleton, 1996). Interestingly, though terms such as "performance assessment" or "authentic
assessment" are found everywhere in the measurement literature today, these types of tests are
criterion-referenced in their purpose, design, evaluation, and use (see, for example, Hambleton,
1994, 1996). The main differences between performance tests today and criterion-referenced tests
of the 1970s and 1980s are (1) performance tests are typically focused on the assessment of
higher-level cognitive skills and CRTs were not; and (2) performance tests typically use constructed
response formats whereas CRTs made frequent use of the multiple-choice item format. But the
differences are not fundamental. Both are focused on assessing what it is students know and can
do and the potential is present for using the available test scores for diagnostic purposes. It is
unfortunate that the concept of criterion-referenced achievement testing became incorrectly associ-
ated with basic skills assessed by multiple-choice items.
Norm-referenced and criterion-referenced approaches to test score interpretation both success-
fully increase the meaningfulness and utility of test scores. Norm-referenced approaches are use-
ful for describing test results in relation to the performance of one or more specific reference
groups, called norm groups, who take the same test. Criterion-referenced approaches describe
test scores with respect to specific knowledge and skill areas in which students demonstrate
mastery.
Recent charges that the educational system is not providing students with the skills they need
have focused attention on criterion-referenced achievement test scores. Criterion-referenced
interpretations of test scores involve clearly defining the objectives measured on achievement
tests, and identifying pre-established standards of expected performance. It seems clear that the
next generation of school tests will be criterion-referenced achievement tests intended, principally,
to facilitate criterion-referenced score interpretations of academic progress.
Standards on criterion-referenced tests are typically developed using subject matter experts
(SMEs) to carefully define the variables to be measured, and to determine scores that reflect
significant levels of the characteristic measured. Many criterion-referenced tests, such as licen-
sure, certification, and graduation tests, incorporate pre-established standards of performance.
Test takers who score above the passing standard are awarded a license, credential, or diploma,
while those who do not reach the standard are not. Some tests, such as the NAEP described
above, have multiple standards. Simple pass/fail dichotomies are replaced by more precise
categories of achievement such as remedial, proficient, and advanced.
The process of determining the test scores that correspond to different levels of performance
is called standard setting. Standard setting on norm-referenced tests is accomplished using
percentile rank scores. For example, a scholarship may be awarded to any examinee who finishes
above the 90th percentile. The obvious disadvantage of norm-referenced passing standards is
that classification decisions vary primarily as a function of the characteristics of the norm group.
The top 10% may represent an overly-restrictive cutoff in a high-proficiency norm group, or may
be too lenient if the proficiency of the norm group is generally low.
Norm-referenced standards make no sense in instructional contexts where the focus must be
on mastery of the content. Passing or failing or being identified as (say) Advanced, Proficient,
Basic, or Below Basic, must depend on level of performance in relation to the pre-specified
standards and not on the performance of other persons taking the test. Thus, in principle, it is
possible for every examinee to be placed in the same performance category. Examinee achieve-
ment dictates the distribution of examinees in the performance categories. Therefore, criterion-
referenced standard-setting procedures are currently preferable to norm-referenced standards in
most instructional situations. At the same time, some normative information is often useful in
providing a perspective on student performance and setting standards of performance required of
students. For example, standards which no one could meet, or everyone could, would have little
relevance in evaluating students and schools.
Criterion-referenced standard-setting procedures circumvent the relative norm group problem
by establishing standards independent of students' performance on the test. The most common
criterion-referenced standard-setting procedures use subject matter experts to scrutinize the items
comprising a test and make various judgments regarding the probable performance of "borderline"
students on each item (for a current review, see Hambleton, 1996). Borderline students are typi-
cally defined as students who are considered to be the marginal members of a particular proficiency
grouping (Cizek, 1996; Livingston & Zieky, 1982). For example, in setting the standard to
distinguish between remedial and proficient students, the borderline students would be described
as those who have "just enough" knowledge and skills to be classified as proficient. These item
judgments are summed over items and experts to derive a "cut score" for each desired examinee
classification (see, for example, Cizek, 1996).
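As a rough illustration (not taken from the chapter), the sum-and-average logic of such item-judgment
methods can be sketched in a few lines of Python. The ratings below are hypothetical probabilities
assigned by three experts to five items for a borderline examinee; operational studies typically add
rounds of discussion, normative data, and checks on inter-judge consistency before a cut score is adopted.

# A minimal sketch of the sum-and-average logic described above, using
# hypothetical ratings. Each value is one expert's judged probability that a
# borderline student answers the item correctly.
import numpy as np

# rows = subject matter experts, columns = test items (hypothetical values)
ratings = np.array([
    [0.60, 0.75, 0.40, 0.85, 0.55],   # expert 1
    [0.65, 0.70, 0.50, 0.80, 0.60],   # expert 2
    [0.55, 0.80, 0.45, 0.90, 0.50],   # expert 3
])

# Sum each expert's judgments over items, then average over experts
expert_cut_scores = ratings.sum(axis=1)   # expected raw score per expert
cut_score = expert_cut_scores.mean()      # recommended passing score

print(f"Per-expert cut scores: {expert_cut_scores}")
print(f"Recommended cut score: {cut_score:.2f} out of {ratings.shape[1]} items")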
A more empirical criterion-referenced approach is to retrospectively use respondents' perform-
ance on a criterion to establish standards on the test itself. For example, if a test is designed to
select students for an accelerated mathematics curriculum, the test could be administered to all
students in the year prior to initiation of the curriculum. At the end of the school year, the aver-
age test score of all students who earned exceptional grades in mathematics could be used as the
standard for selecting students for the accelerated program in subsequent years. Another reason-
able strategy might be to use the lowest test score of students who earned exceptional grades.
The procedures for determining criterion-referenced standards have their limitations. Standards
developed using subject matter experts are only as good as the particular group of experts
employed. Standards established using external criteria are typically costly in terms of time and
resources as well as being limited by the variation observed among the respondents used to
establish the cutscores. Thus, there are trade-offs among the different procedures for setting
standards on tests. A major topic for current research is the development of defensible standard-setting
methods.
The emergence of performance assessments in education will require the development of new
standard-setting methods since popular approaches to standard setting by Nedelsky, Angoff, and
Ebel are not suitable. Two new criterion-referenced standard-setting approaches have recently
appeared in the literature. The first is judgmental policy capturing (Jaeger, 1995), which uses
subject matter experts such as teachers to evaluate score profiles (rather than total scores) of
examinees' performance. Students are assigned scores on the tasks which make up the perform-
ance assessment, and then these score profiles are sorted by subject matter experts into perform-
ance categories. Regression models are used to fit the subject matter experts' ratings and determine
performance standards.
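The policy-capturing step itself can be sketched as follows; the score profiles and judge ratings are
hypothetical, and the full procedure described by Jaeger (1995) goes further, deriving performance
standards from the fitted judge policies, which is not shown here.

# A minimal sketch of the policy-capturing step: a regression model is fit to
# one judge's category ratings of examinee score profiles (all data hypothetical).
import numpy as np

# Each row is a score profile on four performance tasks (hypothetical)
profiles = np.array([
    [2, 3, 1, 2],
    [4, 4, 3, 4],
    [1, 2, 1, 1],
    [3, 3, 2, 3],
    [4, 3, 4, 4],
    [2, 2, 2, 1],
])
# The judge's holistic rating of each profile: 0 = basic, 1 = proficient, 2 = advanced
judge_rating = np.array([0, 2, 0, 1, 2, 0])

# Least-squares fit: which tasks drive this judge's classifications?
X = np.column_stack([np.ones(len(profiles)), profiles])
weights, *_ = np.linalg.lstsq(X, judge_rating, rcond=None)
print("Captured policy (intercept + task weights):", np.round(weights, 2))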
A second method uses cluster analysis to partition the population of students who take a test
into meaningful proficiency groupings (Sireci, 1995; Sireci & Robin, 1996). The purpose of this
method is to identify clusters of students that correspond to the proficiency groupings envisioned
by the test developers. A limitation of this method is that an external criterion related to the
construct measured is needed to validate the cluster solution. The judgmental policy capturing
and cluster analysis methods show promise for setting standards on achievement tests, and
highlight the continued interest in the criterion-referenced interpretations of achievement test
scores.
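To make the clustering step concrete, a minimal sketch using hypothetical score profiles and the
scikit-learn KMeans routine might look like the following. Mapping the resulting clusters onto the
intended proficiency categories, and validating that mapping against an external criterion, is the
substantive part of the method and is not shown here.

# A minimal sketch: partition hypothetical examinee score profiles into three
# groups and inspect the group means (the cluster-to-category mapping and the
# external validation step are omitted).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical score profiles on three subtests for 300 students
profiles = np.vstack([
    rng.normal(loc=[10, 12, 9],  scale=2, size=(100, 3)),   # lower-scoring group
    rng.normal(loc=[18, 17, 16], scale=2, size=(100, 3)),   # middle group
    rng.normal(loc=[25, 24, 26], scale=2, size=(100, 3)),   # higher-scoring group
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)
for k in range(3):
    print(f"Cluster {k}: mean profile = {profiles[kmeans.labels_ == k].mean(axis=0).round(1)}")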

Technological Advances in Measurement Models

The genesis of contemporary achievement test theory and practice can be traced to the early
twentieth century with the emergence of the Binet intelligence scales in France and the Army
Alpha tests in the United States (Binet & Simon, 1905; Yerkes, 1921). It was in this era that clas-
sical test theory was developed and used as a formal basis for test development and evaluation
(see, Gulliksen, 1950, for an excellent summary). Around the middle of this century, more
sophisticated measurement theories and models (called "item response theory") began to emerge
in response to perceived shortcomings of classical test theory (Lord, 1952; Lord & Novick, 1968).
Two shortcomings of classical test theory are well-known: First, the item difficulty and item
discrimination statistics which are routinely used in test development are sample dependent. This
means that in highly capable samples of examinees, items appear to be easy and, in less capable
samples, items appear to be more difficult. Also, item discrimination indices are dependent on
score variability in the examinee sample. With less score variability in the sample, items appear
less discriminating; with more score variability in the sample, items appear more discriminating.
Obviously, item statistics which are not dependent on the characteristics of the examinee sample
used in (say) test administration would be more desirable.
A second shortcoming of classical test theory is that examinee test scores are test dependent.
Consequently, the difficulty of the test directly affects examinee test scores. The same examinee
will score high on an easy test and lower on a hard test, despite the fact that this examinee's abil-
ity which underlies performance on the test itself remains fixed. Test dependent scores create
major problems for measurement specialists since there are many occasions when examinees
may need to be compared to each other or to a set of fixed performance standards though they
have not taken identical tests. Test decisions which depend on the vagaries of the particular test
an examinee is administered are not acceptable.
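A small numerical illustration (not from the chapter) makes the point. Here a simple Rasch model is
borrowed only to generate expected number-correct scores for a single examinee on a hypothetical easy
form and hard form; the item difficulties are invented for the example.

# The same examinee's expected number-correct score differs sharply on an
# "easy" and a "hard" form, even though the underlying ability (theta) is fixed.
import numpy as np

def expected_number_correct(theta, difficulties):
    # Rasch probability of a correct response, summed over items
    p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
    return p.sum()

theta = 0.0                                    # one examinee's fixed ability
easy_form = np.array([-2.0, -1.5, -1.0, -0.5, 0.0])
hard_form = np.array([ 0.0,  0.5,  1.0,  1.5, 2.0])

print("Expected score on easy form:", round(expected_number_correct(theta, easy_form), 2))
print("Expected score on hard form:", round(expected_number_correct(theta, hard_form), 2))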
There is a solution within a classical test theoretic framework when only a small number of
tests are constructed. The problem of placing scores from different tests onto a common scale is
known as "equating." At the best of times, though, equating is a complex statistical process.
When, in principle, all examinees are administered their own tests (as is usually the case with
computer-based testing) and these tests may vary substantially in their difficulty (as is the practice
with computer-adaptive tests), equating scores within a classical measurement framework is next
to impossible. Additional shortcomings of classical test theory are reviewed by Hambleton, Swa-
minathan and Rogers (1991).
Item response theory is the general measurement framework today for many large scale test-
ing and assessment programs. The Third International Mathematics and Science Study, for
example, used item response theory for constructing scales to report test scores. Models of test
development based on item response theory (IRT) have seen wide use and acceptance by test
developers and psychometricians for the past several decades (Linn, 1990; Yen, 1983). Similarly,
the advantages of IRT over classical test theory are well documented (see for example, Hamble-
ton, 1989; Hambleton & Swaminathan, 1985; or Hambleton et al., 1991).
When IRT models can be found which fit the achievement test data, item statistics can be
obtained which are independent of the particular sample of examinees used in the estimation
process. Examinee-free item statistics are a great advantage in test development. Also, examinee
ability estimates can be obtained which are independent of the particular choice of items
administered. Thus, examinees may be compared to each other or to a set of performance standards
though the particular tests administered may differ substantially in their difficulty. Yet another
advantage of IRT-based measurement is that the examinees and the item statistics are reported on
the same scale which facilitates test development, test score equating, the identification of
potentially biased items, computer-adaptive testing, and enhanced score reporting. IRT is rapidly
becoming the measurement framework for important testing programs because of the desir-
ability of "invariant" item and examinee statistics when an IRT model can be found to fit the
available test data.
What is new at the present time in the area of measurement models is an extension of well-
accepted IRT models such as the one-, two-, and three-parameter logistic models, in two major
directions: polytomous models and multidimensional models.
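For reference, a standard way of writing the three-parameter logistic model just mentioned (the formula
is not reproduced in the chapter) gives the probability that an examinee with ability \theta answers
item i correctly as

P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp[-D a_i(\theta - b_i)]}

where a_i, b_i, and c_i are the item discrimination, difficulty, and pseudo-guessing parameters and D
is a scaling constant (often set to 1.7). The two-parameter model fixes c_i = 0, and the one-parameter
(Rasch) model additionally holds a_i constant across items.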

Polytomous IRT Models

The first IRT models such as the one-parameter logistic model (Rasch, 1960), two-parameter
logistic model (Birnbaum, 1968), and three-parameter logistic model (Lord & Novick, 1968) are
appropriate for items that are scored dichotomously. However, as mentioned above, constructed
response items, which are often scored polytomously, are becoming increasingly popular. Thus,
current applications of IRT involve polytomous models that represent extensions of the original
IRT models.
The first polytomous model, which is seeing increasing use today, is the graded response model
developed by Samejima (1969). This model is appropriate for polytomous items that are scored
on an ordinal scale, such as essays, constructed response items, or Likert-type items used in assess-
ing attitudes. An additional polytomous model, the nominal-response model, was proposed by
Bock (1972). The nominal response model removes the restriction that the response options be
ordered a priori, and has been applied to testlet-based tests where the same testlet scores can be
arrived at by answering different items correctly (Sireci, Thissen, & Wainer, 1991; Wainer, Sireci,
& Thissen, 1991).

Polytomous extensions of the Rasch model have also been proposed (e.g., Masters, 1982), as
well as models designed for analyzing just about every type of achievement data imaginable
(van der Linden & Hambleton, 1997).

Multidimensional IRT Models

A significant limitation of popular IRT models is the requirement that the ability measured by
the test be unidimensional. The condition of unidimensionality means that there is a dominant
ability or trait which is measured by the items making up the test. This limitation is consequential because
recent work in cognitive psychology and psychometric methods suggests that students'
responses to achievement tasks typically involve more than one cognitive proficiency (Mislevy,
1993; Traub, 1983). It might be added that test developers will sometimes intentionally build in
multidimensionality to reflect the diversity of skills which may be of interest in an achievement
test.
To meet the demands of the multidimensional conceptualizations of student responses,
multidimensional IRT models have recently been proposed (Ackerman, 1994; Reckase, 1985;
Sympson, 1978; van der Linden & Hambleton, 1997). These models score examinees using two
or more dimensions and so they provide greater information regarding the multiple proficiencies
tapped by achievement tests. Although there are two computer programs commercially available
for estimating multidimensional IRT parameters (NOHARM: Fraser & McDonald, 1986; and
TESTFACT: Wilson, Wood, & Gibbons, 1991), there have been very few applications of these
models to achievement test data. Nevertheless, as these models become more familiar to measure-
ment specialists, and as the demands for more complex assessments increase, it is likely that
multidimensional IRT models will see wide application in the future.
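In its compensatory form, the kind most often associated with Reckase (1985), a multidimensional extension of the two-parameter model simply replaces the single proficiency with a vector of proficiencies:

```latex
% Compensatory multidimensional two-parameter logistic model
P_i(\boldsymbol{\theta}) =
  \frac{\exp\bigl(\mathbf{a}_i^{\top}\boldsymbol{\theta} + d_i\bigr)}
       {1 + \exp\bigl(\mathbf{a}_i^{\top}\boldsymbol{\theta} + d_i\bigr)}
% \boldsymbol{\theta} = vector of examinee proficiencies,
% \mathbf{a}_i = vector of item discriminations (one element per dimension),
% d_i = scalar intercept related to overall item difficulty.
```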

Advances in Reliability and Validity Theory

Reliability and validity theory continue to provide the general framework within which test
use and interpretation are evaluated. However, these fundamental concepts have evolved over
the past few decades and currently involve more sophisticated appraisals of educational achieve-
ment than were previously envisioned.

Extensions of Reliability Theory

Reliability theory has been strongly influenced by the evolution of generalizability theory and
IRT. Classical conceptualizations of reliability theory promulgated different types of reliability
coefficients to account for different sources of measurement error. For example, test-retest reli-
ability estimated error due to time sampling, parallel forms reliability estimated error due to
content sampling, and internal consistency reliability estimated error due to content heterogene-
ity (Anastasi, 1988). Each of these "types" of reliability was summarized by its own coefficient.
Generalizability theory (Brennan, 1983; Cronbach, Gleser, Nanda
& Rajaratnam, 1972) extended reliability theory by incorporating all identifiable sources of
measurement error into a single measurement model. In so doing, generalizability analyses provide
a more comprehensive examination of measurement error, and provide a single index of reli-
ability called the g-coefficient. Generalizability theory is now being widely applied to perform-
ance assessments because the approach provides separate estimates of the errors due to the choice
of tasks, scorers, examinee-task interactions, etc.
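As an illustration of how these separate error sources enter a single index, a standard textbook expression for the g-coefficient in a fully crossed person x task x rater design (relative decisions) is:

```latex
% n_t = number of tasks, n_r = number of raters per examinee
E\rho^{2} = \frac{\sigma^{2}_{p}}
                 {\sigma^{2}_{p} + \sigma^{2}_{pt}/n_t
                  + \sigma^{2}_{pr}/n_r + \sigma^{2}_{ptr,e}/(n_t n_r)}
% \sigma^{2}_{p} is the person (universe-score) variance; the remaining terms
% are the person-by-task, person-by-rater, and residual error components.
```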
IRT models also influenced the extension of traditional notions of reliability. One advantage of
IRT over classical test theory, and not mentioned previously, is that the precision of measurement
is easily evaluated at both the item and total score level at any point along the underlying distribu-
tion of proficiency. Thus, conditional estimates of measurement error (e.g., conditional at a specific
point or interval on the score scale) provide more information regarding test score quality than
do estimates of average precision such as internal consistency or g-coefficients. Using IRT, these
conditional estimates of error are obtained by taking the inverse of item information, or test
information at the particular point of interest on the IRT scale. This feature of IRT generated
similar work in classical test theory, and so estimates of conditional standard errors of measure-
ment based on classical test theory are also currently available (Kolen, Hanson, & Brennan, 1992;
Livingston, 1982; Lord, 1984).
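Stated compactly, the test information function is the sum of the item information functions, and the conditional standard error at a given proficiency is the reciprocal of the square root of that information; the item information for the two-parameter model is shown as an example.

```latex
I(\theta) = \sum_{i} I_i(\theta),
\qquad
SE(\hat{\theta} \mid \theta) \approx \frac{1}{\sqrt{I(\theta)}}
% For the 2PL, I_i(\theta) = D^{2} a_i^{2} P_i(\theta)\bigl(1 - P_i(\theta)\bigr),
% so measurement precision can be evaluated at any point on the proficiency scale.
```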
Another area in which IRT has influenced reliability theory is in the evaluation of local item
dependence (LID). LID refers to construct-irrelevant covariance among test items that leads to
inflated estimates of measurement accuracy (Sireci et al., 1991; Thissen, Steinberg, & Mooney,
1989; Yen, 1993). This inflation may occur when (1) sets of items relate to a common stimulus
(if you do well on one item then you are likely to do well on other items in the set), (2) items
provide clues to the answers to other items in the test, or (3) items have built-in dependencies so
that failing one item almost certainly results in failure on another item. Although similar concerns
were raised by early psychometricians who were concerned about the most appropriate way to
divide tests when calculating split-half reliability (Guilford, 1936; Thorndike, 1951), IRT analyses
of LID have renewed concerns about redundancy among test items and its impact on measure-
ment error.
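One widely used diagnostic in this line of work is Yen's Q3 statistic, the correlation between pairs of item residuals once the fitted IRT model has been taken into account. The sketch below is illustrative only; it assumes a fitted unidimensional model made available through a user-supplied item response function (the callable irf is hypothetical, not part of any particular package).

```python
import numpy as np

def q3_matrix(responses, theta_hat, irf):
    """Yen's Q3 diagnostic for local item dependence (illustrative sketch).

    responses : (n_examinees, n_items) array of 0/1 item scores
    theta_hat : (n_examinees,) array of estimated proficiencies
    irf       : callable irf(theta, item_index) -> model probability of a
                correct response, assumed to be supplied by the user

    Returns the item-by-item matrix of residual correlations; markedly
    positive off-diagonal values flag item pairs whose covariation is not
    explained by the fitted unidimensional model.
    """
    responses = np.asarray(responses, dtype=float)
    n_persons, n_items = responses.shape
    expected = np.array([[irf(t, j) for j in range(n_items)]
                         for t in theta_hat])
    residuals = responses - expected      # d_ij = u_ij - P_j(theta_i)
    return np.corrcoef(residuals, rowvar=False)
```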

Recent Advances in Validity Theory

Contemporary discussions of test validity emphasize two important precepts that are relatively
recent in the evolution of validity theory. First, validity does not refer to a property of a test;
rather, it refers to the meaningfulness and appropriateness of inferences derived from test scores.
Second, the validity of inferences derived from test scores is not a direct consequence of the test
development process. Test users share the responsibility of appropriate test use and interpretation
with test developers. These two fundamental precepts of validity emphasize that the validity of
test scores must be evaluated with respect to the purpose of the testing and weighed against the util-
ity and consequences (both intended and unintended) of testing (Messick, 1989).
Traditional conceptualizations of test validity described validity in terms of three distinct facets,
or evidential areas: content validity, criterion-related validity, and construct validity. Each of these
facets remains important for contemporary achievement testing. However, most theoretical descrip-
tions of validity describe construct validity as validity in general, subsuming the content and
criterion-related facets. Regardless of psychometric nomenclature, the ability of achievement tests
to represent the content domains they are designed to measure, predict the behaviors they are
designed to predict, and measure the underlying constructs to which they are targeted, continues
to be the primary concern of validity evaluation. The introduction of new forms of assessment,
new media through which tests are administered, and the use of new models all require evidence
of their validity prior to wide-scale use. For example, major questions surround the validity of
many of the new forms of assessment such as portfolio assessments and performance testing.
Technical developments in measurement theory have also broadened the area of validity
research. For example, within the past decade, the field of item bias has been largely re-organized
around the notion of differential item functioning (DIF) and statistical models used to measure it
(Hambleton, Clauser, Mazor, & Jones, 1993; Holland & Thayer, 1988; Holland & Wainer, 1993;
Swaminathan & Rogers, 1990). Many other applications of newer statistical technology have
also expanded traditional forms of validity evidence. For example, multidimensional scaling
models have been applied to investigations of content validity (Sireci & Geisinger, 1992, 1995),
and structural equation modeling is now widely applied to investigations of construct validity
(e.g., Vance, MacCallum, Coovert, & Hedge, 1988).
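As an illustration of how such DIF analyses proceed, the sketch below follows the general logic of the logistic regression approach of Swaminathan and Rogers (1990): the studied item is modeled from the matching test score, group membership, and their interaction, and a likelihood-ratio test compares the augmented model with a score-only model. The variable names are hypothetical, and the sketch assumes the statsmodels and scipy libraries are available.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def logistic_regression_dif(item, total, group):
    """Screen one item for uniform and non-uniform DIF (illustrative sketch).

    item  : array of 0/1 responses to the studied item
    total : total test score used as the matching criterion
    group : 0 = reference group, 1 = focal group
    """
    item = np.asarray(item, dtype=float)
    total = np.asarray(total, dtype=float)
    group = np.asarray(group, dtype=float)

    score_only = sm.add_constant(total)
    augmented = sm.add_constant(np.column_stack([total, group, total * group]))

    ll_reduced = sm.Logit(item, score_only).fit(disp=0).llf
    ll_full = sm.Logit(item, augmented).fit(disp=0).llf

    g2 = 2.0 * (ll_full - ll_reduced)   # likelihood-ratio statistic, df = 2
    return g2, chi2.sf(g2, df=2)        # large g2 / small p suggests DIF
```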

The Influence of Computer Technology on Achievement Testing

It is likely that history will regard the close of the twentieth century as a period of rapid
technological development. The most dramatic presence of technology in achievement testing is
the role of computers in the development, administration, and scoring of achievement tests.
With respect to test development, computers have long been used to store items in large databases
called item banks. Today, the role of computers in test development goes far beyond item
banking activities. In fact, computers are currently used to build tests by selecting items from an
item bank according to complex item selection criteria. Because these criteria are developed to
optimize testing efficiency, computer-assisted test development is often termed optimal test design
(OTD) (van der Linden & Boekkooi-Timminga, 1989).
The ConTEST software developed by van der Linden and his colleagues (Timminga, van der
Linden, & Schweizer, 1996) provides an example of how OTD systems work. Items are entered
into the item bank where they are indexed according to a myriad of statistical (e.g., IRT item
parameters) and content (e.g., content area specifications) characteristics. A computerized
algorithm is then used to select items according to the multi-faceted demands of the test developer.
For example, the test developer may specify a target test information function, content specifica-
tions, and number of test items to be administered. Other important criteria, such as the number
of graphic items, number of passage-based items, and controls of item exposure rates may also
be specified. In addition, restrictions on inclusion of item pairs that may clue one another, or
undesirable context effects (e.g., all passage-based items relate to nineteenth century literature)
can be assimilated into the algorithm. The end result is an objectively developed test, put together
with minimal personnel cost, that is ready for inspection and revision. Automated test construc-
tion as represented in the work of van der Linden and Timminga (see, for example, Timminga &
van der Linden, in press) represents the next generation of large scale test development.
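The sketch below conveys, in deliberately simplified form, the kind of blueprint-constrained item selection such systems perform. Operational programs such as ConTEST formulate the problem as a 0-1 linear program against a full target information function; here a greedy heuristic stands in for that machinery, and all names are hypothetical.

```python
import numpy as np

def assemble_test(item_info, content_area, blueprint, test_length):
    """Toy blueprint-constrained test assembly (illustrative only).

    item_info    : (n_items,) information of each bank item at a target ability
    content_area : (n_items,) content label for each item
    blueprint    : dict mapping content label -> number of items required
    test_length  : total number of items to select
    """
    remaining = dict(blueprint)
    selected = []
    for j in np.argsort(item_info)[::-1]:          # most informative first
        area = content_area[j]
        if len(selected) < test_length and remaining.get(area, 0) > 0:
            selected.append(int(j))
            remaining[area] -= 1
    return selected

# Example: a 4-item test with two algebra and two geometry items.
# info = np.array([0.9, 0.3, 0.8, 0.6, 0.7, 0.2])
# areas = ["algebra", "geometry", "algebra", "geometry", "algebra", "geometry"]
# assemble_test(info, areas, {"algebra": 2, "geometry": 2}, 4)
```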
Computers have also revolutionized the area of test administration. Delivering tests on comput-
ers minimizes test security risks while creating considerably more flexibility in test scheduling
(in principle, candidates can take the tests when they feel ready or it is convenient for them to get
to a test center). Many testing programs, such as the registered nursing licensing exam in the
U.S., formerly had a limited number of test administration dates that involved shipping thousands
of heavy boxes to far away locations. Today, these tests are administered daily without large
boxes involved. Many tests delivered using a computer are also scored by the computer, allow-
ing for test results to be immediate.
Furthermore, the interactive features of computers have made it possible to assess a much
wider variety of skills than is available using the traditional paper and pencil format. For example,
prototypes of the United States Medical Licensing Exam require medical licensure candidates to
treat hypothetical patients "on-line." The condition of the patient changes in accordance with the
specific treatment administered by the doctor. Similarly, tests of managerial skills, where
examinees are required to respond to a scenario played out on the computer screen, are also
operational (Drasgow, 1994). A few years ago, these types of skills were considered measurable
only by using professional actors and expert observers. Today, these skills are measurable using
tests that are administered and scored using computers.
Perhaps the most dramatic contribution of technology to achievement testing is computerized
adaptive testing (CAT) (Wainer, 1990). Computerized adaptive tests are assessments delivered
on computer that, theoretically, are unique for each examinee. After an examinee responds to an
initial pre-selected question (or set of questions), an item selection algorithm determines the next
question (or set of questions) to be administered.
Item selection depends primarily on two factors: the current proficiency estimate of the examinee
(as determined by performance on earlier items), and the purpose of the assessment. Many dif-
ferent item selection algorithms are possible; however, all generally conform to one of two models.
One model attempts to find the best location for the examinee on the score scale; the other model
attempts to determine whether an examinee is above or below a specific point on the score scale.
The first model aims towards minimizing error around the examinee's score on the IRT scale.
The second model focuses on minimizing error around a specific cutscore such as a passing score.
All CATs currently use IRT to select items and place the unique tests for different examinees on
the same scale (see Wainer, 1990).
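A minimal sketch of the first rule, choosing the item with maximum information at the current proficiency estimate, is shown below under the assumption of two-parameter logistic item parameters; operational CATs layer content balancing and exposure control on top of this basic rule, and the names used here are hypothetical.

```python
import numpy as np

def next_cat_item(theta_hat, a, b, administered):
    """Select the next item in a maximum-information CAT (illustrative sketch).

    theta_hat    : current proficiency estimate for the examinee
    a, b         : arrays of 2PL discrimination and difficulty parameters
    administered : indices of items already given to this examinee
    """
    p = 1.0 / (1.0 + np.exp(-1.7 * a * (theta_hat - b)))   # 2PL probabilities
    info = (1.7 * a) ** 2 * p * (1.0 - p)                  # item information
    info[list(administered)] = -np.inf                     # mask used items
    return int(np.argmax(info))
```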
Because CATs tailor the test to each examinee, they tend to reduce testing time and facilitate
examinee motivation. A variation on the CAT approach is self adaptive testing where examinees
choose whether they would like to receive an easier or more difficult item than the one they just
answered (Wise, Plake, Johnson, & Roos, 1992). Self-adaptive tests have been shown to be slightly
less efficient than CATs, but appear to reduce test anxiety and give examinees the perception of
control over the assessment situation (Wise et al., 1992).
The use of computers to score tests has also revolutionized the field of achievement testing.
Just as the development of the optical scanner in the 1950s sparked the popularity of the multiple-
choice item, the ability of the computer to score more complex responses has fueled the demand
for free-response items on large scale tests. Computers have recently been shown to score essays
on writing skills tests with surprisingly high levels of reliability (Page & Petersen, 1995). In
addition, computers are being used on licensure tests to score simulated patient vignettes (Clauser,
Subhiyah, Nungester, Ripkey, Clyman, & McKinley, 1995) and buildings created by architectural
licensure candidates (Bejar & Braun, 1994).

Conclusions

It seems clear that achievement testing in the next century is going to be very different from what it
is today. Computer technology will certainly play a major role in all aspects of
assessment: test construction, test administration (both fixed length and adaptive), scoring, and
score reporting. The computer, too, will open up new possibilities for valid assessments by permit-
ting formats which incorporate visual and audio components and by permitting tests to be adapted
to individual ability levels.
Strong test models such as those within the broad framework of item response theory will
undoubtedly provide the theoretical underpinnings for educational and psychological assess-
ments in the future and multidimensional reporting of individual performance will become more
common. There is already evidence in the psychological testing literature that score profile report-
ing (i.e., multidimensional score reporting) of performance is enhancing the diagnostic useful-
ness of intelligence tests. More work in this direction can be expected.
Finally, there remains no shortage of problems which will need to be overcome for achieve-
ment tests to reach their full potential. For example, cognitive psychologists are capable of generat-
ing an important array of new variables which will be difficult to measure. Also, as the world
becomes smaller and persons more mobile, tests suitable for use in multiple languages
and cultures are needed. But adapting tests to be equivalent in multiple language groups and
cultures may not always be possible, and even in the best situations, will be difficult to do well.
These are just a few of the challenges that await measurement specialists in the next century. At
the same time, given the progress which has been made in the last 100 years, it is exciting to
contemplate the progress which will be made in the next 100 years!

References

Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and tests are measuring.
Applied Measurement in Education, 7, 255-278.
Anastasi, A. (1988). Psychological testing. New York: Macmillan.
Bejar, I., & Braun, H. I. (1994). On the synergy between assessment and instruction: Early lessons from computer-based
simulations. Machine Assisted Learning, 4, 5-25.
Bennett, R. E., Sebrechts, M. M., & Rock, D. A. (1995). A task type for measuring the representational component of
quantitative proficiency (ETS Research Report No. 95-19). Princeton, NJ: Educational Testing Service.
Berk, R. (Ed.). (1984). A guide to criterion-referenced test construction. Baltimore: Johns Hopkins University Press.
Binet, A., & Simon, T. (1905). Méthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux. Année Psy-
chologique, 11, 191-244.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R.
Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bloom, B. S. (Ed.) (1956). Taxonomy of educational objectives. New York: David McKay Company, Inc.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal
categories. Psychometrika, 37, 29-51.
Brennan, R. L. (1983). Elements ofgeneralizability theory. Iowa City, IA: American College Testing Program.
Cizek, G. J. (1996). Setting passing scores. [An NCME instructional module]. Educational Measurement: Issues and
Practice, 15(2), 20-31.
Clauser, B. E., Subhiyah, R. G., Nungester, R. J., Ripkey, D. R., Clyman, S. G., & McKinley, D. (1995). Scoring a
performance-based assessment by modeling the judgments of experts. Journal of Educational Measurement, 32, 397-
415.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements:
Theory of generalizability of scores and profiles. New York: John Wiley.
Drasgow, F. (1994, October). Assessment: New directions and new formats. Paper presented at the open conference of the
Joint Committee on the Standards for Educational and Psychological Testing, Crystal City, VA.
Ethington, C. A. (1990). Perspectives on research in mathematics education [special issue]. International Journal of
Educational Research, 14(2), 103-214.
Fraser, C., & McDonald, R. (1986). NOHARM II: A FORTRAN program for fitting unidimensional and multidimensional
normal ogive models of latent trait theory. Armidale, Australia: University of New England, Centre for Behavioral
Studies.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American
Psychologist, 18, 519-521.
Gregoire, J. (1997). Diagnostic assessment of learning disabilities: From assessment of performance to assessment of
competence. European Journal of Psychological Assessment, 12, 10-20.
Guilford, J. P. (1936). Psychometric methods. New York: McGraw-Hill.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Haladyna, T. M. (1994). Developing and validating multiple-choice test items. Hillsdale, NJ: Lawrence Erlbaum.
Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. Linn (Ed.), Educational
measurement (3rd ed., pp. 147-200). Washington, DC: American Council on Education.
Hambleton, R. K. (1994). The rise and fall of criterion-referenced measurement? Educational Measurement: Issues and
Practice, 13(4), 21-26.
Hambleton, R. K. (1996). Advances in assessment models, methods, and practices. In D. C. Berliner & R. C. Calfee
(Eds.), The handbook of educational psychology (pp. 899-925). New York: Macmillan.
Hambleton, R. K., Clauser, B., Mazor, K., & Jones, R. (1993). Advances in the detection of differentially functioning test
items. European Journal of Psychological Assessment, 9(1), 1-18.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: principles and applications. Boston, MA: Kluwer
Academic Publishers.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA:
Sage.
Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer
& H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
Jaeger, R. M. (1995). Setting performance standards through two-stage policy capturing. Applied Measurement in Educa-
tion, 8, 15-40.
Kentucky Department of Education. (1993). Kentucky Instructional Results Information System, 1991-92 Technical Report.
Frankfort, KY: Author.
Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement for scale scores.
Journal of Educational Measurement, 29, 285-307.
Lapointe, A. E., Mead, N. S., & Askew, J. M. (1992). Learning mathematics (Report No. 22-CAEP-01). Princeton, NJ:
Educational Testing Service.
Lapointe, A. E., Mead, N. S., & Phillips, G. W. (1989). A world of differences: An international assessment of mathemat-
ics and science (Report No. 19-CAEP-01). Princeton, NJ: Educational Testing Service.
Linn, R. L. (1990). Has item response theory increased the validity of achievement test scores? Applied Measurement in
Education, 3, 115-141.
Livingston, S. A. (1982). Estimation of the conditional standard error of measurement for stratified tests. Journal of
Educational Measurement, 19, 135-138.
Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards of performance on educational
and occupational tests. Princeton, NJ: Educational Testing Service.
Lord, F. M. (1952). A theory of test scores. Psychometric Monograph, No. 7.
Lord, F. M. (1984). Standard errors of measurement at different ability levels. Journal of Educational Measurement, 21,
239-243.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: American
Council on Education.
Mislevy, R. (1993). A framework for studying differences between multiple-choice and free-response test items. In R. E.
Bennett & W. C. Ward (Eds.), Construction vs. choice in cognitive measurement: Issues in constructed response, perform-
ance testing, and portfolio assessment (pp. 75-106). Hillsdale, NJ: Erlbaum.
National Institute for Testing and Evaluation. (1996). The inter-university psychometric test guide 1996. Jerusalem, Israel:
Author.
Page, E. B., & Petersen, N. S. (1995). The computer moves into essay grading: updating the ancient test. Phi Delta Kap-
pan, 76, 561-565.
Phillips, G. W., Mullis, I. V. S., Bourque, M. L., Williams, E L., Hambleton, R. K., Owen, E. H., & Barton, E E. (1993).
Interpreting NAEP scales. Washington, DC: National Center for Education Statistics, Office of Educational Research
and Improvement.
Psychological Corporation. (1993). Metropolitan Achievement Tests (Seventh Edition) Technical Manual (Spring Data).
Orlando: Author.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for
Educational Research.
Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measure-
ment, 9, 401-412.
Resnick, L. B., & Klopfer, L. E. (1989). Toward the thinking curriculum: Current cognitive research. Alexandria, VA:
Association for Supervision and Curriculum Development.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph
Supplement, 4, Part 2, Whole #17.
Schmidt, W. H., McKnight, C. C., & Raizen, S. A. (1997). A splintered vision: an investigation of U.S. science and
mathematics education. Boston, MA: Kluwer Academic Publishers.
Sireci, S. G. (1995, August). Using cluster analysis to solve the problem of standard setting. Paper presented at the meet-
ing of the American Psychological Association, New York.
Sireci, S. G., & Geisinger, K. F. (1992). Analyzing test content using cluster analysis and multidimensional scaling.
Applied Psychological Measurement, 16, 17-31.
Sireci, S. G., & Geisinger, K. F. (1995). Using subject matter experts to assess content representation: A MDS analysis.
Applied Psychological Measurement, 19, 241-255.
Sireci, S. G., & Robin, F. (1996, June). Setting passing scores on tests using cluster analysis. Paper presented at the
meeting of the Classification Society of North America, Amherst, MA.
Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measure-
ment, 28, 237-247.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures.
Journal of Educational Measurement, 27, 361-370.
Sympson, J. B. (1978). A model for testing multidimensional items. In D. J. Weiss (Ed.), Proceedings of the 1977 computer-
ized adaptive testing conference (pp. 82-98). Minneapolis: University of Minnesota, Department of Psychology,
Psychometric Methods Program.
Thissen, D., Steinberg, L., & Mooney, J. (1989). Trace lines for testlets: A use of multiple categorical models. Journal of
Educational Measurement, 26, 247-260.
Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 560-620). Washington, DC:
American Council on Education.
Timminga, E., & van der Linden, W. J. (in press). Linear optimization models for test construction. Newbury Park, CA:
Sage.
Timminga, E., van der Linden, W. J., & Schweizer, D. A. (1996). ConTEST 2.0. Groningen, The Netherlands: iec ProGAMMA.
Traub, R. E. (1983). A priori considerations in choosing an item response model. In R. K. Hambleton (Ed.), Applications
of item response theory (pp. 57-70). Vancouver, BC: Educational Research Institute of British Columbia.
Vance, R. J., MacCallum, R. C., Coovert, M. D., & Hedge, J. W. (1988). Construct validity of multiple job performance
measures using confirmatory factor analysis. Journal of Applied Psychology, 73, 74-80.
van der Linden, W. J., & Boekkooi-Timminga, E. (1989). A maximin model for test design with practical constraints.
Psychometrika, 54, 237-247.
van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. New York: Springer-
Verlag.
Wainer, H. (Ed.). (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum.
Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal of
Educational Measurement, 28, 197-219.
Wilson, D., Wood, R., & Gibbons, R. (1991). Testfact: Test scoring, item statistics, and item factor analysis, 386/486
version [computer program]. Chicago, IL: Scientific Software, Inc.
Wise, S. L., Plake, B. S., Johnson, P. L., & Roos, L. L. (1992). A comparison of self-adaptive and computerized adaptive
tests. Journal of Educational Measurement, 29, 329-339.
Yen, W. M. (1983). Use of the three-parameter model in the development of a standardized achievement test. In R. K.
Hambleton (Ed.), Applications of item response theory (pp. 123-141). Vancouver, BC: Educational Research Institute
of British Columbia.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of
Educational Measurement, 30, 187-213.
Yerkes, R. M. (Ed.). (1921). Psychological examining in the United States Army. Memoirs of the National Academy of
Sciences, Volume 15.

Biographies

Ronald K. Hambleton is Professor of Education and Psychology and Chairperson of the Research
and Evaluation Methods Program at the University of Massachusetts at Amherst. His research
interests lie in the areas of item response theory and large-scale assessment.
Stephen G. Sireci is an Assistant Professor of Education at the University of Massachusetts at
Amherst. His research interests lie in the areas of test translation methodology, multidimensional
scaling, and test development.
CHAPTER 4

LARGE-SCALE ASSESSMENT IN SUPPORT OF SCHOOL REFORM:
LESSONS IN THE SEARCH FOR ALTERNATIVE MEASURES*

JOAN L. HERMAN

UCLA, Graduate School of Education and Information Studies, CRESST, Los Angeles, CA
90095-1522, U.S.A.

Assessment has long been a cornerstone of educational reform in the United States, fueled by
beliefs in meritocracy, accountability, and the value of programmatic efforts to improve teaching
and learning. Thirty years ago, for example, the passage of the Elementary and Secondary Education
Act brought the federal government into local schools for the first time in support of quality
education for disadvantaged students, and with it the requirement that schools receiving funding
administer standardized tests to determine eligibility and to evaluate the effects of the program.
Fifteen or so years ago, minimum competency testing enjoyed a groundswell of popularity across
the country, mandated by states to assure all students would attain minimum standards of
competence. Within the past decade, the Goals 2000 legislation, advocated by the President and
passed by Congress, encouraged states to set rigorous standards for student performance and to
assess students' progress toward their attainment. Even more recently, a national summit of the
nation's governors similarly affirmed the need for their states to establish high standards for and
rigorous assessment of student accomplishment (NGA, 1996).
What's new in assessment, thus, is not a belief in its power or necessity but rather the types
of assessment which are being used and the explicit policy and practical purposes those assess-
ments are expected to serve. Thirty years ago in the United States, assessment meant norm-
referenced testing, and an exclusive reliance on multiple choice measures which ranked students,
schools, and locales on general skill areas relative to one another. Fifteen or so years later, when
minimum competency tests were added to the mix, they too were primarily multiple choice, but
tended to be criterion-referenced rather than norm-referenced. That is, these tests were designed
to assess whether students had mastered specific objectives and to describe how students performed
relative to expected competencies rather than in reference to the performance of others. Most
recently, there has been great enthusiasm for alternative assessments which ask students to cre-
ate their own responses rather than simply selecting them, assessments which many believe best

*The work reported herein was supported in part under the Educational Research and Development Center Program
cooperative agreement R117G10027 and CFDA Catalog number 84.117G as administered by the Office of Educational
Research and Improvement, U.S. Department of Education. The findings and opinions expressed in this article do not
reflect the policies of the Office of Educational Research and Improvement or the U.S. Department of Education.


represent the kinds of skills students will need for future success. And indeed today, 48 of the
50 states already conduct or are in the process of developing statewide assessment systems
to support state goals for education, and the great majority of these systems include both multiple choice and alterna-
tive-type assessments (Bond, Braskamp, & Roeber, 1996).
Such use of large scale assessment is neither unique to the United States nor of recent origin:
civil service examinations in China enjoy a 3000 year history (Dubois, 1966) and the British 11+
examination, which is used to determine students' transfer from primary to secondary education,
is among the most visible high stakes tests in the world (Egan & Bunting, 1991). What is unusual
in the United States situation is the use of large scale assessment to promote system account-
ability and improvement (Feuer & Fulton, 1994). Also unique is an apparently widespread public
dissatisfaction with student outcomes and public schooling which has fueled attention to account-
ability and demands for change (Keeves, 1994).

Alternative Assessment: A Rose by Many Names

Many terms have been advanced when discussing alternatives to conventional, multiple choice
testing. These include "alternative assessment," "authentic assessment," and "performance assessment";
the assessments themselves run the gamut from portfolios of student work or extended student projects that may
consume an entire school year to open-ended questions that resemble multiple choice test items from which
the response options have been omitted. In this chapter, the terms alternative, authentic and perform-
ance assessment are used more or less synonymously to mean variants of assessments that require
students to generate rather than choose a response. Alternative assessment by any name requires
students to actively accomplish complex and significant tasks, while bringing to bear knowledge,
recent learning, and relevant skills to solve realistic or authentic problems.
Alternative assessment has typically meant having students solve problems and/or compose a
response over a span of 20 minutes to a few classroom periods. Examples include having students
design a bookcase within given function, space and cost constraints and explain how their design
meets the given parameters (California State Department of Education, 1992); write a letter in
French comparing the benefits of living in the country and living in a big city (Carroll, 1975);
read historical documents and use the perspective in these documents along with
their prior knowledge to explain a major historical issue to a peer (Baker, Freeman, & Clayton,
1991); and make a presentation on their proposal for disposing of nuclear
waste, based on their knowledge of science and taking into consideration social, political, and
environmental issues (Herman, Osmundson, & Pascal, 1996).
Portfolios also have experienced popularity as a form of alternative assessment in a number of
statewide and local district systems. Typically in these systems, students are asked to select
examples of their work from that produced over the course of a semester or school year; the body
of work so assembled is then evaluated based on a standard scoring rubric. The mathematics
portfolios used in the state of Kentucky, for example, require students to select five to seven
pieces of their work that show the breadth and depth of their understanding of mathematics
concepts and principles. They are scored against a rubric that defines performance characteristics
for ratings of "novice," "apprentice," "proficient," or "distinguished."

Alternative Assessment as a Key to Reform

Prior research suggests both the potential and difficulties of using assessment as a tool to support
meaningful improvement in schools (Corbett & Wilson, 1991; Herman & Golan, 1991; Kellaghan
& Madaus, 1991; Koretz, Linn, Dunbar, & Shepard, 1991). History also shows the shortcomings of
using traditional, multiple choice tests to drive such improvement (Darling-Hammond & Wise, 1985;
Shepard, 1991; Herman & Golan, 1993). With these findings as background, current policy initia-
tives show continuing optimism in the power of assessment to support rigorous goals for academic
achievement. Continuing to cleave to a basic strategic model of accountability to force change, these
initiatives have identified the problems of history as residing in the test instruments themselves - -
the exclusive reliance on multiple choice testing and an often startling mismatch between the content
of traditional, standardized tests and many current goals for student performance. For example, only
26% of the Arizona Essential Skills were covered by the then-mandated standardized tests being used
statewide (Haladyna as reported in Smith, 1996). These skills represented high standards, complex
thinking, and integrated problem solving, mirroring the national standards being advocated by content
specialists across the country.
For current assessment policy, the solution lies in alternative forms of assessment that better
represent rigorous standards and the advanced knowledge and skills that students will need to be
successful and productive citizens - - abilities to access and use information, to solve problems,
to communicate. These alternative forms of assessment are also intended to provide good models
for instruction that will support meaningful learning, consistent with recent cognitive theory.

The Link to Learning and Cognition

According to cognitive researchers and theorists, meaningful learning is reflective, construc-
tive, and self-regulated (Glaser & Silver, 1994). People are not mere recorders of factual informa-
tion but are creators of their own unique knowledge structures. To know something is not just to
have received information but to have interpreted it and related it to other knowledge one already
has. In addition, the importance of knowing not just how to perform, but also when to perform
and how to adapt that performance to new situations is being recognized. Thus the presence or
absence of discrete bits of information, which typically has been the focus of traditional multiple-
choice tests, is not of primary importance in the assessment of meaningful learning. Rather, what
is highly valued is how and whether students organize, structure, and use that information in
context to solve complex problems.
Recent studies of the integration of learning and motivation also highlight the importance of
affective and metacognitive skills in learning (Garcia & Pintrich, 1994; Weinstein & Meyer, 1991).
This research suggests that poor thinkers and problem solvers differ from good ones not so much
in the particular skills they possess as in their failure to use them in certain tasks. Acquisition of
knowledge and skills is not sufficient to make one into a competent thinker or problem solver.
People also need to acquire the disposition to use the skills and strategies, the knowledge of
when to apply them, and the ability to learn from their experiences. These too have been
incorporated into alternative assessments that almost always require that students plan, organize,
and execute complex tasks; attention to these dispositions and metacognitive skills also is evident
in portfolio assessments which ask that students reflect upon their work.
The role of the social context of learning in shaping students' cognitive abilities and disposi-
tions also has received attention over the past several years, and it too has been incorporated
into some alternative assessments. Groups are thought to facilitate learning in several ways,
e.g., by modeling effective thinking strategies, scaffolding complicated performances, provid-
ing mutual constructive feedback, and valuing the elements of critical thought (Resnick &
Klopfer, 1989; Slavin, 1990). That real life problems often require teams to work together in
problem solving situations has provided additional rationale for including group work in
alternative assessments.

Modeling Good Instruction

That alternative assessments often explicitly model what is thought important in good instruc-
tion adds a significant new twist to the role and function of assessment in school reform. Not
only is accountability assumed to motivate systems and the individuals within them to change
and improve their performance, but the assessments themselves are intended to communicate
how to change. Negative findings of the past regarding teachers' test preparation practices - -
that teachers have students directly practice test-like activities - - have been turned to the posi-
tive. The assessment provides tasks that are instructionally valuable and promote learning, and if
teachers mimic such tasks in their practice, they will likely improve the quality of their instruc-
tion.

Making Alternative Assessment Good Measurement

Alternative assessments do pose significant research and development problems to assure their
validity as measures of student performance. Face validity - - that an assessment appears to tap
the complex thinking and problem solving skills that are intended to be assessed - - is not suf-
ficient to assure accurate measurement. As one example, Schoenfeld (1991) shares the story of a
New York teacher who was given high awards for his students' advanced performance on the
Regents Examination. On the examination students were asked to complete a series of what were
ostensibly complex geometry proofs requiring complex thinking and problem solving. It turned
out that the teacher had determined which proofs were likely to appear on the exam and had
drilled his students in how to solve them. As a result, despite what the assessment "looked" like,
it was not possible to draw inferences from it about these students' understanding of geometry
and their ability to apply complex geometric concepts and principles, because for them, the
examination questions were exercises in rote learning.
Basic to good student assessment is the notion that assessment results represent important
knowledge and/or skills, broader than the specific task(s) which happen to be chosen for assess-
ment. In a good assessment - - regardless of the type of measure - - test performance generalizes
to a larger domain of knowledge and/or skills and thus enables one to make accurate inferences
about students' learning and accomplishments. For example, when an assessment asks students
to conduct an experiment to determine optimal environmental conditions for sow bugs to thrive,
the interest is not so much in whether the students can create good conditions for sow bugs as
whether students can apply their knowledge of biology and the scientific method to solve problems.
The test performance is expected to represent something more than the specific object included
on the assessment and the specific time and testing occasion on which the assessment was
administered.
Validity is the term the measurement community has used to characterize the quality of an
assessment. At the simplest level, validity refers to the extent to which scores derived from an
assessment accurately reflect the knowledge and/or skills the test is intended to measure. For
traditional multiple-choice measures, concerns for validity have focused on issues of reliability
and patterns of relationships that suggest whether the assessment is tapping the intended construct
and whether it provides accurate information for specified decision purposes. For example, does
a student's performance on a standardized test of problem solving coincide with classroom
observations of his or her capability and with his or her success in subsequent courses emphasiz-
ing problem solving? Does using test results as part of the evidence for placement yield accurate
decisions? Any test in and of itself is neither valid nor invalid; rather current theory requires that
evidence as to the accuracy of that assessment for particular purposes be accumulated.
While these technical concerns are still critical, Linn, Baker, and Dunbar (1991)
have called for an expanded set of eight criteria for judging the quality of an assessment. They
are the consequences of the assessment, fairness, transfer and generalizability of the results, cogni-
tive complexity, content quality, content coverage, meaningfulness, and cost and efficiency. These
criteria form the basis for the discussion to follow.

Issues and Status in Establishing Technical Quality of Alternative Assessments

Reliability is a necessary but not sufficient condition for the technical quality of any measure.
In order to provide meaningful information for any purpose, the measurement instrument must
provide consistent results. Suppose, for example, a student takes a science problem solving assess-
ment today and a parallel assessment tomorrow or next week, and the student has not studied,
been taught science, or had any related experiences in the interim. One would expect the student's
score to be similar on both occasions because the underlying science capability has not changed.
If, however, the scores are quite different on the two occasions, the measurement does not have
a consistent meaning on both occasions and thus does not provide trustworthy information.

Reliability of Scoring

Because constructed responses are a defining feature of alternative assessment, scoring requires
humans, not machines, to examine (e.g., read) and judge the quality of response. Reliability of
scoring thus is the base issue in reliability - - one which hardly occurs in the automated world of
multiple choice testing, and yet is the foundation upon which all other decisions about technical
quality rest. Raters judging student performance should be in basic agreement as to what scores
should be assigned to students' work, within some tolerable limits (which measurement experts
report as "measurement error"). Do raters agree on how an assessment ought to be scored? Do
they assign the same or nearly similar scores to a particular student's response? If the answers to
these questions are not affirmative, then student scores are a measure of who does the scoring
rather than the quality of the work.
While not without challenge, assuring the reliability of scoring is an area of relative technical
strength in performance assessment. Largely from research on writing assessment, considerable
knowledge has accumulated about how to reliably score essays and other open-ended responses.
As summarized by Baker (1991), the literature shows that: (a) raters can be trained to score open-
ended responses consistently, particularly if well documented scoring rubrics and benchmark or
anchor papers exemplifying various score points are used, and scorers are given ample opportuni-
ties to discuss and apply the rubric to student response samples requiring increasingly complex
discriminations; (b) applying systematic procedures throughout the scoring process can help to
assure consistency of scoring, e.g., procedures such as having raters qualify for scoring by
demonstrating their consistency, conducting periodic reliability checks throughout the scoring
period, and retraining scorers as necessary; and (c) rater training reduces the number of required
ratings and costs of large-scale assessment (p. 3). Studies Baker reviewed from the performance
assessment literature in the military further support the feasibility of large-scale performance
assessments, involving tens of thousands of examinees, and the feasibility of assessing complex
problem solving and team or group performance. International studies similarly show that it is
feasible - - although not without challenge - - to use alternative assessments in cross national
studies (Wolf, 1994).
But reliable scoring can be difficult to achieve. The alternative assessment trials which have
been undertaken over the past several years provide supportive data on some of these feasibility
issues, but also document the challenges of reaching agreement in new areas of performance. At
one extreme, the direct writing assessment of the Iowa Test of Basic Skills demonstrates that it is
possible to achieve very high levels of agreement - - better than 90% exact agreement on student
scores - - with highly experienced, "professional" raters, tightly controlled scoring conditions,
and scoring criteria which have an established history (Hoover & Bray, 1995). At the other end
of the spectrum, in Vermont's early experiments with portfolios and Arizona's recent integrated
assessment program, there was insufficient reliability to permit the public release and intended
use of the assessment system results (Koretz, McCaffrey, Klein, Bell, & Stecher, 1993; Smith,
1996). In the Vermont case, which used as scorers teachers from throughout the state, the percent-
age of exact rater agreement in the first year of statewide implementation was essentially that
which would be expected by chance alone (Koretz et al., 1993). Correlations between the scores
given by different raters to the same pieces of student work, a second way of looking at rater
reliability, were similarly discouraging. They ranged from .28 to .60, depending on whether
individual dimension scores, or aggregate (overall summary) scores were the units of analysis.
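The agreement indices reported in these studies are straightforward to compute. A minimal sketch, assuming two raters have scored the same set of student responses on a common scale, is given below; the function name and inputs are hypothetical.

```python
import numpy as np

def rater_agreement(scores_a, scores_b):
    """Simple indices of scoring reliability for two raters (illustrative).

    scores_a, scores_b : equal-length arrays of the scores the two raters
                         assigned to the same pieces of student work
    Returns percent exact agreement, percent agreement within one score
    point, and the inter-rater correlation.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    exact = float(np.mean(a == b)) * 100.0
    within_one = float(np.mean(np.abs(a - b) <= 1)) * 100.0
    r = float(np.corrcoef(a, b)[0, 1])
    return exact, within_one, r
```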
The Vermont case is telling not only in demonstrating the difficulties of achieving reliability
of scoring in the early years of a new assessment, but also in the potential conflict in various
purposes of large scale, alternative assessment. Vermont's statewide portfolio assessment was
intended to serve an accountability purpose as well as a teacher capacity building and instructional
improvement purpose. In order to serve the first purpose reliability of scoring is a high priority,
because without such consistency, scores cannot be reported at the school, district, state, or regional
levels. But in order to serve the second purpose, Vermont wanted to involve as many teachers as
possible across the state in the scoring process. The two purposes thus pulled in two different
directions: tightly controlled scoring with highly experienced raters to serve accountability uses
versus a highly inclusive process involving as many teachers as possible in the scoring to support
new instructional practices.
Available data suggest that problems in achieving reliability in scoring decrease over time. As
bugs in rubrics are worked through, procedures are fine-tuned, and a core of knowledgeable,
experienced raters is developed who in turn can support the development of consensus on
standards for scoring with a widening pool of educator/scorers, reliability of scoring increases.
The Pittsburgh portfolio assessment experience provides an example which demonstrates the
power of consensus derived from common understandings of curriculum and instruction priorities
(LeMahieu, Gitomer, & Eresh, 1994). Students in grades 6 through 12 from across the Pittsburgh
school district created portfolios by selecting four writing samples from those they completed
over the course of a year. The four pieces were selected according to a set of guidelines and
included an important piece, a satisfying piece, an unsatisfying piece, and a free pick.
The scoring rubric emerged from a decade of discussions of student writing, conducted first in
the context of developing new approaches to curriculum and instruction and then in a three-year dialogue
on rubric development (Camp, 1990, 1993; Gitomer, 1993). The discussions produced a rich
evaluative framework which was commonly understood by participants, and most of the scorers
for the portfolio assessment had been participants in this process. The rubric featured three dimen-
sions: accomplishment in writing; use of processes and strategies for writing; and growth, develop-
ment, and engagement as a writer. Despite the great variety of student-selected work, scoring
reliabilities ranged from .84 to .87 at the middle school level and from .74 to .80 for high school
students. At the middle school level, raters agreed within one score point over 95% of the time
(agreements at the high school level ranged from 87%-91%), well above what would be expected
by chance.
The authors credit the effort's success to the strength of the shared interpretative framework,
carefully nurtured and developed over time in the course of continuing, critical conversations. At
the same time, researchers studying the Advanced Placement Studio Art assessment (Myford &
Mislevy, 1995) note the difficulty of establishing such a shared interpretative framework.

Generalizability of Performance

While history shows that reliability of scoring is a tractable problem, the literature makes clear
the difficulties of generalizing from performance on specific measures to inferences about student
learning in larger domains. Student performance appears very sensitive to changes in assessment
format, meaning that the context in which you ask students to perform - - as much as the student
learning which is the object of assessment - - influences the results obtained. In this regard, Shavel-
son and his colleagues (Shavelson, Baxter, & Pine, 1991) examined how students' performance
on science experiments compared with their performance on computer simulations of the same
experiments and with their journal entries from their laboratory work, all intended to measure the
same aspects of science problem solving. Similarly, Gearhart, Herman, Baker, and Whittaker
(1993), in a study of portfolio assessment, compared how students' performance in writing was
judged when based on their writing portfolios, their classroom narrative assignments, and their
responses to a standard narrative prompt, with all three assessments again intended to measure
students' writing capability. The results from both studies showed substantial individual varia-
tion across the various assessment tasks. For example, two-thirds of the students who were clas-
sifted as "capable" on the basis of portfolios were not so classified on the basis of the direct
writing prompt (Herman, Gearhart, & Baker, 1993). Likewise, doing well based on observations
of laboratory work did not predict good performance on the simulation.
Predictably, however, when tasks are tightly defined and the questions strictly parallel except
for format, results are more consistent. Sugrue and her colleagues (Sugrue, 1996) used carefully
crafted tasks to investigate the components of science problem solving in various formats.
Students' performance was similar regardless of whether the assessment task was a hands-on
problem or a paper-pencil facsimile of the problem - - that is, regardless of whether students
worked with actual batteries, wires, and lightbulbs to create a circuit, or simply were presented
with visual representations of these objects, their written explanations were similar. However,
performance on tasks eliciting different types of performance, for example, selecting the right
combination of materials to achieve the brightest light versus explaining why that combination
achieved such a result, produced different results.
How many tasks are needed to get a reliable estimate of students' learning? Generalizabil-
ity theory has been used to examine this question in a number of content areas, including
writing, mathematics, and science (Dunbar, Koretz, & Hoover, 1991; Linn, Burton, DeSte-
fano, & Hanson, 1995; Moerkerke, 1996; Shavelson et al., 1991; Shavelson, Baxter, & Gao,
1993). Sources of measurement error in student scores have been analyzed, looking at the
variability attributable to raters and tasks as well as the interaction of raters, tasks, and students,
to estimate the combination of number of raters and tasks that are needed to produce suf-
ficiently reliable results. The research has consistently shown that although raters are an
important source of measurement error, variability due to task sampling (i.e., the particular
tasks included on the test) is a far greater problem. Furthermore, individual student's perform-
ance varies across and interacts with tasks - - that a student does well on one science problem
solving task, for example, does not mean he or she will do well on a second problem solving
task. Also, the students who perform well on one task may well not be the same students who
perform the best on the second task.
Returning to the question of the number of tasks needed to obtain a reliable estimate of student
performance, the results vary somewhat depending on the specific study. When items are not
tightly specified, from eight to 20 tasks are needed to obtain reliable individual student estimates,
with most studies in the 15 to 17 range (Dunbar et al., 1991; Linn et al., 1995; Shavelson et al.,
1991; Shavelson et al., 1993; Shavelson, Mayberry, Li, & Webb, 1990). And these numbers of
tasks only achieve about a .8 level of reliability, a level considered minimum by some but one
which risks significant reclassification errors (Rogosa, 1994).
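A rough sense of why so many tasks are required can be obtained from the Spearman-Brown projection, treating tasks as parallel forms. This is a simplification of the generalizability analyses cited above, which separate rater, task, and interaction error rather than assuming strictly parallel tasks.

```python
def tasks_needed(single_task_reliability, target=0.80):
    """Projected number of parallel tasks needed to reach a target score
    reliability, via the Spearman-Brown formula (round up in practice).

    A simplification of the cited generalizability analyses, which model
    several error sources rather than treating tasks as parallel forms.
    """
    r1 = single_task_reliability
    return target * (1 - r1) / (r1 * (1 - target))

# tasks_needed(0.20) is about 16, broadly consistent with the 15-to-17
# task range reported in several of the studies cited above.
```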
Furthermore, Shavelson et al. (1993) remind us of the problem of great variability in perform-
ance across topic areas within a given discipline (e.g., numbers, operations, measurement within
mathematics; chemistry, biology, and physics within science). At least ten different topic areas may
be needed to provide dependable measures of a student's performance in one subject area.
Careful specification procedures, which set parameters for the nature of the tasks, their content
and format, and use common scoring schemes, can substantially reduce the number of items
needed to achieve minimum reliability. Based on a standard specification to generate explanation
tasks, Baker's (1994) results suggest that only three to five items would be needed to produce
stable estimates of an individual's understanding of history. Moerkerke (1996) also found that
the use of specifications enabled developers to produce parallel assessments that had similar dif-
ficulty levels.
Generalizability of measurement, as reviewed above, is one approach to examining the depend-
ability of a measure. But what about the case where assessments are to be used to make decisions
about whether students have achieved a particular standard and when achieving the standard has
important consequences for students, such as determining school graduation or university entry?
Here the technical accuracy of the decision becomes a critical issue. Recognizing that any score
is only an estimate of a student's actual capability, Linn et al. (1995) used the concept of standard
error of measurement (SEM) to examine this question. The researchers asked, "How many tasks
would it take to be 95% confident that a decision is correct?" With student performance classi-
fied on a four-point scale and three used as the "cut score," the technical question concerns the
number of tasks needed to achieve a SEM of .25 or less, given that a student's true score likely
(95% probability) falls within two SEM of the observed score. Using the New Standards Project
mathematics trial data,* Linn et al. (1995) concluded that from nine to 25 tasks would be required
to achieve the 95% confidence level.

*New Standards is attempting to create an assessment system to certify whether students have achieved rigorous
academic standards, and project leaders intend the results to be used for graduation, college admission and job entry,
among other things.
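
The arithmetic behind this kind of standard-error argument can be sketched in a few lines. The sketch below is a simplified illustration only (it assumes independently sampled tasks and hypothetical task-to-task standard deviations); it is not a reconstruction of the Linn et al. (1995) analysis.

import math

# If a student's average over n independently sampled tasks is used for a
# cut-score decision, the standard error of that average is sd_tasks/sqrt(n).
# Keeping a 95% band (roughly +/- 2 SEM) within half a score point on the
# four-point scale requires SEM <= 0.25.

def tasks_needed(sd_tasks, target_sem=0.25):
    # Smallest n with sd_tasks / sqrt(n) <= target_sem.
    return math.ceil((sd_tasks / target_sem) ** 2)

for sd in (0.75, 1.00, 1.25):   # hypothetical task-to-task standard deviations
    print(f"sd = {sd:.2f} -> {tasks_needed(sd)} tasks")

With hypothetical task-level standard deviations between 0.75 and 1.25 score points, the required number of tasks runs from 9 to 25, the same order of magnitude as the range Linn et al. report.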

Stability Over Time

Whether the analyses are based on generalizability theory or SEM, the underlying problem is
similar: there is great variability in student performance across tasks. This not only raises important
issues for interpreting individual scores at one point in time, but also raises particularly
thorny challenges when examining student progress or comparing results from year to year. While
we know that using different tasks will influence performance estimates, concerns for memora-
bility of tasks, test security, and practice effects make it inappropriate to use the same tasks to
make such comparisons. Solutions to these vexing comparability, equating, and progress assess-
ment issues are under study (Bryk & Raudenbush, 1992; Muthén et al., 1995; Rogosa & Sanger,
1995; Seltzer, Frank, & Bryk, 1994). Adding their views from a non-technical perspective, educa-
tors in Maryland and Kentucky question the validity of test score gains reported in both those
states (Koretz, Barron, Mitchell, & Stecher, 1996; Koretz, Mitchell, Barron, & Keith, 1996).

The Meaning of Assessment Results

Analyses of generalizability and standard errors of measurement speak to the dependability and
consistency of scores, but do not directly address whether the results have the meaning that was
intended. Just because a measure intends to measure, say, mathematics problem solving does not
mean its results can be so interpreted. In fact, a number of observers have noted a paucity of
significant disciplinary content in many performance assessments (Wolf, 1992; Herman, 1996).
Glaser and Baxter have developed a framework and an explicit methodology for examining
the match between an assessment's intentions and the nature of cognition actually assessed (Bax-
ter, Elder, & Glaser, 1996; Baxter & Glaser, 1996). Guided by the expert-novice literature and
current understandings of the relationship between competence and quality of cognitive activity
(Glaser, 1991), their framework highlights four types of cognitive activity which differentiate
different levels of competence: problem representation, solution strategies, self monitoring, and
explanation. Applying the framework to a small sample of science problem solving assessment
tasks and using protocol analysis, observation, and analysis of work, the researchers determined
the nature and quality of cognitive activity actually elicited by the tasks and then compared it to
the objectives of the test developers. In essence they asked, "If a task claims to measure complex
science problem solving, is there evidence that the scores represent the level of students'
competence in this specific instance?"
Three prevalent types of situations were identified. In the first type, the tasks elicited appropri-
ate cognitive activity and the nature and quality of observed cognitive activity (i.e., problem
recognition, solution strategies, self monitoring, and explanation) correlated well with student
performance on the task. In the second type, the tasks elicited appropriate cognitive activity but
the scores did not match the level of observed activity, either because the scoring system was not
aligned with task demands or was not sensitive to the differential quality of cognition in students'
performances. The third type included tasks where students could bypass the intended cognitive
aspects, i.e., tasks ostensibly measuring problem solving which students could answer without
engaging in any intended cognitive activity.
It is noteworthy that in two of the three situations the results are invalid - - that is, the scores
do not support valid inferences about students' problem solving capability. That all of the assess-
ments analyzed by Baxter and Glaser were part of prominent pilot programs being administered
in large numbers underscores the need for attention to content quality and cognitive validity.
Their framework has implications for assessment development as well as for revision and valida-
tion of results.

Issues and Status of Fairness in Alternative Assessment

An essential ingredient in any assessment, fairness requires attention to a variety of measurement
and use issues. Historically, concerns for equity and fairness have centered on assuring
objectivity and avoiding bias. Students' scores should be a function of their learning, not a func-
tion of who the students are, their gender, ethnic or cultural background or other personal or
social characteristics which are irrelevant to what is being assessed. Ensuring that no students
get special advantage, in fact, is an important reason for developing the standardized administra-
tion and scoring of multiple choice tests (Cronbach & Suppes, 1969). Considerable attention has
been given to sensitivity reviews to guard against potentially offensive or culturally unfair test
content and to statistical techniques, such as differential item functioning, to detect possibly biased
items (e.g., Camilli & Shepard, 1994).
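
To illustrate the form such a statistical check can take, the sketch below computes a Mantel-Haenszel differential item functioning index for a single item, comparing a reference and a focal group after matching examinees on total score. All counts in the table are invented for illustration; the procedure itself is the standard Mantel-Haenszel approach described in sources such as Camilli and Shepard (1994).

import math

# Each stratum is a total-score level; counts are
# (reference correct, reference incorrect, focal correct, focal incorrect).
# All counts are invented for illustration only.
strata = [
    (30, 20, 22, 28),
    (45, 15, 35, 25),
    (50, 10, 44, 16),
]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den                   # common odds ratio across score levels
mh_d_dif = -2.35 * math.log(alpha_mh)  # ETS delta-scale D-DIF index

print(f"MH common odds ratio: {alpha_mh:.2f}")
print(f"MH D-DIF: {mh_d_dif:.2f} (negative values flag items favoring the reference group)")

Items flagged by a large absolute D-DIF value would then be reviewed substantively before any decision to revise or drop them.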

Objectivity and Bias

Attention to reliability of scoring, described earlier, represents similar concerns for objectivity
and avoiding bias in scoring - - assuring that the score reflects the performance and not who does
the scoring or other features or characteristics which are irrelevant to what is being assessed. Because
alternative forms of assessment generally require human judgment for scoring, additional sources
of bias may creep into the scoring process, and there need to be safeguards against these. For
example, the Pittsburgh portfolio assessment project previously mentioned (LeMahieu et al., 1994)
explicitly examined the effects of gender and race of both scorers and students on scoring. With
regard to gender, the researchers found that females' performance was scored higher than males',
and that female scorers tended to give higher scores than their male counterparts, but importantly
there was no interaction between the sex of the rater and the sex of student (i.e., female scorers
did not score female students differentially higher than male students or vice versa). Similarly,
with regard to race, while White students received higher ratings than African American students
and African American raters gave lower scores than White raters, the race of the rater did not
interact with the race of the student.
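
One common way to probe such rater-by-student questions is a two-way analysis of variance in which the interaction term carries the question of interest. The sketch below uses simulated scores and invented column names; it illustrates the general shape of such a check and is not a reconstruction of the Pittsburgh analysis.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "rater_sex": rng.choice(["F", "M"], n),
    "student_sex": rng.choice(["F", "M"], n),
})
# Simulated portfolio scores with main effects for both factors
# and no rater-by-student interaction built in.
df["score"] = (3.0
               + 0.2 * (df["rater_sex"] == "F")
               + 0.3 * (df["student_sex"] == "F")
               + rng.normal(0, 1.0, n))

model = smf.ols("score ~ C(rater_sex) * C(student_sex)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # the interaction row tests rater x student sex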
Differences in performance according to race or gender (depending on the subject area) are
not unique to performance assessment. Results from traditional standardized tests historically
have shown substantial gaps between the performance of Whites and that of economically
disadvantaged minority students and, for some subject areas, between boys and girls. Bolger and
Kellaghan (1990), for example, found that boys outperformed girls on multiple choice tests of
mathematics, Irish, and English achievement, but interestingly found no such differences in short
answer assessments of each subject area.
The analytical question is whether observed differences reflect actual difference in competence
or some bias in the assessment situation that unduly advantages some groups over others. Does
the assessment put minority students at a disadvantage relative to their majority counterparts
because of cultural content which is not essential to the skills and understandings that the assess-
ment is intended to measure? Similarly, does some non-essential aspect of the task give unfair
advantage to boys over girls or vice versa? Winfield and Woodard (1995) have warned that standard-
ized performance assessments are at least as likely as current traditional measures to disadvantage
students of color. They worry that, because time requirements will limit the number of tasks
chosen for assessment, it is likely that the tasks selected will be those more familiar to middle-
class, Caucasian students.
In the minds of teachers, performance assessments do unfairly disadvantage some students
(Koretz, Barron et al., 1996; Koretz, Mitchell et al., 1996; Smith, 1996). Of particular concern
are the language demands of many alternative assessments which often ask that students engage
in significant reading and/or writing even though the object of measurement is not language arts
skills. An example is a mathematics problem solving task which requires reading ability to
comprehend the problem. Do the reading demands detract from some students' ability to
demonstrate their mathematics learning? Virtually all teachers in Maryland's statewide assess-
ment program reported that the emphasis on writing made it difficult to judge the mathematical
competence of some students (Koretz, Mitchell et al., 1996). At particular risk are students with
limited English proficiency (August, Hakuta, & Pompa, 1994).

Bias in Opportunity to Learn

A number of observers also have highlighted fairness issues stemming from students'
"opportunity to learn" that which is assessed (Darling-Hammond, 1995; Herman & Klein, 1996;
Linn et al., 1991). Their concerns are particularly acute in high stakes assessments where results
carry serious consequences for students and schools. It is unfair, for example, to hold students
accountable for achieving standards for which they have had little or no instructional exposure.
Similarly it is unfair to use assessment results to compare schools or students - - for example to
determine which schools are the best, which students are to be admitted to college, gain job
entry, etc. - - when the assessments are well aligned with the curriculum for some students and
in some schools but quite inconsistent with the instruction that is provided to and in others.
The equity issues are compounded because there is good reason to believe that economically
disadvantaged, minority students are likely to have less access to the kinds of curriculum that
would prepare students to do well on performance assessments. These are the students who most
likely have been subjected to a "drill and skill" basic skills curriculum, driven by strong account-
ability pressure in their schools to improve scores on traditional standardized tests (Darling-
Hammond, 1995; Herman & Golan, 1993; Shepard, 1991). There also is evidence that children
in economically disadvantaged communities are less likely than their more advantaged suburban
peers to have available some of the resources which are essential to instructional opportunity
(e.g., teachers possessing appropriate subject matter background and instructional materials in
line with new curricular thinking) (Herman & Klein, 1996). Smith (1994) similarly observed that
districts serving poor children are less likely to have the professional development resources to
support teachers' capacity to support the kinds of learning valued by new forms of assessment.

Comparability and Equity in Portfolio Assessment

Portfolio assessment in particular makes apparent these problems of equity in opportunity and
comparability of results. In recent years, large scale portfolio assessment has gained popularity
because of its potential to bridge the worlds of public accountability and classroom practice and
as the ultimate example of assessment productively integrated with instruction. In contrast to
more contrived "drop from the sky" assessments (Hoover, 1996), portfolios are the products of
on-going classroom work. Targeted on agreed upon standards for student performance but still
permitting teachers and students a great deal of flexibility and choice, portfolio assessments chal-
lenge teachers and students to focus on meaningful work, support the assessment of long-term
projects over time, encourage student-initiated revision, and provide a context for presentation,
guidance, and critique of student progress.
These very strengths, however, may lead to a number of measurement weaknesses (Gearhart
& Herman, 1995). Obvious problems of comparability arise from the variability and generaliz-
ability of tasks which are included in students' portfolios and the differential conditions under
which the work was produced. Findings from the Vermont statewide portfolio assessment
indicated, for example, that teachers vary widely in their classroom portfolio practices. There
was substantial variation in the amount of time students spent on their portfolios and on revision.
Some teachers encouraged revision while others discouraged it; still others required it. Policies
on feedback and support similarly were variable; in some classrooms, getting feedback from
others was permissible, in others it was explicitly forbidden (Koretz, Stecher, Klein, McCaf-
frey, & Deibert, 1993; Koretz, Stecher, Klein, & McCaffrey, 1994).
Perhaps most vexing among these challenges to comparability of results is summed up by the
words of a Vermont teacher after having scored portfolios for several days, "Whose work is this
anyway?" The question naturally arises because portfolios contain the products of classroom instruc-
tion and good classroom instruction, according to current theory, engages communities of learners in
a supportive learning process (Camp, 1993; Wolf, Bixby, Glenn, & Gardner, 1991). Thus, under
optimal instructional conditions, the products being assessed are not the result of a single individual
but rather of an individual working in a social context. The better the instructional process, the more
an individual student's work is likely to have benefited from the work of others.
How to infer an individual's competence from such supported performance is one important
aspect of the problem, but perhaps more important is the differential support which students receive
within and across classrooms. While the Vermont study documents differential help with portfolio
work across classrooms, Gearhart et al. (1993) reported substantial variation within classrooms
as well. The researchers asked teachers such things as how much structure or prompting they
provided individual students, what types of peer or teacher editorial assistance occurred, and
what resources and time were available for portfolio work. Patterns differed across teachers. Within
classrooms, not surprisingly, teachers tended to provide more help to lower achieving students.
What this means is that the quality of a student's work is a function of variable and potentially
substantial external assistance (another source of measurement error) as well as of the student's
learning. As a result, the validity of inferences that can be drawn about individual student competence
based solely on portfolio work is highly suspect. Because for some students the work has been
highly assisted while other students have received little assistance, comparisons of students' learn-
ing based on such work may be unfair.

Consequences of Alternative Assessment

Despite the many technical challenges, alternative assessment does appear to have value in
supporting instructional reform, based on accumulating evidence from implementation studies.
McDonnell and Choisser (forthcoming), for example, conducted comparative case studies in
Kentucky and North Carolina to analyze the extent to which the assessment systems in these two
states promoted the goals of reform and encouraged classroom teaching practices consistent with
those goals. Despite the fact that the two systems were quite different in assessment design and
incentive structure, McDonnell and Choisser found that educators in the two states had similar
reactions to their state assessment programs.
Teachers and principals took the assessments very seriously, were generally supportive of the
reform goals embodied by the assessments, but were ambivalent about the assessments themselves.
Teachers saw value for students in that the tests encouraged teachers to engage students in activi-
ties such as writing and problem solving that otherwise would be absent or less frequent in the
curriculum, and they viewed the assessments as more complete measures of student accomplish-
ment than previous tests. At the same time, they questioned the validity of the assessments for
some students and were concerned about the stress the assessments place on them and their
students (see also Koretz, Barron et al., 1996). McDonnell and Choisser's analysis of instructional
artifacts, teacher logs, and surveys also showed that the content and process of teachers' classroom
practices generally conformed with the goals of the reform, though in both states there was
evidence that teachers lacked thorough understanding of the meaning of the reform and the specific
kinds of learning which were required by the assessments.
Similar pictures of mixed support and generally beneficial impact on curriculum and instruc-
tion emerge from studies of statewide systems in Maryland and Vermont (Koretz, Mitchell et al.,
1996; Koretz et al., 1993). Koretz, Mitchell et al. (1996), however, pointed out that from some
vantage points there is negative spillover from some of the positive influences on curriculum.
For example, while Maryland teachers reported instructional changes consonant with the goals
of the state reform, these changes also meant less time spent on areas not assessed. Some teachers
worried about lack of attention to basic skills such as grammar and number facts. Vocal parents
and community members sometimes share these concerns.
Aschbacher's (1994) research shows that teachers' involvement in the development and
implementation of alternative assessments has diverse, positive influences on teaching practices,
at least when combined with training and follow-up technical support. Clark and Stephens (1996)
documented the similar effects of assessment in Australia. Their study shows that the implementa-
tion of the Victoria Certificate of Education supported systematic reform of mathematics educa-
tion and effectively influenced curriculum and teaching.
Mary Lee Smith's case studies on the consequences of the Arizona Student Assessment Program
(ASAP) remind us that such changes may be highly dependent on the culture, philosophy, and
leadership and other conditions evident in individual schools (Smith, 1996). Her studies revealed
schools where changes were dramatic in direct response to ASAP, schools which were changing
anyway and would have done so with or without the program, and schools where no change
occurred nor was possible. In the first category was a school whose teachers were predisposed to
the constructivist goals implicit in ASAP but with little knowledge of how to pursue them. With
a supportive principal and resources for intensive professional development, these teachers were
able to improve teaching throughout the school. In another school, although some teachers were
similarly predisposed and even knowledgeable about intended changes, a "pervasive climate of
behaviorism" (p. 99) limited the impact. Nonetheless, these teachers were able to use the ASAP
mandate to advance their agenda and introduce change in the tested grade.
In contrast, very little change occurred in two other schools. One was geographically remote
and lacked resources for new materials or professional development. The other was permeated
by beliefs that their children were too poor, too limited in English language, and possessed too
little ability to profit from new curriculum and instruction goals. These latter two schools
demonstrate two critical variables which shape implementation outcomes: the will and the capac-
ity to change (McDonnell & Choisser, forthcoming).

Benefits Carry Costs

While the literature is promising with regard to the potential positive effects of alternative
assessment on curriculum and instruction, research also indicates the significant challenges and
time such systems entail. For example, a majority of principals interviewed in Vermont believed
that portfolio assessment generally had beneficial effects on their schools in terms of curriculum,
instruction and/or student learning and attitudes. At the same time, however, almost 90% of these
principals characterized the program as "burdensome," particularly from the perspective of its
demands on teachers (Koretz, Stecher et al., 1993). In nearly every project, in fact, concerns
about pressure on teachers and the pervasive demands on teachers' time have been reported (Asch-
bacher, 1993; Koretz, Mitchell et al., 1996; Koretz, Stecher et al., 1993).
The time demands of portfolio assessment programs appear particularly acute. The Vermont
study, for example, asking about only some of these demands, found that teachers devoted 17
hours a month to finding portfolio tasks, preparing portfolio lessons, and evaluating the contents
of portfolios; and 60% of the teachers surveyed at both fourth and eighth grades indicated they
often lacked sufficient time to develop portfolio lessons (Koretz, Stecher et al., 1993). Again,
these time estimates represent important opportunity costs for both teachers and students.

Economic Costs

Knowing which costs should be directly ascribed to assessment rather than to instruction or profes-
sional development is one of the issues that makes it difficult to estimate the costs of alternative
assessments. While conceptual models for analyzing the cost of alternative assessment and for
conducting cost-benefit analyses have been formulated (Picus, 1994; Catterall & Winters, 1994),
definitive cost studies are yet to be completed (see, however, Picus & Tralli, 1996). Nonetheless,
it is clear that compared to traditional multiple choice tests, the costs of development, administra-
tion, scoring, and reporting of alternative assessment are dramatically higher (Hardy, 1995; Hoover
& Bray, 1995). Koretz, Madaus, Haertel, and Beaton (1992), for example, estimated that Advanced
Placement exams, which typically take three hours and require extended essay responses, cost
approximately $65 per subject test, while commercial standardized tests cost from $2 to $5 per
subject test.
One area where direct comparisons are perhaps easiest is in the area of scoring. Catterall and
Winters (1994) estimated that the cost of scoring a 45-minute essay was between $3 and $5.
Similarly, Stecher (1995) estimated the cost of scoring a "hands on" science task comprising one
class period at $4 to $5 per student. In comparison, the complete battery of the Iowa Test of Basic
Skills, a nationally standardized multiple choice test, costs about $1 per student.
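
Translating these per-student figures into program-level amounts is simple arithmetic; the short sketch below does so for a hypothetical district of 10,000 tested students, using only the per-student scoring costs cited above and ignoring development, administration, and reporting.

students = 10_000  # hypothetical number of tested students

# Per-student scoring cost ranges (in dollars) taken from the figures cited above.
cost_per_student = {
    "45-minute essay": (3.00, 5.00),
    "hands-on science task": (4.00, 5.00),
    "multiple choice battery": (1.00, 1.00),
}

for label, (low, high) in cost_per_student.items():
    print(f"{label}: ${low * students:,.0f} to ${high * students:,.0f}")

Even at this modest scale, scoring a single open-ended subject costs roughly three to five times what the complete multiple choice battery costs, which is why the cost question looms so large in decisions about statewide adoption.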

Credibility and Public Support

As with any public policy, investment in alternative assessment depends on the support of the
public and its policy makers. Of late, some segments of the public have been vocal in their opposi-
tion and in some cases have been successful in derailing new assessment systems. While it has
become axiomatic to many educators that schools must emphasize complex thinking and problem
solving if students are to be well prepared for future success and life long learning and that good
instruction and good assessment alike are constructivist, parents and the community do not neces-
sarily agree.
A prominent national survey in the United States found that the public was very concerned
about students' basic skills and believed that schools should put "first things first" (Johnson &
Immerwahr, 1994). Public controversies over assessments in both Kentucky and California
(McDonnell, 1996) indicated that parents in some cases misunderstood the nature of the tests and
disagreed with underlying values. Opposition groups, for example, questioned the academic rigor
of the tests. Some were offended by the social and cultural agenda they saw underlying the assess-
ments because some assessment materials represented diverse viewpoints or were perceived as
encouraging children to question authority - - e.g., a language arts question "Think about a rule
at your school . . . that you think needs to be changed. Write a letter to your principal about the
rule you want changed" (McDonnell, p. 42). Others were concerned about whether the assess-
ments inappropriately intruded into family life and violated parents' rights because some ques-
tions asked students to reflect on events in their lives - - e.g., "Why do you think some teenagers
and their parents have problems communicating?" (McDonnell, p. 42).
While the political motives and wider agenda of some in the opposition might be open to
question, what is clear from these examples is that significant segments of the population did not
understand the aims or purposes of the new assessments and did not feel that their viewpoints
were considered in the development process. The diversity of opinion within the public - - about
what is important for students to know and be able to do and what are the goals of schooling - -
also is clear, as is the need to involve parents and the community actively in all phases of the assess-
ment process.

What Next?

The last decade has witnessed an explosion of interest in performance assessment in the United
States as a policy tool to support accountability and school improvement. Borne of great
enthusiasm and commitment to change, many in the United States have rediscovered what most
countries outside the United States have long understood: that multiple choice and other selected
response testing cannot be the sole basis for assessment systems and that essay and other open
response questions deserve an important role (Keeves, 1994). Set in a somewhat unique environ-
ment in which efficiency and psychometric qualities of multiple choice testing are valued and
in which fears abound about the litigation that might accompany high stakes assessments which
do not meet technical standards, educators in the U.S. have embarked on developmental efforts
to bring their vision of alternative assessment to reality.
As a result, the last five or so years have been a period of great experimentation which has
produced substantial knowledge about the strengths and challenges of alternative assessment for
accountability purposes. The consequences of using performance assessment in accountability
systems appear to be a clear strength: teachers and principals take the new assessments and the
goals they represent seriously and move to incorporate new pedagogical practices into their teach-
ing; teachers engage their students in the kinds of activities they see embodied in the assessment.
With appropriate professional development and supportive local context, alternative assessments
indeed can support meaningful change and improvement of practice.
The technical and logistical challenges, however, are daunting. Providing accurate, reliable
individual student results based on performance assessments alone seems a remote possibility.
Based on available evidence, the number of tasks required to get a stable individual estimate
makes it unlikely that any state or local system would be able or willing to
invest the necessary student time or financial resources. While available methods do make pos-
sible school-level estimates that can be used for a variety of purposes, educators and the public
alike clamor for individual results from their assessment systems. In the United States parents
want to know (and demand formal, comparable evidence about) what their children are learning;
students likewise want to know "how they're doing." Similarly, like teachers around the world,
teachers in the United States seek information they can use to understand individual students and
how best to support their learning.
The public controversies in several states and numerous communities underscore a diversity
of opinion about what children should know and be able to do and the types of assessments
which should be used to measure their accomplishments. The strength of sentiment for local
control of education and the difficulty of moving to more centralized systems which are taken for
granted elsewhere in the world are evident. The costs of new forms of assessment in current
times of fiscal austerity and public cutbacks also have given policymakers pause. In fact, two states
which were in the forefront of innovation in assessment - - Arizona and California - - have seen
their programs discontinued; several other states which were moving in that direction have had
their funding derailed. The policy engine which was steaming ahead just two years ago today
appears to have slowed a bit.
The last five years serve as a clear reminder of the complexities of moving from the simple
assumptions of policy to solutions which work in reality. The challenge of designing beneficial
assessment systems to accommodate multiple interests and to serve multiple purposes within
given constraints also is apparent. Past history makes evident the folly of relying solely on multiple
choice testing for accountability purposes, a lesson which most other countries had no need to
learn; more recent history suggests that alternative assessments alone - - at least at this stage of
their development and under the time constraints and costs which those in the United States are
willing to bear - - will not suffice. The obvious and sensible solution towards which most states
and local districts are now moving is an optimal combination of both. How to configure such
systems represents an important and ambitious research and development agenda for our future
work.
References

Aschbacher, P. R. (1993). Issues in innovative assessment for classroom practice: Barriers and facilitators (CSE Tech.
Rep. No. 359). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student
Testing.
August, D., Hakuta, K., & Pompa, D. (1994). For all students: Limited English Proficient students and Goals 2000.
Occasional Papers in Bilingual Education, 10(4) entire.
Baker, E. L. (1991, April). What probably works in alternative assessment. In Authentic assessment: The rhetoric and the
reality. Symposium conducted at the annual meeting of the American Educational Research Association, Chicago.
Baker, E. (1994). Learning based assessments of history understanding. Educational Psychologist, 29(2), 97-106.
Baker, E. L., Freeman, M., & Clayton, S. (1991). Cognitively sensitive assessment of subject matter: Understanding the
marriage of psychological theory and educational policy in achievement testing. In M. C. Wittrock & E. L. Baker
(Eds.), Testing and cognition (pp. 135-153). New York: Prentice-Hall.
Baxter, G., Elder, A. D., & Glaser, R (1996). Knowledge-based cognition and performance assessment in the science
classroom. Educational Psychologist, 31(2), 133-140.
Baxter, G., & Glaser, R. (1996). An approach to analyzing the cognitive complexity of science performance assessments.
Los Angeles, CA: University of California, Los Angeles Center for the Study of Evaluation.
Bolger, N., & Kellaghan, T. (1990). Method of measurement and gender differences in scholastic achievement. Journal
of Educational Measurement, 27, 165-174.
Bond, L. A., Braskamp, D., & Roeber, E. (1996). The status report of the assessment programs in the United States.
Oak Brook, IL: NCREL/Council of Chief State School Officers.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury
Park, CA: Sage.
California State Department of Education. (1992). California Assessment Program: A sampler of mathematics assess-
ment. Sacramento, CA: Author.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Newbury Park, CA: Sage.
Camp, R. (1990). Thinking together about portfolios. The Quarterly of the National Writing Project and the Center for
the Study of Writing, 12(2), 8-14.
Camp, R. (1993). The place of portfolios in our changing views of writing assessment. In R. E. Bennett, & W. C. Ward
(Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing,
and portfolio assessment (pp. 183-212). Hillsdale, N J: Erlbaum.
Carroll, J. B. (1975). The teaching of French as a foreign language in eight countries. New York: John Wiley.
Catterall, J., & Winters, L. (1994). Economic analysis of testing: Competency, certification, and "authentic" assess-
ments. CSE Tech. Rep. No. 383. Los Angeles, CA: University of California, Los Angeles Center for the Study of
Evaluation.
Clark, D., & Stephens, M. (1996). The ripple effect: The instructional impact of the systematic introduction of perform-
ance assessments in mathematics. In M. Birenbaum & F. J. R. C. Dochy (Eds.), Alternatives in assessment of achieve-
ments, learning processes and prior knowledge (pp. 63-93). Boston: Kluwer Academic.
Corbett, H. D., & Wilson, B. L. (1991). Testing, reform, and rebellion. Norwood, NJ: Ablex Publishing.
Cronbach, L. J., & Suppes, P. (1969). Research for tomorrow's schools: Disciplined inquiry for education. Stanford, CA:
National Academy of Education/Macmillan.
Darling-Hammond, L. (1995). Equity issues in performance based assessment. In M. T. Nettles & A. L. Nettles, (Eds.),
Equity and excellence in educational test and assessment (pp. 89-114). Boston, MA: Kluwer.
Darling-Hammond, L., & Wise, A. E. (1985). Beyond standardization: State standards and school improvement. The
Elementary School Journal, 85, 315-336.
Dubois, P. A. (1966). Test dominated society: China 1155 BC-1905 AD. In A. Anastasi (Ed.), Testing problems in perspec-
tive. Washington, DC: American Council on Education.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance
assessments. Applied Measurement in Education, 4, 298-303.
Egan, M., & Bunting, B. (1991). The effects of coaching on 11+ scores. British Journal of Educational Psychology, 61,
85-91.
Feuer, M. J., & Fulton, K. (1994). Education abroad and lessons for the United States. Educational Measurement: Issues
and Practice, 13(2), 31-39.
Garcia, T., & Pintrich, P. R. (1994). Regulating motivation and cognition in the classroom: The role of self-schemas and
self-regulatory strategies. In D. H. Schunk & B. J. Zimmerman (Eds.), Self-regulation of learning and performance
(pp. 127-153). Hillsdale, NJ: Lawrence Erlbaum Associates.
Gearhart, M., & Herman, J. (1995). Portfolio assessment: Whose work is it? Issues in the use of classroom assignments
for accountability. Evaluation Comment, Winter.
Gearhart, M., Herman, J., Baker, E., & Whittaker, A. (1993). Whose work is it? A question for the validity of large-scale
portfolio assessment (CSE Tech. Rep. No. 363). Los Angeles, CA: University of California, Los Angeles Center for
the Study of Evaluation.
412 J.L. HERMAN

Gitomer, D. H. (1993). Performance assessment and educational measurement. In R. E. Bennett & W. C. Ward (Eds.),
Construction versus choice in cognitive measurement (pp. 241-263). Hillsdale, NJ: Lawrence Erlbaum.
Glaser, R. (1991). Expertise and assessment. In M. C. Wittrock & E. L. Baker (Eds.), Testing and cognition (pp. 17-30).
Englewood Cliffs, NJ: Prentice-Hall.
Glaser, R., & Silver, E. (1994). Assessment, testing and instruction. Annual Review of Psychology, 40, 631-666.
Hardy, R. A. (1995). Examining the costs of performance assessment. Applied Measurement in Education, 8(2), 121-
134.
Herman, J. L. (1996). Technical quality matters. In R. Blum (Ed.), Performance assessment within the context of school
restructuring. Alexandria, VA: Association for Supervision and Curriculum Development (ASCD).
Herman, J. L., Gearhart, M., & Baker, E. L. (1993, Summer). Assessing writing portfolios: Issues in the validity and
meaning of scores. Educational Assessment, 1(3), 201-224.
Herman, J. L., & Golan, S. (1991). Effects of standardized testing on teachers and learning: Another look (CSE Tech.
Rep. No. 334). Los Angeles, CA: University of California, Center for the Study of Evaluation.
Herman, J. L., & Golan, S. (1993). Effects of standardized testing on teaching and schools. Educational Measurement:
Issues and Practice, 12(4), 20-25.
Herman, J. L., & Klein, D. C. (1996). Assessing equity in alternative assessment: An illustration of opportunity-to-learn
issues. Journal of Educational Research, 89, 246-256.
Herman, J. L., Osmundson, E., & Pascal, J. (1996). Evaluation of the Los Alamos National Laboratory Critical Issues
Forum. Los Angeles: University of California, Center for the Study of Evaluation.
Hoover, H. D. (1996, May). Practical and technical issues in the development of performance assessments in a large-
scale testing program. Paper presented at a meeting at the National Center for Research on Evaluation, Standards, and
Student Testing, UCLA, Los Angeles, CA.
Hoover, H. D., & Bray, G. B. (1995). The research and development phase: Can performance assessment be cost effec-
tive ? Paper presented at the annual meeting of the American Educational Research Association. San Francisco, CA.
Johnson, J., & Immerwahr, J. (1994). First things first: What Americans expect from the public schools. New York:
Public Agenda.
Kellaghan, T., & Madaus, G. (1991, November). National testing: Lessons for America from Europe. Educational Leader-
ship, 49, 87-93.
Keeves, J. (1994). Tests: Different types. In T. Husen & T. N. Postlethwaite (Eds.), International encyclopedia of educa-
tion, Second edition. Oxford, England: Pergamon.
Koretz, D., Linn, R., Dunbar, S., & Shepard, L. (1991). The effects of high stakes testing on achievement: Preliminary
findings about generalization across tests. Paper presented at the annual meeting of the American Educational Research
Association, Chicago.
Koretz, D. M., Madaus, G. F., Haertel, E., & Beaton, A. E. (1992). National educational standards and testing: A response
to the National Council on Education Standards and Testing. Santa Monica, CA: RAND.
Koretz, D., McCaffrey, D., Klein, S., Bell, R., & Stecher, B. (1993). The reliability of scores from the 1992 Vermont
Portfolio Assessment Program. (CSE Tech. Rep. No. 355). Los Angeles: University of California, Center for Research
on Evaluation, Standards, and Student Testing.
Koretz, D., Stecher, B., Klein, S., McCaffrey, D., & Deibert, E. (1993). Can portfolios assess student performance and
influence instruction? The 1991-92 Vermont experience (CSE Tech. Rep. No. 371). Los Angeles: University of
California, Center for Research on Evaluation, Standards, and Student Testing.
Koretz, D., Stecher, B., Klein, S., & McCaffrey, D. (1994). The Vermont Portfolio Assessment Program: Findings and
implications. Educational Measurement: Issues and Practice, 13(3), 5-16.
Koretz, D. M., Barron, S., Mitchell, K. J., & Stecher, B. M. (1996). Perceived effects of the Kentucky Instructional
Results Information System (KIRIS). Santa Monica, CA: RAND
Koretz, D. M., Mitchell, K. J., Barron, S., & Keith, B. (1996). Perceived effects of the Maryland State Assessment Program.
(CSE Tech. Rep. No. 406). Los Angeles, CA: University of California, Los Angeles Center for the Study of Evalua-
tion.
LeMahieu, P., Gitomer, D., & Eresh, J. (1994). Portfolios beyond the classroom: Data quality and qualities (MS # 94-01).
Princeton, NJ: ETS Center for Performance Assessment.
Linn, R., Baker, E., & Dunbar, S. (1991). Complex, performance-based assessment: Expectations and validation criteria.
Educational Researcher, 20(8), 15-21.
Linn, R. L., Burton, E., DeStefano, L., & Hanson, M. (1995). Generalizability of New Standards Project 1993 pilot study
tasks in mathematics (CSE Tech. Rep. No. 392). Los Angeles, CA: University of California, Los Angeles Center for
the Study of Evaluation.
McDonnell, L. M. (1996). The politics of state testing: Implementing new student assessments. Los Angeles, CA:
UCLA Center for the Study of Evaluation.
McDonnell, L. M., & Choisser, C. (forthcoming). Testing and teaching: Local implementation of new state assessments.
Los Angeles, CA: University of California, Los Angeles Center for the Study of Evaluation.
Moerkerke, G. (1996). Assessment for flexible learning: Performance assessment, prior knowledge state assessment and
progress assessment as new tools. Utrecht: Uitgeverij Lemma BV.
Muthén, B., Huang, L. C., Jo, B., Khoo, S. T., Goff, G., Novak, J., & Shih, J. (1995). Opportunity to learn effects on
achievement: Analytic aspects. Educational Evaluation and Policy Analysis, 17(3), 371-403.
Myford, C., & Mislevy, R. (1995). Monitoring and improving a portfolio assessment system (Center for Performance
Assessment Research Report). Princeton, NJ: Educational Testing Service.
Picus, L. O. (1994). A conceptual framework for analyzing the costs of alternative assessment (CSE Tech. Rep. No. 384).
Los Angeles, CA: University of California, Los Angeles Center for the Study of Evaluation.
Picus, L. O., & Tralli, A. (1996). Alternative assessment programs: What are the total costs of assessment in Kentucky
and Vermont. Los Angeles, CA: University of California, Los Angeles Center for Research on Evaluation, Student
Standards, and Testing.
Resnick, L. B., & Klopfer, L. E. (1989). Toward the thinking curriculum: An overview. In L. B. Resnick & L. E. Klopfer
(Eds.), Toward the thinking curriculum: Current cognitive research (pp. 1-8). Alexandria, VA: Association for Supervi-
sion and Curriculum Development.
Rogosa, D. (1994). Misclassification in student performance levels (Technical report prepared for the California Learn-
ing Assessment System). Stanford, CA: Stanford University.
Rogosa, D., & Sanger, H. M. (1995). Longitudinal data analysis examples with random coefficient models. Journal of
Educational and Behavioral Statistics, 20, 149-170.
Schoenfeld, A. H. (1991). On mathematics as sense-making: An informal attack on the unfortunate divorce of formal and
informal mathematics. In J. Voss, D. N. Perkins, & J. Segal (Eds.), Informal reasoning and education. Hillsdale, NJ:
Lawrence Erlbaum.
Seltzer, M. H., Frank, K. A., & Bryk, A. S. (1994). The metric matters: The sensitivity of conclusions about growth in
student achievement. Educational Evaluation and Policy Analysis, 16(1), 41-50.
Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational
Measurement, 30, 215-232.
Shavelson, R. J., Baxter, G. P., & Pine, J. (1991). Performance assessment in science. Applied Measurement in Educa-
tion, 4, 347-362.
Shavelson, R. J., Mayberry, P. W., Li, W., & Webb, N. M. (1990). Generalizability of job performance measure-
ments: Marine Corps rifleman. Military Psychology, 2, 129-144.
Shepard, L. (1991). Will national tests improve student learning? (CSE Technical Report 342). Los Angeles, CA: National
Center for Research on Evaluation, Student Standards, and Testing.
Slavin, R. E. (1990). Cooperative learning: Theory, research and practice. Englewood Cliffs, NJ: Prentice-Hall.
Smith, M. L. (1994). How assessments work: Lessons learned in equity. Presentation at the annual conference of the
National Center for Research on Evaluation, Standards, and Student Testing, Los Angeles, CA.
Smith, M. L. (1996). Reforming schools by reforming assessment: Consequences of the Arizona student assessment
program. (CSE Tech. Rep. No. 425). Los Angeles, CA: National Center for Research on Evaluation, Student Standards,
and Testing.
Sugrue, B. (1996). The relative validity and reliability of different assessment formats for measuring domain specific
problem solving ability. (CSE draft deliverable). Los Angeles, CA: National Center for Research on Evaluation, Student
Standards, and Testing.
Stecher, B. (1995). The cost of performance assessment in science. Invited symposium presented at the annual meeting
of the National Council on Measurement in Education. San Francisco, CA.
Weinstein, C., & Meyer, D. (1991). Implications of cognitive psychology for testing: Contributions from work in learn-
ing strategies. In M. C. Wittrock & E. L. Baker (Eds.), Testing and cognition (pp. 40-61). Englewood Cliffs, NJ: Pren-
tice Hall.
Wolf, D. P. (1992). Good measures: Assessment as a tool for educational reform. Educational Leadership, 49(8), 8-13.
Wolf, D. P., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their minds well: Investigating new forms of student
assessment. In G. Grant (Ed.), Review of research in education (Vol. 17, pp. 31-74). Washington, DC: American
Educational Research Association.
Wolf, R. M. (1994). Performance assessment in IEA studies. International Journal of Educational Research, 21(3), 239-
245.

Biography

Joan Herman is Professor of Education in the Center for Studies in Evaluation at the University
of California, Los Angeles. Her major research interest is technological advances in performance
and authentic assessment.
CHAPTER 5

ASSESSMENT AS A MAJOR INFLUENCE ON LEARNING AND INSTRUCTION

FILIP J. R. C. DOCHY and GEORGE MOERKERKE

University of Heerlen, Educational Tech Innovation Centre, PO Box 2960, 6400 DL Heerlen,
The Netherlands

The vast amount of research on intelligence testing has provided a great deal of what is known
today in the field of applied assessment. It is certainly true, however, as Lohman (this issue,
1993) points out, that the assessment of knowledge and skills in the future needs a grounding in
a theory of knowledge and skills transfer and a theory of person-situation interaction. Advances
in reliability and validity theory offer perspectives for the future. Several studies have resulted in
techniques such as generalizability analysis (e.g., Moerkerke, 1996; Van der Vleuten & Van Luyk,
1986) and differential item functioning analysis (e.g., Berberoglu, Dochy, & Moerkerke, 1996)
that appear to be useful for a next generation of assessment tools.
Cognitive educational psychology has also influenced assessment theory. In the past two decades
this influence has been stronger than ever before (Baker, 1994; Baxter, Elder, & Glaser, 1996;
Bennett, 1993; Glaser, 1991; Snow & Lohman, 1989). Royer, Cisero, and Carlo (1993) provide
an interesting overview of techniques for assessing cognitive skills which portray the influence
of educational psychology very well. As Mislevy (1996) has recently stated: "The cognitive revolu-
tion is a fait accompli in psychology, and it has begun to influence the ways in which educators
seek to characterize, monitor and advance students' learning" (p. 411). Mislevy also points out
that the models and methods that have evolved within standard test theory can be extended and
should be reconceived to address problems cast in a discourse on student learning.
Despite these advances, the reality is that most daily school practice still looks very much like
"teaching to the test" and perhaps even more like "learning to the test." From one point of view
this is discouraging because it represents a major task here waiting for all those concerned with
education. From another perspective, however, here lies the power for all innovations in educa-
tion (Wilbrink, 1997; Wolf, 1992). As assessment changes, learning and teaching will change
also.

The Need for Research on Performance Assessment in Small-Scale Instructional Settings

When reflecting on the past one can also stress what has not received sufficient attention. One
of the most striking blind spots in the field of educational measurement is small-scale assessment
research. Within classrooms many day-to-day instructional decisions are based on assessments
of knowledge and skills. These decisions should be based on reliable, valid, and useful informa-
tion. Moreover, these assessments should be efficient since classroom time is a valuable resource.
Given the importance of these instructional decisions it is remarkable that scientific interest has
so long been devoted to the development of heuristics, psychometric models, and software tools
that support the design and validation of large-scale testing programs. At the same time, little
scientific attention to heuristics and tools that guarantee the quality of small-scale assessment
programs designed and evaluated by teachers has been evident. As a result, teachers are often not
adequately prepared for the assessment part of their job; training programs for teachers are often
dominated by psychometric problems and models, and solutions for large-scale assessment
problems (Stiggins, 1991).
A decade ago, Gulliksen (1985) wrote the following after a long career in measurement.

I think it is important for persons interested in education ... to bear in mind that large-scale standardized
testing programs require that a test or tests be given to thousands of students only a few times a year using
standardized machine-scored answer sheets. ... The foregoing situation must be differentiated from the
requirements of and the methods appropriate for tests prepared by the individual teacher.... I am begin-
ning to believe that the failure to make this distinction is responsible for there having been no improve-
ment, and perhaps even a decline, in the quality of teacher-made small classroom tests over the last forty
years (p. 4).

Since Gulliksen's report, the quality of classroom assessment has received more attention from
educational measurement specialists. At the same time, however, researchers in the field believe
that considerable research is yet to be done on the development of tools and heuristics for teach-
ers who need to integrate assessment with learning and instruction (Birenbaum, 1996; Linn, 1989;
Nitko, 1989).
Performance assessment is one of the areas of research which should be emphasized; its role
in education is growing rapidly and the scientific progress has long been too slow. Priority should
be given to the activities of teachers and their reasons for assessing during instruction. Teachers
should have a framework which enables them to validate the assessment procedures they develop
and use. In addition, teachers should have the tools for developing and applying assessment
procedures systematically, bearing in mind the consequences assessment has for instruction and
learning.

Validation Techniquesfor Small-Scale Performance Assessments

It is very doubtful that standard validity procedures used in large-scale testing programs are
adequate for small-scale performance assessments. Large-scale testing programs involve a number
of activities that guarantee a certain amount of validity. Scientifically grounded instructions and
tools are used by professional test developers in the construction of the test. Pilot test procedures
are used by domain experts, students, raters, administrative staff, and so on. Research is undertaken
to investigate the validity of the pilot test procedures. Any form of research commonly applied in
social science is, in principle, appropriate in this regard (Messick, 1989). The results of the pilot
tests are used to make adjustments in the test prior to its wide scale use.
Because of a lack of resources, validation of small-scale assessment procedures is only occasion-
ally carried out according to Messick's guidelines. Several researchers working with small-scale
assessment programs have observed that the field of psychometrics is inadequate to deal with the
problems occurring in the development and use of these assessments (Ridgway & Schoenfeld,
1994; Shepard, 1993; Stiggins, 1991). Shepard (1993), for example, emphasized the need for
a conceptual framework that identified the critical validity questions for assessment programs.
Teachers and others developing small-scale performance assessments need to know the answers
to several key questions in order to be able to defend the trustworthiness and validity of their
assessment programs. Examples include: How much evidence is needed before we can make
decisions based on this assessment procedure? Where should validation start?
Kane (1992) provided one way to deal with questions of validity in small-scale educational
settings. He proposed a method called the argument-based approach to validity, an approach which
has an affinity with Cronbach's (1982) view on validation as an act of evaluation. Most evalua-
tors have to evaluate their programs in a very limited time. Since they do not have the opportunity
to extensively investigate a number of questions, they must identify the most relevant questions
concerning the program and decide on how elaborately each should be answered (Cronbach,
1982). Validation of an assessment program can be considered as evaluation in a social environ-
ment. The emphasis need not be on empirical research, but on arguments defending the quality
of the program. Results from empirical research may be one of the elements of argumentation.
According to Kane (1992), argumentation that is transparent and coherent and makes use of
plausible assumptions will contribute most to the validation of assessment procedures. As a
consequence, the most questionable assumption deserves the most attention:

To validate a test-score interpretation is to support the plausibility of the corresponding interpretive argu-
ment with appropriate evidence. The argument-based approach to validation adopts the interpretive argu-
ment as the framework for collecting and presenting validity evidence and seeks to provide convincing
evidence for its inferences and assumptions, especially its most questionable assumptions. One (a) decides
on the statements and decisions to be based on the test scores, (b) specifies the inferences and assumptions
leading from the test scores to these statements and decisions, (c) identifies potential competing interpreta-
tions, and (d) seeks evidence supporting the inferences and assumptions in the proposed interpretive argu-
ment and refuting potential counter arguments (Kane, 1992, p. 527).

This approach has several advantages for the validation of performance assessments by teachers.
First, its orientation is towards improvement of the existing assessment procedures instead of
"go"/"no go" decisions. Second, its orientation is towards solving both practical and critical
problems in assessment procedures. Third, it can be applied in many situations since no particular
empirical investigation is prescribed. Moreover, for the validation of small-scale assessment
programs empirical research is a possible but not an essential element in the construction of
arguments. This means the core issue about validation is not that we must collect data to underpin
validity, but that we should formulate transparent, coherent, and plausible arguments to underpin
validity.
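
A concrete, if deliberately simple, way to support this kind of argumentation is to give teachers a structured template that mirrors Kane's four steps. The sketch below is one possible representation; the field names and the example content are invented for illustration and are not drawn from Kane's article.

from dataclasses import dataclass, field

@dataclass
class InterpretiveArgument:
    decision: str                                   # (a) decision to be based on the scores
    inferences: list = field(default_factory=list)  # (b) inferences and assumptions
    competing: list = field(default_factory=list)   # (c) potential competing interpretations
    evidence: list = field(default_factory=list)    # (d) supporting or refuting evidence

argument = InterpretiveArgument(
    decision="Students scoring 3 or higher on the lab portfolio may skip the remedial unit",
    inferences=["The portfolio tasks sample the intended laboratory skills",
                "Scores generalize across comparable tasks"],
    competing=["Scores mainly reflect writing fluency rather than laboratory skill"],
    evidence=["Two raters agreed on most pass/fail classifications (hypothetical local data)",
              "Task specifications were reviewed by two domain experts"],
)

print(argument.decision)
for assumption in argument.inferences:
    print(" -", assumption)

The value of such a template lies less in the code than in forcing the most questionable assumptions into view, which is exactly where Kane argues that validation effort should be concentrated.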
Moss (1992) stated that the changing views on validity could be useful for developers of
performance tests in formulating new criteria that might be considered as elements of the generic
criterion of validity. The concept of quality has been extended beyond that of traditional validity
to include such criteria as the transparency of the assessment procedure, the impact of assess-
ment on education, directness, effectiveness, fairness, completeness of the domain description,
practical value and meaningfulness of the tasks for candidates, and authenticity of the tasks (Haer-
tel, 1991; Linn, Baker, & Dunbar, 1991). In Europe, a number of these features of assessment
were identified earlier by legal specialists and instructional technologists (cf. De Groot, 1970),
but they were not considered to be an integral part of the validation of assessment procedures by
educational measurement specialists. Although Messick (1994) did not disagree with most of the
criteria mentioned by Moss (1992), Haertel (1991), and Linn et al. (1991), he argued that what
he called "specialized validation criteria" for performance assessment could be elaborated more
systematically within the framework of the unifying concept of validity. According to Messick
(1994), these validation criteria are, in a more sophisticated form, already part of the unifying
concept of validity which he expressed in 1989. As far as we know, this statement has not yet been
contradicted. Nonetheless, the theoretical definitions and concepts of validity theory may not
be the optimal frameworks in which teachers and measurement specialists can build their
arguments about the validity of performance assessment. The necessary arguments should address
the most critical questions about the weakest parts of the assessment procedures. These questions
will be formulated by students, parents, administrators, colleagues, peer review committees,
organizations of employers, and occasionally measurement specialists. One day in the future, it
may become clear which validity concepts are most appropriate for communication among
measurement specialists and which are most appropriate for communication among other
participants in assessment.

A Research Program for Performance Assessment: Providing Teachers the Means to Build
Arguments

Part of the research program in educational measurement should be concentrated on tools and
heuristics which enable teachers to build the arguments when validity is questioned. In general,
confidence in the validity of assessment increases with the use of standard methods whose
effectiveness has been established (Gronlund, 1988). Teachers need easy-to-follow, thoroughly
studied, efficient tools and heuristics that have been shown to be effective in similar educational
contexts. Once these methods and heuristics are trustworthy and crystallized, teachers and test
constructors can use them as a base for the activities involved in performance assessment and as
a base for formulating their claims about validity.
The main activities in the development of performance assessments are (1) describing the func-
tion of assessment, (2) establishing the performance to be assessed, (3) developing tasks to elicit
the student's skills, and (4) determining procedures for rating the performance (Moss, 1992; Stig-
gins, 1987). Since activities 3 and 4 are underrepresented in the literature, they are emphasized
in the sections that follow.

Performance assessment and task construction

The literature does not offer cut-and-dried formulas for the identification of the relevant task
dimensions and their values. Most literature on task construction describes the construction of
multiple choice test items. These methods often provide very precise prescriptions for well-
defined problems, like those in arithmetic (cf. Birenbaum, Tatsuoka, & Gutvirtz, 1992). Research
on task construction for assessment of problem solving skills for so-called ill-defined problems
(see Newell, 1969) is scarce. In this regard, Moerkerke (1996) evaluated a method for task
construction for performance assessment.
The method consists of a tool and a number of rules for using that tool. The tool is called a
task specification chart (TSC). A TSC lists the dimensions on which the tasks vary within the
chosen universe and the values of these dimensions. There are no constraints on values; values
can be numerical or qualitative. Numerical values can be discrete or continuous. Each task is
specified by choosing one or more values for each dimension. Each unique combination of values
is called a task profile and every task profile defines a class of tasks.
Moerkerke (1996) identified six general task dimensions in a review of literature on task
construction: cognitive behavior, subject-specific content, context of a problem, presentation of
a problem, structure of a problem, and form of response. In a project aimed at the construction of
a selection instrument, two additional task dimensions were needed for a full description of
performance tasks: the physical environment of the candidate and assessment-related issues.
Moerkerke provided detailed descriptions of the use of TSCs for the assessment of complex skills
in business education and chemical process operating.
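To make the idea of a task specification chart concrete, the following sketch (a minimal illustration, not Moerkerke's actual instrument) encodes the six general task dimensions as a chart and enumerates task profiles as unique combinations of values. The dimension names follow the text above; the listed values and the helper function are hypothetical and serve only to show the structure.

```python
from itertools import product

# A minimal sketch of a task specification chart (TSC): each dimension is
# mapped to the values it may take within the chosen task universe.
# The six dimension names follow the review summarized above; the values
# listed here are invented examples, not those of any reported study.
tsc = {
    "cognitive behavior": ["recall", "application", "problem solving"],
    "subject-specific content": ["cost accounting", "market analysis"],
    "context of the problem": ["familiar", "novel"],
    "presentation of the problem": ["text", "table", "graph"],
    "structure of the problem": ["well-defined", "ill-defined"],
    "form of response": ["short answer", "essay", "product"],
}

def task_profiles(chart):
    """Enumerate task profiles: every unique combination of one value per dimension.
    Each profile defines a class of tasks from which concrete tasks can be written."""
    dimensions = list(chart)
    for values in product(*(chart[d] for d in dimensions)):
        yield dict(zip(dimensions, values))

profiles = list(task_profiles(tsc))
print(len(profiles), "task classes, for example:", profiles[0])
```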

Performance assessment and rating procedures

Rating procedures have to be designed in order to minimize two major threats to validity:
construct-irrelevant variance and construct underrepresentation (Messick, 1995). In the case of
construct-irrelevant variance, the assessment contains systematic variance that is irrelevant to
the construct being measured. It may be variance that can be accounted for by the assessment
format, or by the intrusion of other constructs that are independent of the construct being measured.
The threat of construct underrepresentation means that the assessment fails to include important
dimensions of the construct. Generally, the assessment is too narrow in relation to the construct
being measured.
The literature on ratings usually concentrates on construct-irrelevant variance. The standard
approach identifies raters as sources of bias. Raters are considered to be, and are expected to act
as, interchangeable, randomly selected elements from an infinite universe. A rater should strive
to be a nameless clerk who could be replaced by any other clerk with the same qualifications
without even being noticed. Raters should be as objective as identical machines, performing identi-
cal actions leading to identical decisions.
There is extensive evidence that some form of regimentation is necessary in order to achieve
an acceptable level of interrater agreement. The following results were often reported when rat-
ing guidelines were absent (Burger & Burger, 1994; Moerkerke, 1996; Van der Vleuten & Van
Luyk, 1986):
• different raters do not always assign the same grades to the same product or procedure;
• a single rater does not always assign the same grades to the same paper on different occa-
sions; and
• differences among the grades assigned tend to increase as the examinee's task permits greater
freedom of response and actions.

Suggestions for dealing with these problems also appear in the literature (Gronlund, 1988; Haer-
tel, 1991). They include:
• constructing guidelines (e.g., model answers, product scales, checklists and rating scales,
and advice for scoring procedures),
• using multiple raters, and
• selecting and training raters.
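
One way to check whether such measures have produced an acceptable level of interrater agreement is to compute an agreement index over a common set of rated products. The sketch below computes raw percent agreement and Cohen's kappa for two raters; it is a generic illustration, not a procedure prescribed by the studies cited above, and the grades are invented.

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of products to which two raters assigned the same grade."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement between two raters corrected for chance."""
    n = len(r1)
    p_obs = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    p_exp = sum(c1[g] * c2[g] for g in set(r1) | set(r2)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Invented grades assigned by two raters to the same ten student products.
rater_a = ["A", "B", "B", "C", "A", "B", "C", "C", "A", "B"]
rater_b = ["A", "B", "C", "C", "A", "B", "B", "C", "B", "B"]
print(round(percent_agreement(rater_a, rater_b), 2),
      round(cohens_kappa(rater_a, rater_b), 2))
```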

Construct underrepresentation in relation to ratings is less well investigated. How can a teacher
argue that his or her rating is valid? To answer this question, alternative views on the
variability among raters must be taken into account. Some authors feel that variability among
raters is not only due to errors and bias, but is in part a result of systematic differences within the
community of problem solvers. Voss and Post (1990) discussed assessing the quality of a product
or process in relation to solving ill-defined problems. They noticed that communities of problem
solvers are not able to unanimously specify the solution to an ill-defined problem. This means
that the one-and-only true status, the one and only objective reality, which is the bedrock of the
positivistic view on objectivity, does not exist. In the words of Voss and Post (1990), "the important
point to make about good solutions to ill-structured problems is that there are generally not right
answers" (p. 281). In addition, there is no universally accepted stopping rule for an ill-structured
problem. "Ill-structured problems are regarded as solved via the application of stop rules, with
such rules being established for the particular domain, and, quite importantly, often being applied
differentially by different persons" (Voss & Post, 1990, p. 281).
How should one assess the quality of the solution under those conditions? Voss and Post
underlined the importance of argumentation as a means to motivate judgments of the quality of
solutions for ill-defined problems, since universally accepted criteria do not exist. The existence
of schools of thought was investigated in a project on judging software programming activities
(Moerkerke, 1996). Systematic - - but implicit - - differences among teachers who taught program-
ming were found. Defining and exposing these differences was necessary to reform the assess-
ment (the purpose of the project) and this eventually led to changes in instruction.
With respect to differences among raters, an interesting view was expressed by Suen, Logan,
Neisworth and Bagnato (in press). They argued that the focus on objectivity among raters as a
desirable characteristic of an assessment procedure leads to a loss of relevant information. If dif-
ferent raters have unique qualities, then each rater may provide relevant and unique information
on performance. In those cases, there seems to be no logic in requiring congruence among judg-
ments. In high-stakes decisions, procedures which include ways of weighing high quality informa-
tion from multiple perspectives may lead to better decisions than those in which information
from a single perspective is taken into account.
The issue of construct underrepresentation in relation to rating can be addressed by an explicit
definition of the quality of the processes or products which are to be assessed. This means that
product and process quality criteria need to be identified, defined, and communicated to candidates
and raters. A candidate needs to know these criteria in order to be a successful problem solver; a
rater needs to know these criteria in order to give a rating which is valid with respect to the skill
which is the scope of the assessment. Some, like Stiggins (1987), see the selection of criteria as
the most critical activity in the design of performance assessments.
A few authors have addressed the role of criteria in assessment and/or proposed a taxonomy of
criteria. Arter (1993), for example, identified two types of criteria: criteria for task specific scor-
ing and criteria for generalized judgmental scoring. Criteria for task specific scoring are those
which are task-related because they indicate which elements of the responses should be scored.
The criteria for task specific scoring become rating scales or checklists (Gronlund, 1988). Criteria
for generalized judgmental scoring are not task specific. The same set of criteria can be used
across a set of tasks or exercises.
A taxonomy of criteria has been developed by Sadler (1983), who distinguished four classes
of performance criteria: regulative, logical, prescriptive, and constitutive. Regulative criteria are
rules that are meant to provide a degree of uniformity in presentation. These criteria are prescrip-
tions for characteristics such as length, layout, structure, spelling, grammar, and use of tools.
Some rules are accepted practice, while others are more-or-less arbitrary decisions made by a
community of experts.
Logical criteria are used to evaluate the validity of reasoning. They capture the way humans reason;
they are about the way equations should be solved, theorems should be proven, conclusions should
be reached, and so on. Logical criteria are probably not universal in a strict sense, but most logical
criteria are broadly accepted within the paradigms of science.
Prescriptive criteria are normative statements about how something is to be valued. These
criteria are abstract concepts that link features of the same nature. When the link is made, these
concepts can be labeled. Examples include coherence, originality, and readability.
Finally, constitutive criteria are used to define and distinguish disciplines. Their basis is
consensus among members of that discipline on the standard categories, concepts, and methodolo-
gies. Sadler stated that regulative and logical criteria both refer to empirical facts of products or
behaviors. Errors and mistakes are detectable and standards can be defined. Since the criteria are
often straightforward, it is relatively easy to find out when an error has been made and when it
has not. Prescriptive and constitutive criteria lead to judgments that are essentially subjective.
According to Sadler (1983), they "rely greatly upon whether the assessor is persuaded or
convinced" (p. 70).
Typologies of criteria, whether abstract (Sadler, 1983) or directly related to domains (e.g.
Boehm, Brown, Kaspar, Lipow, MacLeod, & Merrit, 1978), can be helpful in developing a set of
rating criteria for performance assessment. Along this line, Arter (1993) has provided a heuristic
for the development of "your own general judgmental performance criteria" (p. 11). Her heuristic
includes seven steps. They are:
• gather samples of student performance;
• brainstorm a list of attributes;
• cluster the attributes;
• write a value-neutral definition of each trait;
• describe strong, middle and weak performance on each attribute;
• define benchmarks; and,
• make it better.

Her insights can be combined with the ones that we gained in an instructional setting that was
somewhat more complex by virtue of the number of teachers involved and the need for
instructional scientists to work with experts from several disciplines. The following five phases
quite likely will produce a defensible set of performance criteria.
Phase 1: Orientation of the Criteria. In this phase, domain experts are asked to gather a sample
of student performance, to make explicit their tacit knowledge about quality and criteria in their
domain, and to search and analyze relevant literature. A measurement specialist could provide
information extracted from theory, research, or similar projects in other domains. The problem
should be approached by brainstorming and creative thinking. This leads to a provisional list of
criteria and a provisional definition of good problem solving within the domain.
Phase 2: Defining, Relating, and Selecting Criteria. In this phase, domain experts provide a
more or less value-neutral definition for each criterion. The criteria are related to each other and
selected or integrated on the basis of their definitions. Quality requirements (and their defini-
tions) for a highly specialized domain need to be reworked into rating criteria for performance
assessment in schools and universities.
Phase 3: Elaborating and Illustrating the Rating Criteria. In this phase, domain experts
elaborate the criteria by describing or providing good and bad examples of behaviors or products.
They also give an indication of how the features of a behavior or product are related to one or
more of the criteria. The examples can be produced by the experts (for instance, model answers
or model products) or can be gathered from examinees.
Phase 4: Investigating the Importance of the Rating Criteria. In this phase, domain experts
give indications on the importance of the criteria. This could be done for a course, but also for a
curriculum or different moments in a curriculum. The latter can be useful in supporting the idea
that during a curriculum students progress from novices to experts and are able to use more profes-
sional criteria towards the end of their study (Sadler, 1983). Coherence among the experts, and
the assumption that there are no different schools of thought among them, should be
investigated.
Phase 5: Improve Ratings and/or Instruction. In this phase, domain experts will have to compare
the results of phases one through four with the existing rating procedures and the content and
objectives of instruction. In this phase, the decision must be made as to whether the rating
procedures and/or instruction should be improved.
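A set of rating criteria resulting from these phases can be represented quite simply, for instance as weighted criteria with ordered level descriptors. The sketch below shows a hypothetical rubric for generalized judgmental scoring; the criterion names, weights, and level descriptions are invented for illustration and would in practice come from the domain experts in Phases 1 through 4.

```python
# A hypothetical rubric: each criterion has a weight (Phase 4: importance)
# and ordered level descriptors (Phase 3: elaboration and illustration).
rubric = {
    "coherence":   {"weight": 0.4, "levels": ["fragmented", "mostly connected", "tightly argued"]},
    "originality": {"weight": 0.3, "levels": ["reproduces sources", "some own ideas", "novel approach"]},
    "readability": {"weight": 0.3, "levels": ["hard to follow", "adequate", "clear and well structured"]},
}

def score(ratings, rubric):
    """Weighted score in [0, 1]; ratings maps each criterion to the chosen level index."""
    total = 0.0
    for criterion, spec in rubric.items():
        level = ratings[criterion]                       # index into the level descriptors
        total += spec["weight"] * level / (len(spec["levels"]) - 1)
    return total

# One rater's judgment of a single student product.
print(score({"coherence": 2, "originality": 1, "readability": 2}, rubric))
```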

Changing Education Today: Powerful Learning Environments and the Need for New
Assessment Instruments

One of the changes affecting our learning systems most is the mass implementation of differ-
ent forms of Powerful Learning Environments (PLE). Examples include problem-based learn-
ing, study-house learning, and project-oriented learning. Characteristics that are relevant for the
design of powerful learning environments have emerged from recent research on learning and
instruction (De Corte, 1990). Two are discussed here: the constructive nature of learning, and the
necessity to anchor learning in real life contexts.
A robust result of recent research in instructional psychology is that learning is an active and
constructive process. Learners are not passive recipients of information; rather, they actively
construct their knowledge and skills on the basis of their prior knowledge - - informal as well as
formal - - and through interaction with their environments. Learning environments should,
therefore, support constructive acquisition processes. Powerful learning environments are
characterized by a balance between discovery learning and personal exploration, on the one hand,
and systematic instruction and guidance, on the other. Individual differences in abilities, needs,
and motivation between students are always taken into account.
The need to anchor learning in real life situations derives from a variety of investigations reflect-
ing different theoretical perspectives, namely, studies in the Vygotskian tradition (Vygotsky, 1978),
findings concerning the impact of children's informal knowledge on their learning, and analyses
of successful knowledge and skill acquisition in non-school situations (Resnick, 1987). All these
research outcomes support the conclusion that students' constructive learning activities should
preferably be embedded in contexts that are rich in resources and learning materials, that offer
ample opportunities for social interaction, and that are representative of the kinds of tasks and
problems to which the learners will have to apply their knowledge and skills in the future.

The Need for Reconceptualizing Assessment Instruments

The new findings and insights about powerful learning environments provide a framework for
reflecting on and evaluating the current educational practice in our schools. In addition - - and
considerably more important in the context of this chapter - - they point to the necessity of recon-
ceptualizing current tests and assessments and critically examining their underlying theory (Gla-
ser, 1990; Lohman, this issue). As Mislevy (1993) has observed:

Most items on standard achievement tests assess students' abilities to recall and apply facts and routines
presented during instruction. Some require only the memorization of detail. . . . Other achievement test
items, although supposed to assess higher-level learning outcomes like "comprehension" and "applica-
tion", often require little more than the ability to recall a formula and to make appropriate substitutions to
arrive at a correct answer (pp. 219-220).

Although one should recognize that the accumulation conception of learning is to some degree
applicable in some domains and under certain circumstances, it is not the appropriate view with
respect to the most important objectives of present day education (i.e., understanding and problem
solving). Neither is it appropriate with respect to the forms of learning that involve the construc-
tion of meaning by the student and the development of strategies for approaching new problems
and learning tasks. Therefore, new types of instruments are required that allow the assessment of
qualitatively distinct levels of understanding (or misunderstanding), as well as strategic differ-
ences in learners' approaches to unfamiliar problems and challenging learning situations.
The need for this new type of assessment instrument is especially obvious from the perspec-
tive of the need for a better integration of instruction and assessment (Dochy, 1992; Nitko, 1989).
Indeed, because of their static and product-oriented nature, traditional achievement tests fail to
provide relevant diagnostic information which is needed to adapt instruction appropriately to the
needs of the learner (Campione & Brown, 1990; Dochy, 1994; Snow & Lohman, 1989).
One example of such a reconceptualized assessment is the OverAll Test (OAT) at the University
of Maastricht (Segers, 1996). The OAT measures the extent to which students are able to analyze
problems and contribute to their solution by applying the knowledge they acquired, such as
economic concepts, models, and theories. Additionally, the OAT items measure whether students are
able to retrieve the relevant knowledge needed for solving the problem presented and whether they know
"when and where" (i.e., conditional knowledge) (Dochy & Alexander, 1995).
The OAT is administered twice a year within the curriculum. Students receive a manual
beforehand which provides information about the main goals of the OAT, the parts of the cur-
riculum which are relevant for the study of the material presented in the manual, an example of
an elaborated case with test items, some practical (organizational) information, and finally a set
of articles. The articles vary in character. For instance, an article can be a description
of a case relating to innovations in an international firm found in a newspaper or journal. Other
articles express the theoretical considerations of a scientist, report on a research study, or
comment on a theory or model.
During a self-study period students are expected to apply the knowledge they acquired over
the past week in explaining the new, complex problem situations described in the articles. They
have to try to explain spontaneously to themselves (i.e. without being explicitly prompted by a
tutor) the ideas and/or theories described in the articles by relating them to previously acquired
knowledge. This process is referred to as "self-explanation" (Chi & Van Lehn, 1991).
The OAT combines two item formats: true-false items with the question mark option and
essay questions. The true-false items are mostly intended to determine whether students can
apply acquired knowledge in new situations and can use abstract concepts to understand
specific complex situations which are relevant for the "real life of economists". The essay
questions assess, for instance, the ability of the students to determine the common elements
in three different plans for reducing inflation, or if students can detect the similarities and
differences between an economic model explained by the author of an article and the model
as described by authors previously studied. These questions also measure students' analytic
abilities. Other essay questions ask students to formulate solutions for a problem described in
an article, such as developing strategies for reducing inflation or taking account of the micro-
and macro-elements mentioned by the author or test constructor. Such items also measure
synthetic ability.
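The chapter does not spell out the scoring rule for true-false items with a question mark option, so the sketch below only illustrates one common convention for such items, not necessarily the rule used in the OAT: a correct answer earns a point, the question mark earns nothing, and an incorrect answer is penalized so that abstaining beats blind guessing. The answer key and responses are invented.

```python
# Hypothetical scoring of true-false items with a question-mark ("?") option.
# Convention assumed here (not taken from the OAT itself): +1 for a correct
# answer, 0 for "?", -1 for an incorrect answer.
def score_tf_question_mark(answers, key):
    points = 0
    for given, correct in zip(answers, key):
        if given == "?":
            continue                      # abstaining neither earns nor loses points
        points += 1 if given == correct else -1
    return points

key = [True, False, True, True]           # invented answer key
answers = [True, "?", False, True]        # one student's responses
print(score_tf_question_mark(answers, key))   # 1 + 0 - 1 + 1 = 1
```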

The Future: Integrating Learning, Instruction and Assessment Through Flexible and
Transformative Learning in Powerful Learning Environments

There currently is a strong appeal to take the "sword of Damocles" away from all testing and
to view assessment as a means to achieve the prescribed goals of learning. One could question
whether the future will bring us as far as this "overall assessment prophecy" predicts.
There is evidence that teachers will alter their instruction to provide those activities that support
student transformative learning (see Herman, this issue). Within another decade it will be seen
how much assessment will be integrated with learning. A model for this integration, based on
different forms of assessment, is proposed later in this section.

The Overall Assessment Prophecy

For several years, Glaser (1990) has advocated what is called the "overall assessment
prophecy." This prophecy holds that it is no longer possible to consider assessment only as a
means of determining which individuals are already adapted to or have the potential for adapting
to mainstream educational practice. A conceivable alternative goal is to reverse this sequence of
adaptation; rather than requiring individuals to adapt to means of instruction, the desired objec-
tive is to adapt the conditions of instruction to individuals to maximize their potential for suc-
cess. This objective can be realized only if learning can be designed to take account of an
"individual's profile of knowledge and skills" (Pelligrino & Glaser, 1979).
The question before us is whether we should aim at assessment-driven instruction or at
instruction-driven assessment in order to reach the aforementioned learning goals. At present,
educational practice is assessment-driven. Teachers do teach to the test. But there are obviously
problems related to this practice. One of the serious charges is that teaching to the test leads to a
sacrifice in the depth of learning of knowledge and skills. Specifically, as Hambleton and Sireci
(this issue) argue, with the association of multiple-choice items and high-stakes assessments,
teachers tend to emphasize the memorization of isolated factual information suitable for this type
of test format at the expense of higher-order problem solving skills.
Our point of view is that assessing higher-order skills by means of authentic assessments will
lead to the teaching of such higher-order knowledge and skills. In this regard, we agree with Knight
(1996) that instructional reform can be attained by a careful reform in assessment. On the one
hand, the assumption is that alternative assessment will be instruction-driven by using authentic
test items directly related to instruction (Birenbaum & Dochy, 1996). On the other hand, there is
an assumption that alternative assessment will have a positive "back wash" effect on instruction
making it more active and relating it more to real life experience.

Teaching: Support for Student Learning as Transformative Learning

In the near future it is hoped that teaching will be viewed as a set of support activities for
students' learning processes. Assessment is a core element in the transformative view insofar as
it yields reliable information about the added value or the degree of transformation related to
learning experiences. As the culture is shifting from testing to assessment (Birenbaum & Dochy,
1996), one should also try to change the culture in students. Much more formative assessment
will be needed in order to convince students that assessment has two main purposes: (1) show-
ing students their strong points, their weaknesses, and their growth, and (2) guiding students
towards the achievement of the learning goals. More formative assessment might be promoted
by making more use of educational technology, computer-based assessment packages, self-, peer-
and co-assessment and other alternative assessment forms (see Hambleton and Sireci, this issue).
The option of self-, peer- and co-assessment is an interesting way of providing formative assess-
ment. Not only do these activities ensure that learners must take the assessment criteria seriously
and develop their own understanding, they also promote important skills in the areas of judgment
and self-appraisal (e.g. communication skills, self-evaluation skills, observation skills, self-
criticism) (Boud, 1992).
Currently, what is referred to as student self-assessment is a process which involves teacher-set
criteria and has the students themselves carry out the assessment. Another conception of student
self-assessment, however, requires that students assess themselves on the basis of criteria they
have selected with the assessment being either for the student's private information or for com-
munication to the teacher or others (Hall, 1995). There are two critical factors in this conception
of student self-assessment: 1) the student not only carries out the assessment, but 2) the student
also selects the criteria on which the assessment is based.
Similarly, in traditional peer-assessment the peers select the criteria and carry out the assess-
ment. Situations in which the tutor (peer) and the student share in the selection of criteria and/or
the carrying-out of the assessment are more accurately termed co-assessment (Hall, 1995). Profes-
sors still control the process as a part of their professional responsibilities, sometimes assisted by
professional bodies or assessment experts. Hence, although students' assessments are considered
seriously, they are perceived to be supplementary to the key competencies identified by the profes-
sor (Rogers, 1995).
Implementing forms of self-, peer- and co-assessment will decrease considerably the total
amount of time professors would need to spend on assessment. These forms of assessment will
likely have many faces: formative or summative; final assessments, assessments of growth, or
assessments of prior knowledge; assessment of knowledge, assessment of competencies, or the
integration of both.
Alternative assessments have a future in this context. Traditional tests and examinations have
tended to encourage memorization and rote learning even when lecturers hoped for something
more. Throughout the world there have been criticisms of the use of standardized tests as measures
of either learning or competence (Darling-Hammond, 1994; Linn et al., 1991; Rowe & Hill, 1996;
Wigdor & Garner, 1982). Hambleton and Sireci (this issue) discuss the unfortunate incorrect
association between criterion-referenced assessment and multiple-choice items. Newmann and
Archbald (1990) have argued that "most data fail to measure meaningful forms of human
competence and that significantly new forms of assessment need to be developed" (p. 164). As a
reaction mainly to the large-scale multiple-choice testing in the U.S., alternative approaches are
being investigated (e.g. Birenbaum & Dochy, 1996; Floden, 1994; Resnick & Resnick, 1992;
Shavelson, Xiaohong, & Baxter, 1996).
All these concerns and attempts illustrate the need for the development of more "in context"
and "authentic" approaches to assessment. Alternative assessment methods such as science
journals, Overall Tests, portfolios, group projects, use of case studies and simulations offer chances
of making assessment a valuable learning experience in addition to being used to assign grades
(Birenbaum & Dochy, 1996).
Research suggests that students do often find newer forms of assessment interesting and motivat-
ing. While students never lose interest in grades, they do learn and behave differently than they
do in courses where traditional tests are used (McDowell, 1996). Research on alternative assess-
ment has yielded the following conclusions (Birenbaum, 1996; Broadfoot, 1986; Dochy, Moerk-
erke, & Martens, 1996; Segers & Dochy, 1996; Wilbrink, 1997):
• alternative assessment methods are less threatening to most students than the traditional exams
and are perceived as fairer;
• students do find meaning in assignments such as projects, group exercises, and portfolios
because of their authenticity and their greater fit in powerful learning environments;
• although such assessments appeal more to students' internal motivation, grades remain on
students' minds; and,
• changing assessment methods encourages changing learning methods and results in students
shifting from pure memorization to real learning.

Use of multimedia, local area networks, shared communication systems, Internet with shared
electronic databases, video conferencing facilities, electronic self-study materials, study support
and guidance through networks, progress assessment systems, and intake and monitoring systems
will likely lead to new teaching and learning strategies. Audio, for example, is indispensable
when music or spoken language is at issue. Video, in turn, can be used when surrogate experi-
ences must be presented (e.g., exotic rituals, dangerous physical phenomena, animation of invis-
ible phenomena) or when dynamic processes or phenomena cannot be adequately presented in
printed form (e.g., plasma currents).
The computer can be seen as indispensable not only in assessing, but also for simulating
complex processes and systems (e.g., economic or physical models) and for practicing complex
skills (e.g., managing complex systems). Software that supports interactive assessments based
on different multiple-choice formats is already available (e.g., CAT system, Examiner) and the
use of different electronic devices such as CD-ROM movies, audio, or pictures with such items
is possible (e.g., as in Question Mark). Very soon, it will be possible to use Internet-based informa-
tion with assessment software; item formats can then be expanded to include cases, 3D events
and problems, and 3D simulations. Finally, it will be possible to integrate course development
software with high quality assessment software, called integrated learning and assessment systems,
to support assessment during learning (e.g. software such as Integrative Testing and Electronic
Media (ITEM) and the Mercator course development software).

A Model for Integrating Assessment and Learning

To guide our own research and development projects, we used a model which is consistent with
the overall assessment prophecy. The purpose of the model is to integrate informal assessment, formal
assessment, and learning and to give students explicit responsibilities for their progress in learning.
The model is also consistent with Nitko's (1989) opinions about the integration of assessment and
learning and Snow's (1990) view that cognitive testing in education should be aimed at initial state,
daily progress in competence and strategy, weekly progress in knowledge, or monthly progress in a
course. In this model (see Dochy, 1992), students themselves are responsible for formative assess-
ment (e.g. prior knowledge state assessments and progress assessments) which helps them get started
in their study and gives them the opportunity to monitor their progress or growth. Instructors, on the
other hand, are responsible for formal final assessment. The model is presented in Figure 5.1.
A student begins by stating his or her learning goals. These relate to a certain part of the knowledge
base (specific content or the whole of a university's courses) (arrow A). The second step is estimating
the state of one's prior knowledge. Dochy, Valcke, and Wagemans (1991) distinguished three types of
prior knowledge assessments: subject-oriented (directly relevant to the material to be studied), optimal
requisite (what one masters if one is to start a course under optimal circumstances) and domain-
specific (an individual's performance with respect to a well-defined level or body of knowledge).
After the student has made an assessment of his or her prior knowledge, the learning goals are
reformulated (if necessary) and the student starts with the appropriate learning tasks (arrow B).
During learning, the student regularly uses self-, peer-, and/or co-assessment to check his
or her learning progress, to determine the required guidance, and to identify subsequent learning
tasks (arrow C). To support goal-directed learning it is important that progress assessments measure
the same objectives and have the same difficulty as final tests. Following a final progress assess-
ment the student decides either to stop learning and take a formal summative authentic assess-
ment or to study some parts of the course again.
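Read procedurally, the model describes a loop in which the student runs the formative assessments and the instructor runs the final one. The sketch below is only a schematic rendering of that loop; the functions and the mastery threshold are hypothetical stand-ins that simulate scores and do not correspond to actual instruments.

```python
import random

# Hypothetical stand-ins for the instruments in Figure 5.1; they merely simulate
# scores so that the control flow of the cycle can be run end to end.
def prior_knowledge_assessment(goals):
    return {g: random.uniform(0.0, 0.6) for g in goals}

def learning_activity(progress):
    return {g: min(1.0, s + random.uniform(0.1, 0.3)) for g, s in progress.items()}

def progress_assessment(progress):
    return dict(progress)            # self-, peer- or co-assessment of current mastery

def final_assessment(progress, threshold):
    return all(score >= threshold for score in progress.values())

def study_cycle(goals, mastery_threshold=0.8):
    """Schematic loop: formative assessment is student-run, the final assessment is formal."""
    progress = prior_knowledge_assessment(goals)          # estimate the prior knowledge state
    while any(s < mastery_threshold for s in progress.values()):
        progress = learning_activity(progress)            # appropriate learning tasks
        progress = progress_assessment(progress)          # check growth, adjust guidance
    return final_assessment(progress, mastery_threshold)  # formal summative assessment

print(study_cycle(["goal 1", "goal 2"]))
```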

[Figure 5.1 is a diagram linking the student and the knowledge base through media: a prior knowledge state assessment precedes a series of learning activities, each followed by a progress assessment and ending in a final assessment.]
Figure 5.1. Dochy's model for the integration of learning, instruction and assessment (adapted from Dochy, 1992, p. 187).

Conclusions

The field of educational measurement and assessment in schools has made substantial progress
over the past decades. The contribution to this progress by theories of intelligence, psychometrics,
and large-scale assessment should not be underestimated. Gains in research and applied practices
are limited, however, by societal changes, which probably will always run ahead. Changes in
society have a large impact on educational practice, on the economy, on the labor market, and on
individual thinking. Several foreseeable social changes will likely influence education in the future.
The need for lifelong learning will increase. Due to the rapid pace of technological develop-
ment and societal changes, education in childhood and adolescence is no longer sufficient to
ensure that people will be able to function adequately in society. Powerful learning environments
have to be provided for a heterogeneous group of people who increasingly will demand educa-
tion which is flexible in terms of schedule (e.g., time, place, and study pace) and which incorporates
information and communication technologies.
The balance of power between teacher and student is likely to change. Students will become
accustomed to the responsibility they have for decisions that concern what they want to learn and
how they want to learn it.
The nature and storage of knowledge and skills will be different. The sheer amount of human
knowledge has increased enormously over time and will continue to do so. New carriers of
information make large amounts of knowledge accessible. New ways of communication will be
developed between these carriers of information and the individuals who seek this information.
These changes will affect (in fact are already influencing and changing the face of) assess-
ment. Assessment in the near future will very likely:
• receive a more important place throughout the learning process;
• focus on knowledge gain and, even more, on mastery of skills and competencies;
• appear in many different forms and will have several functions (intake, progress, etc.);
• use increasingly different kinds of profiles;
• benefit from and use extensively all forms of educational and communication technology
and multimedia; and,
• be administered by different participants in the learning process (students, teachers, peers,
external bodies).

The assessment culture can be used to change instruction from a system that deposits knowledge
into students' heads to one that tries to develop students who are capable of learning how to learn
(Ridgway & Schoenfeld, 1994). The explicit objective is to interweave assessment and instruc-
tion to improve education. As an example, Clarke and Stephens (1996) have used assessment as
a means for systematic reform of mathematics education in Australia. Their study demonstrated
the effectiveness of curricular change as a result of the implementation of the Victorian Certificate
of Education, which is a mix of traditional assessment using multiple-choice tests and alternative
assessment forms like investigative projects, challenging problems, and extended-answer analyti-
cal tasks. Although enthusiastic and promising results have been reported, the first attempts to
introduce new assessment procedures have not been entirely positive. Madaus and Kellaghan
(1993) reported problems with the organization, time, and costs of assessment programs.
It is imperative that one does not throw the baby out with the bath-water. Objective tests are
very useful for certain purposes, such as high-stakes summative assessment of an individual's
achievement. But objective tests should not rule an assessment program. Increasingly, measure-
ment specialists recommend so-called balanced or pluralistic assessment programs, where multiple
assessment formats are used (Birenbaum & Dochy, 1996; Clarke & Stephens, 1996; Ridgway &
Schoenfeld, 1994).
There are several motives for pluralistic assessment programs (Birenbaum, 1996; Messick,
1994). First, a single assessment format cannot serve several different purposes and decision-
makers. Second, each assessment format has its own method variance, which interacts with persons.
There seems to be no one assessment format that fits all students. Third, any assessment format
can have a negative effect on teaching and learning. From this perspective, Frederiksen's article
(1984) has a broader meaning than exposing the negative effects of multiple-choice items; it also
shows that aspects of any assessment can have negative as well as positive educational
consequences.
There is a need to establish a system for assessing the quality of alternative assessment. Several
authors have proposed ways to extend the criteria, techniques, and methods used in traditional
psychometrics to alternative assessment (Cronbach, 1988; Kane, 1992; Linn et al., 1991). Oth-
ers, like Messick (1994), oppose the idea that there should be specific criteria, and claim that the
concept of construct validity applies to all educational and psychological measurements, includ-
ing performance assessment.
Finally, there is a need for well-designed and well-evaluated heuristics which would help teach-
ers design and implement high-quality performance assessment procedures. Besides clear criteria
for development and use, individual teachers need easy-to-use techniques for quality improve-
ment and easy-to-use methods for quality control. Studies aimed at ways to ensure the control of
quality of alternative assessment programs should be conducted (Birenbaum, 1996).
As organizational constraints seem always to be high barriers against innovation (Noble &
Smith, 1994; Madaus & Kellaghan, 1993; Rothman, 1995; Van Meel, 1997), a first step towards
integration of assessment and learning could be a distinction between assessment for learning
and assessment to demonstrate achievement. By making this distinction, organizations can change
adaptively and new ways of assessment can be introduced gradually.

References

Arter, J. (1993). Performance criteria: The heart of the matter. Paper presented at the annual meeting of the American
Educational Research Association, Atlanta, GA, April 12-16.
Baker, E. (1994). Learning-based assessments of history understanding. Educational Psychologist, 29(2), 97-106.
Baxter, G., Elder, A. D., & Glaser, R. (1996). Knowledge-based cognition and performance assessment in the science
classroom. Educational Psychologist, 31(2), 133-140.
Bennett, R. E. (1993). Toward intelligent assessment: An integration of constructed-response testing, artificial intel-
ligence, and model-based measurement. In N. Frederiksen, R. J. Mislevy, & I. I. Bejar (Eds.), Test theory for a new
generation of tests (pp. 99-123). Hillsdale, NJ: Erlbaum.
Berberoglu, G., Dochy, F., & Moerkerke, G. (1996). Psychometric evaluation of entry assessment in higher education: A
case study. European Journal of Psychology of Education, XI(1), 15-43.
Birenbaum, M. (1996). Assessment 2000: Towards a pluralistic approach to assessment. In M. Birenbaum & F. J. R. C.
Dochy (Eds.), Alternatives in assessment of achievements, learning processes and prior knowledge (pp. 3-30). Boston:
Kluwer Academic Publishers.
Birenbaum, M., Tatsuoka, K. K., & Gutvirtz, Y. (1992). Effects of response format on diagnostic assessment of scholastic
achievement. Applied Psychological Measurement, 16, 353-363.
Birenbaum, M., & Dochy, F. (Eds.) (1996). Alternatives in assessment of achievement, learning processes and prior
knowledge. Boston: Kluwer Academic Publishers.
Boehm, B. W., Brown, J. R., Kaspar, H., Lipow, M., MacLeod, G. J., & Merrit, M. J. (1978). Characteristics of software
quality. Amsterdam: North-Holland Publishing Company.
Boud, D. (1992). The use of self-assessmentschedules in negotiatedlearning.Studies in Higher Education, 18(5), 529-
549.
Broadfoot, P. M. (1986). Profiles and records of achievement: A review of issues and practice. London: Holt, Rinehart &
Winston.
Burger, S. E., & Burger, D. L. (1994). Determining the validity of performance-based assessment. Educational Measure-
ment: Issues and Practice, 13(1), 9-15.
Campione, J. C., & Brown, A. L. (1990). Guided learning and transfer: Implications for approaches to assessment. In N.
Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition
(pp. 141-172). Hillsdale, NJ: Lawrence Erlbaum Associates.
Chi, M. T. H., & Van Lehn, K. A. (1991). The content of physics self-explanation. Journal of the Learning Sciences, 1,
69-105.
Clarke, D., & Stephens, M. (1996). The ripple effect: The instructional impact of the systematic introduction of perform-
ance assessments in mathematics. In M. Birenbaum & F. J. R. C. Dochy (Eds.), Alternatives in assessment of achieve-
ments, learning processes and prior knowledge (pp. 63-93). Boston: Kluwer Academic.
Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3-17).
Hillsdale, NJ: Lawrence Erlbaum.
Darling-Hammond, L. (1994). Performance-based assessment and educational equity. Harvard Educational Review,
64, 5-29.
De Corte, E. (1990). Toward powerful learning environments for the acquisition of problem solving skills. European
Journal of Psychology of Education, 5, 519-541.
De Groot, A. D. (1970). Some badly needed non-statistical concepts in applied psychometrics. Tijdschrift voor de Psy-
chologie, 25, 360-376.
Dochy, F. (1992). Assessment of prior knowledge as a determinant for future learning: The use of prior knowledge state
tests and knowledge profiles. Utrecht/London: Lemma/Jessica Kingsley Publishers.
Dochy, F. (1994). Prior knowledge and learning. In T. Husen & T. N. Postlethwaite (Eds.), International encyclopedia of
education, Second edition (pp. 4698-4702). Oxford/New York: Pergamon Press.
Dochy, F., & Alexander, P. A. (1995). Mapping prior knowledge: A framework for discussion among researchers. European
Journal for Psychology of Education, X(3), 225-242.
Dochy, E, Moerkerke, G., & Martens, R. (1996). Integrating assessment, learning and instruction: assessment of domain-
specific and domain-transcending prior knowledge and progress. Studies in Educational Evaluation, 22(4), 309-339.
Dochy, E J. R. C., Valcke, M. M. A., & Wagemans, L. J..I.M. (1991 (1994). Learning economics in higher education:
An investigation concerning the quality and impact of expertise. Higher Education in Europe,/6(4), 123-136.
Floden, R. E.. Reshaping assessment concepts. Educational Researcher, 23, 4.
Frederiksen, N. (1984). The real test bias, influences of testing on teaching and learning. American Psychologist, 39(3),
193-202.
Glaser, R. (1990). Toward new models for assessment. International Journal oJEducational Research, 14, 375-483.
Glaser, R. ( 1991 ). Expertise and assessment. In M. C. Wittrock, & E. L. Baker (Eds.), Testing and cognition (pp. 17-30).
Englewood Cliffs, NJ: Prentice Hall.
Gronlund, N. E. (1988). How to construct achievement tests. (4th ed.). Englewood Cliffs, N.J.: Prentice-Hall.
Gulliksen, H. (1985). Creating better classroom tests. (Research Memorandum no. RM 85-5). Princeton, NJ: Educational
Testing Service.
Haertel, E. H. (1991). New forms of teacher assessment. In G. Grant (Ed.), Review of research in education, 17, 3-29.
Hall, K. (1995). Co-assessment: Participation of students with staff in the assessment process. Invited address at the 2nd
EECAE conference (European Electronic Conference on Assessment & Evaluation), EARLI-AE list, March 10-14
(listserver: listserv@nic.surfnet.nl).
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527-535.
Knight, P. (1996). Quality in higher education and the assessment of student learning. Invited paper at the Third European
Electronic Conference on Assessment and Evaluation, March 4-8, EARLI-AE list, European Academic & Research
Network (EARN) (EARLI-AE on Listserv@nic.surfnet.nl).
Linn, R. L. (1989). Current perspectives and future directions. In R. L. Linn (Ed.) Educational Measurement (3rd ed.,
pp. 1-10). New York: Macmillan.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation
criteria. Educational Researcher, 20(8), 15-21.
Lohman, D. F. (1993). Teaching and testing to develop fluid abilities. Educational Researcher, 22, 12-23.
Madaus, G., & Kellaghan, T. (1993). British experience with "authentic" testing. Phi Delta Kappan, 74, 458-469.
McDowell, L. (1996). The impact of innovative assessment on student learning. IETI, 32(4), 302-313.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13-103). New York: Macmil-
lan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational
Researcher, 23(2), 13-22.
Messick, S. (1995). Validity of psychological assessment. American Psychologist, 50(9), 741-749.
Mislevy, R. J. (1993). A framework for studying differences between multiple-choice and free-response test items. In R.
E. Bennett & W. C. Ward (Eds.), Construction vs. choice in cognitive measurement: Issues in constructed response,
performance testing, and portfolio assessment (pp. 75-106). Hillsdale, NJ: Erlbaum.
Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33(4), 379-416.
Moerkerke, G. (1996). Assessment for flexible learning. Utrecht: Lemma b.v.
Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assess-
ment. Review of Educational Research, 62(3), 229-258.
Newell, A. (1969). Heuristic programming: ill-structured problems. In J. S. Aronofsky (Ed.), Progress in operations
research. New York: John Wiley & Sons.
Newmann, F. M., & Archbald, D. A. (1990). Organizational performance of schools. In P. Reyes (Ed.), Teachers and
their workplace: Commitment, performance and productivity. Newbury Park, CA: Sage Publications.
Nitko, A. J. (1989). Designing tests that are integrated with instruction. In R. L. Linn (Ed.), Educational measurement
(3rd. ed., pp. 447-474). New York, NY: Macmillan Publishing Company.
Noble, A. J., & Smith, M. L. (1994). Old and new beliefs about measurement-driven reform: "Build it and they will
come." Educational Policy, 8, 111-136.
Pellegrino, J. W., & Glaser, R. (1979). Cognitive correlates and components in the analysis of individual differences.
Intelligence, 3, 187-214.
Resnick, L. B. (1987). Learning in school and out. Educational Researcher, 16(9), 13-20.
Resnick, L. B., & Resnick, D. E (1992). Assessing the thinking curriculum. In B. R. Gifford & M. C. O'Connor (Eds.),
Changing assessment: alternative views ~'aptitude, achievement and instruction (pp. 37-75). Boston: Kluwer Academic.
Ridgway, J., & Schoenfeld, A. (1994). Balanced assessment: Designing assessment schemes to promote desirable
change in mathematics education. Invited paper at the First European Electronic Conference on Assessment and
Evaluation, February 21-23, EARLI-AE list European Academic & Research Network (EARN) (EARLI-AE on
Listserv @nic.surfnet.nl).
Rogers, E. (1995). Validity of assessments. Contribution to the 2nd EECAE conference (European Electronic Conference
on Assessment & Evaluation), EARLI-AE list, March 10-14 (listserver: listserv@nic.surfnet.nl).
Rothman, R. (1995). Measuring up: Standards, assessment, and school reform. San Francisco: Jossey Bass.
Rowe, K. J., & Hill, P. W. (1996). Assessing, recording and reporting students' educational progress: The case for "subject
profiles". Assessment in Education, 3(3), 309-352.
Royer, J. M., Cisero, C. A., & Carlo, M. S. (1993). Techniques and procedures for assessing cognitive skills. Review of
Educational Research, 63, 201-243.
Sadler, D. R. (1983). Evaluation and the improvement of academic learning. Journal of Higher Education, 54(1), 60-79.
Segers, M. S. R. (1996). Assessment in a problem-based economics curriculum. In M. Birenbaum & F. J. R. C. Dochy
(Eds.), Alternatives in assessment of achievements, learning processes and prior knowledge (pp. 201-226). Boston:
Kluwer Academic Publishers.
Segers, M. S. R., & Dochy, F. J. R. C. (1996). The use of performance indicators for quality assurance in higher educa-
tion. Studies in Educational Evaluation, 22(2), 115-139.
Shavelson, R. J., Xiaohong, G., & Baxter, G. (1996). On the content validity of performance assessments: Centrality of
domain specifications. In M. Birenbaum & F. Dochy (Eds.), Alternatives in assessment of achievements, learning
processes and prior knowledge (pp. 131-142). Boston: Kluwer Academic.
Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of research in education (Vol. 19,
pp. 405-450). Washington, DC: AERA.
Snow, R. E. (1990). New approaches to cognitive and conative assessment in education. International Journal of
Educational Research, 10, 455-473.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In. R. L. Linn
(Ed.), Educational measurement (3rd ed., pp. 263-331 ). New York: American Council on Education/Macmillan.
Stiggins, R. J. ( 1991 ). Relevant classroom assessment training for teachers. Educational Measurement: Issues and Practice,
10, 7-12.
Suen, H. K., Logan, C. R., Neisworth, J. T., & Bagnato, S. (in press). Parent-professional congruence: Is it necessary?
Journal of Early Intervention.
Van der Vleuten, C., & Van Luyk, S. (1986). A validity study of a test for clinical and technical medical skills. In I. R.
Hart, R. M. Harden & H. J. Walton (Eds.), Newer developments in assessing clinical competence (pp. 00-00). Montreal:
Heal Publications.
Van Meel, R. M. (1997). Management of flexible responses in higher education. Utrecht: Lemma.
Voss, J. F., & Post, T. A. (1990). On the solving of ill-structured problems. In N. Frederiksen, R. Glaser, A. Lesgold & M.
G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 261-285). Hillsdale, NJ: Lawrence
Erlbaum Associates.
Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Cambridge, MA: Harvard
University Press.
Wigdor, A. K., & Garner, W. R. (Eds.) (1982). Ability testing: Uses, consequences and controversies. Washington, DC:
National Academy Press.
Wilbrink, B. (1997). Assessment in historical perspective. Studies in Educational Evaluation, 22(1), 31-48.
Wolf, D. P. (1992). Good measures: Assessment as a tool for educational reform. Educational Leadership, 49(8), 8-13.

Biographies

Filip J. R. C. Dochy is Professor of Instructional Science and Technology for Teacher Training at
the Catholic University of Leuven, Belgium, and is Research Manager at the Center for Educational
Technology and Expertise. He is Secretary of the European Association for Research into Learn-
ing and Instruction (EARLI).
George Moerkerke is an educational technologist and researcher at the Center for Educational
Technology and Expertise, Open University, Heerlen, The Netherlands.
CHAPTER 6

EMPIRICISM AND VALUES: TWO FACES OF EDUCATIONAL CHANGE

PETER W. AIRASIAN

Boston College, School of Education, Campion 336D, Chestnut Hill, MA 02167, U.S.A.

This issue of the International Journal of Educational Research presents an examination of four
important educational areas: intelligence testing, large-scale alternative assessment, small-scale
alternative assessment, and educational measurement. What is unique to these chapters is the
authors' efforts to go beyond strictly technical issues in order to provide a conceptual perspec-
tive. They show us that there are underlying histories, theories, applications, and uncertainties
associated with their topics.

David Lohman

David Lohman examines the early history of intelligence testing and asks whether and what
we can learn from researchers who preceded us. His chapter is in large measure an historical
retrospective in which he strives to capture the context, psyche, and personalities of the research-
ers influential in the early development of intelligence tests. Lohman is correct to encourage
present-day researchers to seek to understand the roots of their fields. Too often, the filters of time
and social change reduce our view of early researchers and their contributions to brief sound
bites or caricatures: eugenicist, hereditarian, positivist, racist, and the like. As Lohman sug-
gests, such labels rarely capture the breadth and intellectual richness of these predecessors and
their work. Historical study can provide a genealogy of one's research area, and often, an illuminat-
ing view of the thoughts and conflicts of the early pioneers. Historical study also can be person-
ally rewarding and interesting because it gives not only a glimpse of conceptual issues and debates,
but also the nature of the accepted methodology of the time. It is a yardstick that shows how
much or little a field has progressed.
However, there are limitations to what historical study of a research area can provide for present-
day investigators. First, historical data are voluminous in most areas of study, too much to fully
examine and comprehend unless one can devote substantial time and effort. Thus, while study of
history can take one beyond simplistic views of earlier researchers and their work, the sheer
volume of pertinent historical records, including as Lohman suggests, not just the corpus of
scholarly work but also the social context within which that work was produced, becomes a
handicap to full understanding and appreciation of our predecessors.
Second, the past has passed and will not come again in the same form. History does not iso-
morphically reproduce itself in all particulars, thereby limiting what we can learn from it. The
issues and controversies that researchers study do repeat themselves - - reforms do reappear and
theories do recycle - - but the social, cultural and value context in which they repeat is inevitably
different from one era to another. Consider the case of intelligence testing. It is fair to say that the
characteristics of the intelligence tests used in the 1920s are not greatly dissimilar to the
characteristics of the intelligence tests used today. However, the perceptions of intelligence tests
today are quite different from those in the 1920s because the social and value milieu has changed.
Thus, although the study of history provides a perspective on past issues and thought, its mes-
sage and morals are not always directly transferable to present-day issues and thought. This real-
ity should not lead us to dismiss the study of the pioneers and ideologies that shaped a field of
study. Rather, it should temper expectations that the past can provide clear, direct answers to
present-day issues and research. In answer to Lohman's question, "What can we learn from his-
tory and how much can history inform subsequent practice?" the reasonable answer is "to some
degree, but not completely." The interesting and important question that flows from this reality
is "If the characteristics of intelligence tests have remained much the same since the 1920s, how
and why have the perception and use of these tests differed so markedly in different time periods?"
This question leads to Lohman's second and major theme: our research methods and under-
standings are infused with affective components derived from personal experiences and cultural
mores. Research is a blend of data and personal proclivities, both of which influence its under-
standings and interpretations. Lohman suggests that we are, in a real sense, prisoners of ourselves;
our experiences and the impact of our culture make our knowledge both contextual and
constructed. These contextual, constructed understandings are especially influential when situa-
tions of uncertainty arise in our research, as they frequently do.
Research cannot answer questions that seek to determine what should be done, even though
these often are the most important questions posed. Research findings are frequently used to
justify a particular answer to a "should" question, but the decision of what will actually - - or
should - - be done is determined ultimately by criteria such as feasibility, potential consequences,
beliefs, political rewards, and the like (Cuban, 1990; Airasian, 1988).
But common sense and even a passing acquaintance with the history of such "should" issues
indicates that beliefs and social values change over time - - witness the recent shift from a positiv-
ist to a constructivist view of knowledge acquisition. It will not be empirical research that will
dictate which - - if either - - of these epistemologies becomes transcendent in the near future. It
will be, as Lohman indicates, the broader political and social climate that will provide the answer.
Lohman supports his case by identifying a litany of factors and movements that formed and
forged the crucible from which early researchers derived concepts of intelligence. The result was
a culture that viewed sorting of children as a progressive reform and intelligence tests as the fair-
est measure to determine ability levels of all pupils, regardless of their racial, cultural, and language
characteristics.
Contrast this perspective to the 1970s, when the Civil Rights movement was in full swing. The
emphasis in schools was on equality for all students, cultural and language differences were viewed
positively rather than as an indication of intellectual inferiority, and test bias was a recognized
and important concern. In this era, the use of intelligence tests all but disappeared from schools
as intelligence testing fell into disrepute. Essentially, the same tests that were heralded as a viable
tool for classifying people in terms of their intelligence in the 1920s were viewed as biased and
unfair in the 1970s.
Lohman's list of the factors pressing on early intelligence researchers provides a compelling
example of how cultural and belief factors influence the nature and focus of research.
Unfortunately, the portrait painted treats each factor as if it were an equal contributor to the cultural/
social stew. While the sheer number of factors adds weight to the argument that cultural and
social factors do influence research and researchers, the number alone does not help us understand
the relative influences and cross influences among the factors. One wonders, for example, why
only a few states in the United States implemented sterilization for criminals thought to be "mental
defectives." Or, whether the perception of "inferior races" came before, after, or concomitant to
the development of intelligence tests. It is hard for historians to provide answers to such questions,
partially because those living through them seldom step out of immediate reality to pose them,
and partially because there are so many interactions among the broad domain of cultural factors.
Also, since we view history through the lens of our own social and cultural context, there is a
danger that the past will be unappreciated for its contributions and viewed simply as a series of
research failures leading up to the present era. Finally, with the rise of the constructivist epistemol-
ogy, especially that of the Social Constructivist bent, interpretation of history will become
characterized by reconstruction, deconstruction, and the existence of many quite different social
and cultural frameworks or camps. All of these factors place limits on what we can learn from
history.

Joan Herman

Joan Herman provides a comprehensive, contemporary review of large-scale, mandated, non-
multiple-choice assessment programs. She highlights the use of these alternative assessments as
a key spur to instructional reform and the attainment of broader, more complex student outcomes.
She ties the rise of alternative assessments to dissatisfaction with the narrowness and corrupt-
ibility of multiple-choice items, the growing emphasis on students learning higher-order thinking
skills, and the constructivist link between learning and cognition. The bulk of the remaining top-
ics centers on the psychometric characteristics of alternative assessments: their validity, reli-
ability, and generalizability.
In considering Herman's chapter, it is important to note some basic facts. First, alternative
assessments are not new to the education scene; they have had a long history of use. Before the
introduction of multiple-choice test items, educational assessments required that students construct
their responses or products. What has changed is the use of (1) common, large-scale alternative
assessments on a national, state- or district-wide basis; (2) broadly applicable standardized scor-
ing schemes and rubrics; and (3) consequential rewards or penalties for students and teachers
linked to student performance on the assessments. In essence, alternative assessments which most
teachers have long used in their individual classrooms have been centralized, routinized, and
made consequential.
Second, the consequential nature of many alternative assessment programs has the underlying
aim of altering curriculum and instruction. Consequential assessments communicate the kinds of
knowledge and skills that are valued. In the case of alternative assessments, the important
knowledge and skills involve student constructions, products, problem solving, and explana-
tions. Emphasis on assessing this knowledge and these skills is intended to shift teachers'
instructional emphases towards the processes required by the assessment tasks. Assessments
without consequences are usually not nearly as potent in driving curriculum and instruction as
those with consequences.
Third, the basic definition of alternative assessment - - students generate rather than select
their answers - - encompasses a broad range of task complexity, ranging from writing a one-
sentence response to an open-ended question to completing a term-long project containing student-
developed diagrams, schedules, models, narratives, and the like. When we discuss alternative
assessments it is important to recognize both their variability and the problems this variability
can have for teaching, as well as administering, scoring, and validating them. In most cases, with
the possible exception of some portfolio exercises, the alternative assessments in the large-scale
programs Herman describes fall at the narrow end of task complexity, in order to make them
manageable for large-scale administration and scoring.
Fourth, all assessment approaches have their strengths and weaknesses, as Herman's chapter
makes quite clear. She does indicate that standardized multiple-choice tests used in large-scale
assessment programs present a number of problems. There is truth to this claim. But it is also
true that many multiple-choice items can and do assess performances more complex than mere
remembering. Further, when large-scale alternative assessment programs are consequential in
nature, teachers also teach students to mimic anticipated assessment tasks and formats. Given
that teachers know the general format of a task, "prepping" students and narrowing the cur-
riculum have more to do with the importance of the test consequences for students and teachers
than with the type of assessment.
Alternative assessments do provide a change from multiple-choice tests, but how easily they
escape mimicry and validity corruption is not yet clear. Instances of schools, districts, and states
showing substantial increases in student performance on alternative assessments focusing on
presumably higher-level cognitive learning in relatively short periods of time raise questions about
the extent to which meaningful, higher level learning is actually taking place in these alternative
assessment situations. As Herman notes, one cannot tell whether a task - - any task - - is assess-
ing high- or low-level cognitive processes solely by examining the task itself. One must also
know about the instruction provided.
Particularly important in this regard is the argument that differences in students' opportunity
to learn should preclude assessment-based comparisons across different schools and curricula.
That is, if some students have weaker curricula or poorer teachers than others, large-scale,
consequential assessment programs should not be implemented. Several questions follow from
this view. How is equal opportunity defined and measured? Will we ever be able to attain equal
opportunity to learn for all our students? Should evidence that consequential assessments do push
curriculum and teaching in the direction of the assessed areas lead us to ignore differences in
opportunity to learn and implement consequential tests because they are the best means of nar-
rowing, if not equalizing, students' opportunities to learn? Opportunity to learn raises a number
of conflicting issues, but heretofore, most large-scale assessment programs have been initiated in
the face of knowledge that all students will not have had equal opportunities to learn.
The results of the validity and reliability studies Herman presents are instructive and, to a
degree, discouraging. There are difficulties in generalizing student performance on a particular
task to other tasks in the same content domain. Factors such as the format of assessment (e.g.,
whether a given task is presented as a written problem or as a "hands on" demonstration) often
produce quite different responses from the same student. Findings such as these raise issues about
the meaningfulness or validity of alternative assessment results.
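To make the generalizability problem concrete, the following sketch (in Python) estimates a one-facet generalizability coefficient from a small persons-by-tasks score matrix, using the standard ANOVA-based variance-component estimates. The scores are invented for illustration and are not drawn from any study Herman reviews; with these particular numbers, four tasks yield a coefficient of only about .65, and roughly a dozen tasks are needed to approach .85.

    import numpy as np

    # Hypothetical scores: rows are students, columns are performance tasks.
    # All numbers are invented for illustration only.
    scores = np.array([
        [4, 2, 3, 5],
        [3, 3, 2, 4],
        [5, 1, 4, 4],
        [2, 2, 1, 3],
        [4, 3, 3, 5],
    ], dtype=float)

    n_p, n_t = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    task_means = scores.mean(axis=0)

    # Sums of squares and mean squares from a persons-by-tasks ANOVA (no replication).
    ss_p = n_t * ((person_means - grand) ** 2).sum()
    ss_t = n_p * ((task_means - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_t
    ms_p = ss_p / (n_p - 1)
    ms_t = ss_t / (n_t - 1)
    ms_res = ss_res / ((n_p - 1) * (n_t - 1))

    # Estimated variance components (negative estimates truncated at zero).
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_t, 0.0)
    var_t = max((ms_t - ms_res) / n_p, 0.0)

    def g_coefficient(k):
        """Relative G coefficient for a score averaged over k tasks."""
        return var_p / (var_p + var_res / k)

    print("person variance:", round(var_p, 3))
    print("task variance:  ", round(var_t, 3))
    print("person x task (residual) variance:", round(var_res, 3))
    print("G with 4 tasks: ", round(g_coefficient(4), 2))
    print("G with 12 tasks:", round(g_coefficient(12), 2))

The point of the sketch is simply that dependable person-level scores require more tasks than large-scale alternative assessment programs typically administer.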
Reliability studies also engender concerns about the psychometric properties of alternative
assessments. Human scoring and judgment are at the heart of alternative assessments because
there is variation in what students produce or demonstrate. There have been successful efforts to
attain reliable scores for alternative assessments, as Herman notes. However, in virtually all cases,
narrow criteria, substantial scorer training, narrow assessment tasks, and limited breadth of
response are needed to ensure high scorer reliability. However, the narrower the assessment and
the more explicit the scoring criteria, the easier it is to prep students for the assessments - - not
unlike the problem of prepping students on multiple-choice items. There is, then, a trade-off
between reliable scoring and the potential corruptibility of performance assessments.
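The scorer side of that trade-off is easy to quantify. The short sketch below, again with invented numbers rather than data from any program discussed here, computes exact agreement and Cohen's kappa (agreement corrected for chance) for two hypothetical raters scoring the same ten responses on a four-point rubric.

    from collections import Counter

    # Hypothetical rubric scores (1-4) given by two trained raters to the same
    # ten student responses; all values are invented for illustration.
    rater_a = [3, 2, 4, 1, 3, 2, 4, 3, 1, 2]
    rater_b = [3, 2, 3, 1, 3, 2, 4, 4, 1, 2]

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement estimated from each rater's marginal score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    kappa = (observed - expected) / (1 - expected)
    print(f"exact agreement: {observed:.2f}")   # 0.80 for these invented scores
    print(f"Cohen's kappa:   {kappa:.2f}")      # about 0.73

High values on indices of this kind are typically purchased exactly as described above, through narrow criteria, extensive scorer training, and constrained tasks.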
Consequential, large-scale alternative assessment programs have altered classroom curricula
and broadened the domain of outcomes sought in schools. These assessment programs have led
classroom teachers to pay particular attention to the criteria of good performance, a change from
the intuitive and shifting criteria traditionally used by many teachers. But these assessment
programs have not solved psychometric problems associated with validating inferences that extend
beyond the specific alternative assessments administered and with establishing individual student
reliability for the small numbers of alternative assessment tasks typically found in large-scale
programs.
Further, the search for higher-order student outcomes via alternative assessments creates a
tension between instructional depth and breadth. More time is needed to help pupils master higher-
level thinking and problem solving skills than is required to learn lower-level, rote outcomes. If
there were clear evidence that higher-level thinking was generalizable to different issues and
content domains - - if, for example, problem solving skills were transferable across content
domains and problem types - - emphasis on instructional depth would be well-justified. However,
as Herman notes, it is not clear that such generalizations take place in any routinized way. So, in
the end, the excitement and promise many derive from emphasis on large-scale, consequential
alternative assessment programs must be tempered by a number of existing questions - - if not
problems - - that are raised in Herman's chapter and echoed in reports of national alternative
assessments programs in England (Gipps, Brown, McCallum, & McAlister, 1995). As so often
happens, our expectations outstrip our understandings and technology.

Filip Dochy and George Moerkerke

Dochy and Moerkerke provide a much needed perspective on small-scale alternative assess-
ments carried out in a single or small group of classrooms. Both Herman and Dochy and Moerk-
erke justify and argue for increased emphasis on alternative assessments because of limitations
of multiple-choice items and the needs of constructivist teaching and the broader "cognitive revolu-
tion." Both view the use of alternative assessments as a means to enhance and alter instruction,
providing a tighter integration of learning, instruction, and assessment. Both emphasize the
importance of higher-order, problem-based thinking skills which allow students to construct their
own solutions. Both also recognize that full implementation of the alternative assessments they
envision is not yet realized and may be some time away. Both are concerned with the psychometric
properties of alternative assessments, particularly their validity and reliability.
In spite of these similarities, there are differences that emerge largely from the nature, purpose,
and practices of large-scale versus small-scale alternative assessment. Large-scale assessment
programs are constructed by professional test developers, while small-scale programs are
developed by individual or small groups of teachers. These two groups approach alternative assess-
ment from quite different perspectives, experiences, expertise, and purpose. A teacher's considera-
tions in developing and interpreting classroom-based assessments are not necessarily the same as
those of developers of large-scale assessments. Large-scale programs are piloted and refined prior to
implementation in order to address issues of fairness and to justify the consequences for schools,
teachers, and students that typically emanate from the large-scale assessment results. Small-
scale, teacher-constructed alternative assessments are designed and administered without pilot-
ing or cross-classroom comparisons and are focused on the needs of students in that classroom.
Large-scale alternative assessments focus on the collection and analysis of empirical evidence of
the assessment's validity and reliability (see Herman). Small-scale assessments rarely are defended
using empirical evidence.
In light of these differences, Dochy and Moerkerke recognize and emphasize the need for
tools and heuristics to help teachers evaluate the quality of their own, small-scale, classroom-
based alternative assessments. While they do not rule out the use of empirical evidence in evaluat-
ing the quality of classroom-based assessments, they rightly focus on logic, description, and
argument rather than statistical analyses as the appropriate methods for teachers to identify and
defend the trustworthiness and validity of their alternative assessments. While most teachers do
carry out tacit self-examination intended to provide themselves with an implicit evaluation of the
adequacy of their assessment exercises and practices, Dochy and Moerkerke seek more formal
frameworks to validate the assessment procedures teachers develop and use.
The argument-based approach to evaluating small-scale alternative assessments is a realistic
and useful way for teachers to examine the nature and quality of their assessments. It recognizes
teachers' "on the fly" judgments and seeks to add focus to them. An argument-based approach
can lead teachers to confront assessment validity, recognize assessment strengths and weak-
nesses, and improve practice. Dochy and Moerkerke are not altogether clear about to whom a
teacher is to make the argument - - the teacher himself or herself or some external agent or agency.
Certainly it is most important for the teacher to justify the appropriateness of an assessment before
it is carried out.
Three concerns will arise in attempts to develop criteria for teachers' critiquing their own
classroom-based alternative assessments. First, if the identified criteria are lengthy and detailed,
they will be too cumbersome for teachers to use. On the other hand, if criteria are too global or
few in number, they may not provide an adequate basis for making convincing arguments about
assessment adequacy. Criteria must be important, small in number, and able to be easily evalu-
ated by teachers. Second, as Dochy and Moerkerke wisely point out, the quality of complex,
multi-faceted, ill-structured alternative assessments likely will not be best judged by a single set
of criteria. Task complexity influences the nature of suitable heuristics and criteria for assess-
ment task validity. Herman's discussion also emphasized this fact. Third, the criteria must represent
both desired content learning and the medium through which content learning is conveyed. Dochy
and Moerkerke stress the need for construct representation, that is, assessment criteria that represent
all the important dimensions of a performance assessment task. A performance task contains a
content outcome and a medium for conveying task performance (e.g., a written response, a model,
a drawing). When performance assessments are "authentic" or embedded in "real life" settings,
as encouraged by Dochy and Moerkerke, assessment of content mastery is often superseded by
focus on the means by which content mastery is presented. In such situations, assessment criteria
tend to emphasize rhetoric, style, neatness, persuasiveness, and the like, with less focus on the
desired content outcomes per se. In order to provide a representative indication of learning, assess-
ment criteria must focus more on the desired learning outcomes, less on the medium used to
convey mastery of the outcomes.
While the need to identify criteria that teachers can use to evaluate their alternative assess-
ments is important in its own right, Dochy and Moerkerke have a broader reason for this emphasis:
clear assessment criteria can lead to richer learning environments, improved student outcomes,
and better instructional practices. The authors wish to foster powerful learning environments
focused on problem-based, project-oriented, real-life, active, social-interactive learning. To assess
learning in such environments will demand new forms of assessments and Dochy and Moerk-
erke envision a new assessment culture that will change instructional emphasis from a rote, didactic
approach to one that develops students who are capable of learning on their own. They note that
"our point of view is that assessing high-order skills by means of authentic assessments will lead
to the teaching of such high-order knowledge and skills (p. 7)." Assessment will lead instruction.
Certainly assessment criteria can and do focus learning, as noted in my comments on Joan Her-
man's chapter. Often, however, the search for explicit criteria results in narrowed instruction and
diminished richness of intended student performance.
There are two final aspects of Dochy and Moerkerke's discussion of small-scale assessment
that need some comment. These aspects also are relevant to large-scale assessments. Making the
shift from didactic, knowledge-centered education to learner-centered constructivist education
will require (1) reexamination and relearning of the roles of student and teacher and (2) honest
discussion about our ability to carry out the teaching strategies required by higher-level student
outcomes. Teachers and students will be required to develop different classroom roles (Airasian
& Walsh, 1997). Teachers will have to learn to guide, not tell; to accept diversity in student
constructions and responses, not seek a single "right" answer; to construct classroom settings
that encourage students to disclose their constructions, not ones that are closed and judgmental.
Students will have to take initiative for their own learning, not wait for the teacher to tell them
what to do; to adapt to less structured instruction, not wait for didactic, rote instruction; to revisit
and self-evaluate their constructions, not move quickly from topic to topic with no reflection or
metacognitive activities. Such changes will not happen quickly or without reluctance, yet such
changes must take place if the envisioned shifts in learning, instruction, and assessment are to reach
fruition.
In addition, it is important that the existing knowledge base needed to implement the changes
in classroom instruction be honestly and thoughtfully scrutinized. Most discussions of construc-
tivism, the cognitive revolution, and alternative assessments rarely focus on instructional issues,
seeming not to view them as a problem. Discussion focuses on powerful learning environments,
higher-order and problem-based learning, students' personal constructions, discovery learning,
and real-life activities. Yet how much do we really know about implementing these instructional
strategies and the likely success of their implementation? This is an important concern, since
instruction is the least discussed but most central process necessary to effect desired educational
change.
Overall, Dochy and Moerkerke provide an argument that change in assessments can produce
changes in instruction and learning in small-scale classroom contexts. They provide a useful
discussion of issues related to improving teachers' assessments and instructional practices. They
adopt a realistic and appropriate view of the envisioned changes, arguing that there is room for
both the old and new assessments and that change will take time, perhaps a decade or more, to
begin to take hold. This is a refreshing contrast to the much more common approach that promises
to bring about relevant educational change in a few weeks or months.

Ronald Hambleton and Stephen Sireci

Like Dochy and Moerkerke, Hambleton and Sireci provide a futuristic glimpse. In their forward-
looking view, the future of achievement testing will be characterized by increased use of
constructed response item formats; successful efforts to measure higher-level thinking and reason-
ing processes; greater use of criterion-referenced scoring methods; heightened complexity of
measurement models; more rigor in judging the validity and reliability of test score inferences
and their uses; and application of computer technology to all stages of educational testing. In
short, the authors view the technologies of achievement testing as becoming richer, broader, and
more precise as we move into the next millennium.
Like Herman and Dochy and Moerkerke, Hambleton and Sireci focus on the limitations and
detrimental consequences of multiple-choice test items as one important justification for a greater
need for constructed item formats. They argue that multiple-choice items place limits on the
types of proficiencies that can be assessed, thus diminishing the validity and utility of test score
inferences. It seems more appropriate to say that when any single test item format is used in
assessment, the inferences that can be drawn from that assessment have diminished breadth, but
not necessarily diminished validity. If this is not the case, then the same argument made about
the limitations of standardized multiple-choice tests can also be made for constructed response
tests.
In fact, a number of limitations in constructed response assessments have led to retrenchment
of their use. The state of Georgia has abandoned its constructed response state assessment in
favor of standardized multiple-choice tests and the state of Kentucky has eliminated the group-
based constructed response section from the state assessment, although it has retained the portfolio
portion of the assessment. Political and cost factors led the state of California to eliminate its
constructed response-based statewide assessment. Great Britain's national constructed response
assessment program has been retrenched in recent years to redress the imbalance between time
devoted to instruction and time devoted to assessment. Balance has been attained by reducing the
number of constructed response items and tasks, which were sapping time available for instruc-
tion. In light of these changes, it will be interesting to watch the predicted trajectory of constructed
response item formats in the next few years. However, Hambleton and Sireci are correct in indicat-
ing that at this point in time, momentum still is with increased use of constructed response items.
It is very likely that the authors' prediction of increased emphasis on assessing higher-order
thinking skills will come to pass, since constructed-response items are still perceived to be a
fundamental way to assess these skills. However, although there presently is substantial support
for higher-level forms of thinking, that support is by no means unanimous. Large segments of the
population still wish instructional emphasis to be on so-called "basic skills," not on higher-level
thinking skills. They want students to think in the context of specific content (e.g., basic skill
areas) rather than in the context of generic, content independent processes. To the extent that
assessing higher-level skills is synonymous with the use of constructed response items, these two
areas are likely to rise and fall in unison, since the advantages and disadvantages of one are very
similar to those of the other.
The section describing the influences of computer technology for achievement testing outlines
many anticipated uses of the computer: building optimally efficient tests; permitting multiple
test administrations and immediate scoring; providing interactive testing using more complex
scenarios; and adapting testing to individual examinee responses to more quickly identify the
examinee's proficiency. In many ways, these uses of the computer represent the most future-
oriented areas described in this chapter, much like Dochy and Moerkerke's view of the importance
of computers and technology. The computer-based test development programs the authors describe
appear to provide a powerful tool for constructing tests that incorporate a number of desired
parameters. Similarly, the computer can administer complex, realistic, multi-branched test
"problems" that non-computer testing cannot mirror. Computer-based test administration holds
out the probability of increasing the convenience and perhaps comfort of most test takers and test
givers. However, in implementing this form of testing, issues of confidentiality, interpretation,
reporting, and the like must be addressed and resolved. The technology may be ready, but the
context in which that technology will be used warrants significant attention. In this regard, it is
important to note that there is a prior history of using computers for item pooling, adaptive test-
ing, and test administration. The fact that these applications have not become widespread in the
past raises questions as to how widespread they will be in the future. What will be the driving
force behind widespread adoption of these computer uses?
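To make the adaptive-testing idea concrete, the sketch below runs a bare-bones computer-adaptive loop under a two-parameter logistic (2PL) item response model: pick the unused item with the most Fisher information at the current ability estimate, record the response, and re-estimate ability. The five-item bank, the parameters, and the deterministic "examinee" are all invented for illustration; an operational system would add a larger calibrated bank, exposure control, content balancing, and a principled stopping rule.

    import math

    # Hypothetical 2PL item bank: (discrimination a, difficulty b) per item.
    # All parameters are invented for illustration.
    item_bank = [(1.2, -1.0), (0.8, -0.3), (1.5, 0.0), (1.0, 0.7), (1.3, 1.4)]

    def prob_correct(theta, a, b):
        """2PL probability of a correct response at ability theta."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def item_information(theta, a, b):
        """Fisher information of a 2PL item at ability theta."""
        p = prob_correct(theta, a, b)
        return a * a * p * (1.0 - p)

    def next_item(theta, administered):
        """Choose the unused item that is most informative at the current estimate."""
        unused = [i for i in range(len(item_bank)) if i not in administered]
        return max(unused, key=lambda i: item_information(theta, *item_bank[i]))

    def update_theta(theta, responses, steps=25, lr=0.4):
        """Crude maximum-likelihood update of theta by gradient ascent, kept in [-4, 4]."""
        for _ in range(steps):
            grad = sum(a * (u - prob_correct(theta, a, b)) for (a, b), u in responses)
            theta = max(min(theta + lr * grad, 4.0), -4.0)
        return theta

    # A deterministic stand-in for an examinee of ability 0.5: items at or below
    # that difficulty are answered correctly, harder items incorrectly.
    true_theta = 0.5
    theta_hat, administered, responses = 0.0, set(), []
    for _ in range(4):
        i = next_item(theta_hat, administered)
        a, b = item_bank[i]
        u = 1 if true_theta >= b else 0
        administered.add(i)
        responses.append(((a, b), u))
        theta_hat = update_theta(theta_hat, responses)
        print(f"item {i} (b = {b:+.1f})  response = {u}  ability estimate = {theta_hat:+.2f}")

Even this toy version shows the characteristic behavior: the estimate overshoots after the first correct answer and is pulled back as harder items are missed, which is one reason operational programs pair adaptive item selection with careful bank calibration and stopping rules.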
In a broader vein, this chapter raises a number of more general issues and questions. Some
of these could have been addressed by the authors, while others are more philosophical
concerns that merit their own, exclusive attention. In the former category - - issues and ques-
tions that might have been addressed in their chapter - - are information about the intercon-
nections among the six areas addressed and discussion of the advantages and disadvantages
of achievement testing moving in the directions predicted. It seems, for example, that there
are logical measurement links among areas such as the use of constructed response formats,
the emphasis on measuring higher order skills, the use of criterion referencing, new methods
of defining standards, and the validity and reliability of test inferences based on each of these
areas. How do measurement needs and practices apply to combinations of these areas? How,
if at all, is the interpretation of constructed response formats enhanced by the availability of
polytomous and multidimensional measurement models? What do we know about the valid-
ity and reliability of constructed response and higher-level items? Addressing questions
such as these could have shed additional light on the relationships among the six areas
described.
Allied to the above comments is a tension between the precision and meaningfulness of new
approaches to educational measurement; a play-off between technical elegance and efficiency,
on the one hand, and practical understanding and usage, on the other. Historically, educational
measurers have made many important contributions to school practices and programs. More
recently, however, as the tools of educational measurement have become increasingly sophisticated
and precise, their link to the nitty-gritty business of teaching and learning has become more
removed - - or at least more difficult to see. Perhaps this must be the case; the twain may meet,
but only infrequently and with little common understanding. If this is the case, it is a criticism of
neither educational measurers nor of school people. For the most part, each group increasingly
operates in a different sphere, as Lohman elegantly noted in expressing his concern for both
construct understanding and measurement precision. Perhaps all that can be done is to have
educational measurers understand that more precision is not always what is necessary to solve
important educational problems and to ask them to articulate the benefits of new methods in a
context and language that non-educational measurers can understand, evaluate, and adapt to their
practice.

Integrating Themes

Two themes will be used to integrate the information and ideas contained in these chapters.
The first is the role of social context in educational testing and assessment. The second is a ques-
tion: How much do we know about improving education?

Role of Social Context

Each chapter addressed the role of social context in research and practice. Lohman focused
directly and in detail on social context in his analysis of the factors that influenced the values and
beliefs of early intelligence researchers. Herman, Dochy and Moerkerke, and Hambleton and
Sireci addressed contextual factors less directly, through their discussion of student constructed-
responses and higher-order thinking processes which implicitly endorse a range of student
responses emanating from a range of student perspectives. Herman also noted the role of gender,
ethnicity, culture, and other personal and social characteristics as potentially biasing factors in
assessment.
There is, however, another, broader dimension of social context that is also pertinent. That
dimension concerns the contextual factors that lead to the adoption of educational reforms such
as those discussed by the authors. Why are reforms like constructed-response items, higher-order
thinking, computer applications, or constructivism advocated and implemented? Why these
particular reforms, but not others?
Most of the authors focused on recent changes in educational policies and practices, noting
many problems that preceded and prompted these changes: overuse of multiple-choice items,
overemphasis on teaching lower-level skills, low student performance on national and international
tests and assessments, and concern over inter-individual forms of measuring. These problems
have been widely discussed. But why did this set of problems lead to the particular solutions they
did? Why were alternative assessments and constructed responses selected as the strategies of
choice to alter perceived over-use of multiple-choice items? There are many ways to alter over-
emphasis on lower-order skills. Why was a constructivist perspective selected as the epistemologi-
cal orientation of choice?
There is little debate that each of the problems cited needed to be addressed. However, the
existence of these problems does not itself provide legitimacy to the specific reform strategies
adopted to solve the problems. There is an important difference between evidence that supports
the need for change and evidence that supports the selected strategy for change; documentation
of the need for reform is different from documentation of the efficacy of a particular reform
strategy. Reform strategies must seek their own sources of legitimacy and validity. For a variety
of reasons, empirical evidence rarely legitimizes reforms.
There are three reasons for this (Airasian, 1988). First, reforms are nearly always the result of
some social or educational problem that is perceived to demand urgent solution. Patience is a
virtue lacking when it comes to most educational reforms. Second, the political advantages linked
to instituting reforms provide a motive to avoid empirical study. Identified deficiencies in the
reform strategy may decrease or eliminate the political will for, and benefits of, its implementa-
tion. Third, there is a general lack of knowledge needed to remedy many identified educational
problems. Our expectations of reforms often outstrip both our understanding of them and our
capacity to carry them out. Many of the reforms advanced by policy makers are either too gener-
ally stated to provide guidance for their implementation or make erroneous assumptions about
the knowledge available or effort required to implement them.
Most often, support and legitimacy are found in a form of social validation, in the mesh between
relevant social values and the values perceived to be reflected in a particular reform strategy
(Parsons, 1956; Rowan, 1982). Issues about the nature of curriculum, assessment, instruction,
administration, testing, and the like are value issues, and pressure is placed on schools to align
with public shifts in values (Cuban, 1990). This is especially true because schools are conserva-
tive institutions in two respects. First, their basic function and rationale is to conserve the society's
values and mores. Second, they are also conservative in the sense that most school reforms emanate
from outside rather than inside the school. Schools respond to reforms; they rarely initiate them.
Thus, reforms calling for constructivist practice, alternative assessments, constructed-
response items, criterion-referenced testing, powerful learning environments, and computer-
adapted testing are the symptoms of changes in social values and school expectations. If we
ignore the role of social factors in reform, we will not understand why some reforms are selected
and others are not; why some reforms proliferate while others wither away. Nor will we have a
clear idea of why we are doing what we are doing. In a real sense, we must confront an issue
raised by Lohman and ask ourselves what social context factors promote particular reforms. To
obtain meaningful - - and hopefully transferable - - understanding of the role of social factors in
promoting reforms of different kinds, we should examine over time the enduring cyclical
dichotomies of education (e.g., nature vs. nurture, centralized vs. decentralized administration,
curriculum depth vs. coverage, homogeneous vs. heterogeneous grouping). We should examine
how one context and set of reforms is replaced by another context with its alternative reforms. To
capture the role of social context in educational reform, it will be useful to examine the ebb and
flow of such dichotomies across a number of cycles with an eye towards important social values
and beliefs.

How Much Do We Know

The authors discussed a number of educational processes and perspectives: higher-order think-
ing, student constructed responses, and constructivism, to cite a few. They made broad and
optimistic predictions about how these processes and perspectives might alter and improve educa-
tion and student learning, explicitly assuming that the new will be better than the old. In spite of
this optimism, however, it is not overly cynical to ask how much we really know about fostering
such processes and perspectives. How clear is our understanding of how they function in the
classroom where they will ultimately be implemented? How can we be sure that newer really
does mean better? And better for whom?
In examining these questions, it is necessary to distinguish between the technologies of educa-
tion and the judgmental processes associated with teaching. The technologies of education are
many and are exemplified by processes such as stating objectives, writing test items, developing
alternative assessment tasks, constructing rubrics, administering tests by computer, and employ-
ing multidimensional IRT models. An important characteristic of these and other technologies is
that they can be exported relatively intact to a variety of settings and situations. Teaching, on the
other hand, is not a technology in this sense. It is an idiosyncratic, uncertain, multi-faceted, trial
and error process. The techniques of teaching cannot be meaningfully transferred intact from one
setting to another without knowledge of the transmitting and receiving settings. The contextual
reality of classrooms - - the unique mix of students, teacher, and setting - - makes it difficult to
specify in detail suitable strategies of teaching, even if the objectives and assessments are similar
from one context to another.
We can easily become so enamored of our technologies that we forget the contexts in which
our technologies are applied. Lohman notes this in his lament that emphasis on understanding
intelligence as a construct has been superseded by the technologies of constructing and scoring
intelligence tests. Precision supersedes meaning.
The specifics of instruction, the most important and complicated link in the process of learn-
ing, are left to the discretion of classroom teachers, on the assumption that they can and will foster
the learnings stated in objectives and examined in the assessments. However, simply stating objec-
tives and placing related tasks into assessments do not guarantee that instructional quality will
improve. Stated in a more homely manner, the assumption is that we know how to foster in
students the many and varied learnings related to our technologies. This is, of course, a crucial
assumption, for if we are to focus on higher-order thinking, powerful learning environments, and
meaningful student constructions, it is critical that our knowledge of instruction be sufficient to
enable teachers to teach and pupils to learn these outcomes.
The hallmark of higher-order thinking is the utilization of cognitive processes more complex
than rote memorization to obtain solutions to unfamiliar problems (Bloom, Englehart, Furst, Hill,
& Krathwohl, 1956). The two defining properties of higher-order thinking are that it involves
problems requiring more than rote learning and that those problems are new to the learner. The key
instructional tasks involved in teaching higher-order thinking are to have pupils (1) represent
new problems in terms that they recognize, (2) select from their repertoire of information, rules,
principles, and experiences the ones that will help them solve the problem, and (3) apply the
knowledge to solving the problem.
This is not a simple process to teach. It requires different ways of teaching from those employed
to foster rote learning. It requires more time to teach because higher-order thinking develops
gradually. Its self-discovery aspect requires indirect, less structured instructional environments
and opportunities. We have available very few validated instructional programs or approaches to
guide the teaching of higher-order thinking.
At present, a popular response to this somewhat gloomy portrait of available knowledge is the
adoption of a constructivist framework. Constructivism is heralded by many as the instructional
panacea through which students will be encouraged and learn to construct their own knowledge.
However, many constructivist supporters overlook some basic facts about constructivism (Aira-
sian & Walsh, 1997). Constructivism is an epistemology, a philosophical (value-based) explana-
tion about the nature of knowledge. It is not an instructional model. We do not have a pedagogy
of constructivism ready to apply in classrooms. Although constructivism might lead to a model
of knowing and learning that could be useful for educational purposes, at this point in time the
constructivist model is descriptive, not prescriptive. The instructional path to students becoming
constructors of their own knowledge is not well marked and is potentially as murky as our
understanding of fostering higher-order thinking. These comments are not a criticism of the con-
structivist viewpoint. Rather, they are an attempt to point out current realities associated with
efforts to develop a pedagogy based on a constructivist epistemology.

Conclusion

In reflecting on the issues raised by the authors, a number of thoughts emerge. It is clear that
social contexts and their attendant values and ideologies are important factors in the adoption
and justification of proposed educational changes. However, it also is clear that values and ideolo-
gies alone are not sufficient bases for change. An exclusive reliance on value-based change
inevitably leads to the exercise of political power. Issues in value-based reform are not solved by
research and investigation; they are negotiated. At the extreme, value-based ideologues look for
and see only the benefits of the changes they seek. They reason with a zeal based primarily on the
values perceived to be inherent in a desired change, ignoring practical considerations of
implementation or existing knowledge about that change. Value-based change efforts rely heav-
ily on rhetoric, which, in turn, often results in advocates focusing on the politics of the desired
change, rather than on the desired change itself.
Value-based advocacy does provide a powerful signal that change may be needed. But value-
based advocacy rarely can provide information about how to approach the practical aspects and
problems of implementing that change. Thus, there are limits to reforms based solely on social
context and its associated values.
There are also obvious limits to basing educational change solely on empirical evidence. A
number of these problems were noted previously, but additional problems are that research results
are often equivocal, limited by the local situation, and unable to provide answers about the efficacy
of many important educational problems and processes. However, a lack of knowledge in some
areas does not imply a lack of knowledge in all areas. Many things are known about the educational
process; furthermore, even knowing areas in which knowledge is lacking is important for meaning-
ful consideration of proposed reforms.
In sum, we must be skeptical of reforms justified solely on the basis of values or solely on the
basis of empirical evidence. There are limits to the usefulness of each. In the real world, values
and empirical knowledge do and must work in concert to inform decisions about educational
change. Each may be more dominant at one time than the other, but some balance between values
and empiricism is necessary.

References

Airasian, P. W. (1988). Symbolic validation: The case of state-mandated, high-stakes testing. Educational Evaluation
and Policy Analysis, 10(4), 301-313.
Airasian, P. W., & Walsh, M. E. (1997). Constructivist cautions. Phi Delta Kappan, 78(6), 444-449.
Bloom, B. S., Englehart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives.
Handbook 1. Cognitive domain. New York: Longman.
Cuban, L. (1990). Reforming again, again, and again. Educational Researcher, 19(1), 3-13.
Gipps, C., Brown, M., McCallum, B., & McAlister, S. (1995). Intuition or evidence? Buckingham: Open University Press.
Parsons, T. (1956). Suggestions for a sociological approach to the theory of organizations-I. Administrative Science
Quarterly, 1, 63-85.
Rowan, B. (1982). Organizational structure and the institutional environment: The case of the public schools. Administra-
tive Science Quarterly, 27, 259-279.

Biography

Peter W. Airasian is Professor of Education at Boston College. His research interests lie in the
areas of classroom assessment and policy implications of large-scale testing.
