Popham 2014

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

Criterion-Referenced Measurement:

Half a Century Wasted?


Four serious areas of confusion have kept criterion-referenced
measurement from fulfilling its potential.
W. James Popham

ifty years ago, Robert Glaser


(1963) introduced the
concept of criterion-referenced
measurement in an article in
American Psychologist. In the
half century that has followed, this
approach has often been regarded
as the most appropriate assessment
strategy for educators who focus
more on teaching students than on
comparing them. Its early proponents
touted criterion-referenced testing
as a measurement strategy destined
to revolutionize education. But has
this approach lived up to its promise?
Letssee.

Origins of an Idea
To decide whether criterion-referenced
testing has accomplished what it set
out to accomplish, we need to understand its origins. Glaser, a prominent
University of Pittsburgh professor,
asserted in his seminal 1963 article
that certain instructional advances
could render traditional education
testing obsolete. More specifically, he
raised the issue of whether traditional
measurement methods, and especially
62

their score-interpretation strategies,


were appropriate in situations in
which instruction was truly successful.
During World War II, Glaser had
tested bomber-crew trainees by using
the then widely accepted norm-
referenced measurement methods
aimed chiefly at comparing test takers
with one another. Each trainees score
was interpreted by comparing it with
(or referencing it to) the scores
earned by previous trainees, usually
known as the norm group. Because
the norm groups performances were
usually nicely spread out across the
full range of possible scores, it was
easy to understand what it meant
for an individual test taker to score
at the 98th percentile or at the 30th
percentile. Such comparative interpretations were particularly useful
in military settings, providing a
straightforward way to select the
highest-scoring (and presumably
most-qualified) applicants to fill a
limited number of openings.
Following the war, Glaser pursued
his PhD at Indiana University and
studied with B.F. Skinner, often
regarded as the father of modern
behaviorism. In the late 1950s, Glaser

became an advocate of programmed


instruction, an approach growing
out of Skinners theories, in which
students worked through carefully
sequenced instructional materials that
were designed to present information
in small steps, provide immediate
feedback, and require learners to
correctly complete one step before
moving on to the next (Lumsdaine &
Glaser, 1960). Because practitioners of
programmed instruction relentlessly
revised their curriculum materials
until these materials were effective in
getting students to the desired learning
objective, Glaser and his programmed
instruction compatriots were often
able to produce high levels of learning
for essentially all students.
We might think that this accomplishment would engender jubilation.
However, a number of measurement
traditionalists were far from delighted.
Thats because the uniformly high
test results typically produced by
programmed instruction materials
exposed a serious shortcoming in traditional test-interpretation practices.
When the range of student scores
was compressed at the high end of
the scale, the possibility of useful

Educational Leadership / March 2014

Popham.indd 62

1/31/14 10:15 PM

edel rodriguez/theispot

s tudent-to-student comparisons
instantly evaporated.
Glaser recognized that a dramatic reduction in the variability
of students test scores would make
norm-referenced score interpretation
meaningless. After all, if nearly every
students score approached perfection,
it made no sense to compare one students near-perfect score with the nearperfect scores of other students. In
his landmark 1963 article, therefore,
Glaser proposed an alternative way
of interpreting students test performances in settings where instruction
was working really well. The label
he attached to this new, more

instructionally attuned score interpretation strategy, criterion-referenced


measurement, is still widely used today.
An Approach Preoccupied
with Instruction
Unlike the more traditional method
of referencing a given students
test score to the scores of other test
takers, Glasers proposed approach
called for referencing a students test
score to a criterion domaina clearly
described cognitive skill or body of
knowledge. For example, suppose
that a set of 250 hard-to-spell words
has been the focus of the school years
spelling instruction, and students

take a spelling test containing a representative sample of those words.


A criterion-referenced interpretation
of a students score would focus on
the number or percentage of the 250
words that the student was able to
spell correctly. We might report, for
example, that a students test score
signified that he or she had mastered
90 percent of the criterion domain
of hard-to-spell words. Or, if we had
previously determined that the proficiency cutoff score would be 90
percent, we might simply report the
students performance on this criterion
domain as proficient.
The contrast between a norm-
referenced and a criterion-referenced
interpretation is quite striking. On the
same end-of-year spelling test, a normreferenced interpretation might report
that a student who spelled 90percent
of the words correctly scored at the
78th percentile in relationship to
the scores of students in the norm
groupor at the 98th percentile, or
the 30th percentile, depending on how
well the norm group students performed. This norm-referenced interpretation, however, would be of little
use in deciding whether a particular
student had mastered the criterion
domain to the desired level.
Of course, advocates of criterionreferenced testing dont suggest that
there is no role for tests yielding
norm-referenced interpretations.
Indeed, in some situations its useful to
compare a students performance with
the performance of other students.
(For example, educators may want to
ASCD /

Popham.indd 63

w w w . ascd . o r g

63

1/31/14 10:26 PM

identify which students in a school


would most benefit from remedial
support or enrichment instruction.)
However, to support actionable
instructional decisions about how best
to teach students, norm-referenced
inferences simply dont cut it.
An inherent assumption of criterionreferenced assessment, then, is that
by articulating with sufficient clarity
the nature of the curricular aims
being assessed, and by building tests
that enable us to measure whether
individual students have achieved
those aims to the desired level, we
can teach students better. Criterionreferenced measurement, in every
significant sense, is a measurement
approach born of and preoccupied
with instruction.
Four Areas of Confusion
Glasers 1963 introduction of
criterion-referenced testing attracted
only modest interest from educators.
Actually, nothing more was published
on the topic until the late 1960s, when
a colleague and I published an article
analyzing the real-world education
implications of criterion-referenced
measurement (Popham & Husek,
1969). Nonetheless, a small number
of measurement specialists began to
tussle with issues linked to this innovative approach.
Here are four key issues we must
address to decide whether criterionreferenced measurement has lived up
to the instructional promises accompanying its birth.

Tests or Test Interpretations?


During the 1970s, when interest in
criterion-referenced measurement
began to flower, a misconception
emerged that still lingers: the idea that
there are criterion-referenced tests
and norm-referenced tests. This
is simply not so. Whats criterionreferenced or norm-referenced is the
64

inference about, or the interpretation


of, a test takers score. Although
test developers may build tests they
believe will provide accurate normreferenced or criterion-referenced
inferences, a test itself should never be
characterized as norm-referenced or
criterion-referenced.
To understand this point, imagine
a district-level accountability test
whose items are designed to measure
students mastery of three distinct
criterion domains representing three
key mathematical skills. The district
uses the test results to make criterionreferenced inferencesthat is, to

precise language in a measurement


arena where precision is so badly
needed, its score-based inferencesnot
teststhat are criterion-referenced or
norm-referenced.
Whats a Criterion?
One of the important early disagree
ments among devotees of criterion-
referenced measurement was what
the word criterion meant. In his 1963
essay, Glaser used the term the way it
was commonly employed in the early
1960s, to refer to a level of desired
student performance. In that same
essay, however, Glaser indicated that a

To support actionable instructional decisions


about how best to teach students, normreferenced inferences simply dont cut it.
measure the degree to which each
student has mastered the three key
math skills. However, after administering the test for several years, district
educators also develop normative
tables that enable them to compare
a students score with the scores of
previous test takers. Thus, students
performances, originally intended to
provide criterion-referenced inferences, could also be interpreted in a
norm-referenced manner. The test
itself hasnt changedonly the way
the results are interpreted has.
If a colleague refers to a norm-
referenced test or a criterion-
referenced test, you should not
necessarily regard this colleague as
a loose-lipped lout. Your colleague
might be casually referring to tests that
have deliberately been developed to
provide norm-referenced or criterion-
referenced interpretations. But to use

criterion identified a behavior domain,


such as a cognitive skill or a body of
knowledge.
Candidly, a degree of definitional
ambiguity existed in Glasers initial
essay. Nor did Husek and I improve
that situation in our 1969 followupregrettably, we also failed to take
a clear stance on the level-versusdomain issue.
Nonetheless, by the close of
the 1970s, most members of the
measurement community had abandoned the view of a criterion as a
level of performance (Hambleton,
Swaminathan, Algina, & Coulson,
1978), recognizing that the criterionas-domain view would make a greater
contribution to teachers instructional
thinking. Although determining
expected levels of student performance
is important, the mission of criterionreferenced measurement criteria is to

Educational Leadership / March 2014

Popham.indd 64

1/31/14 10:15 PM

tie down the skills or knowledge being


assessed so that teachers can target
instruction, not to set forth the levels
of mastery sought for those domains.
Regrettably, the criterion-as-level
view appears to be seeping back into
the thinking of some measurement
specialists. During several recent
assessment-related conferences, I
have heard colleagues unwisely characterize criterion-referenced testing
as an assessment strategy intended
to measure whether test takers have
reached a specified level of performance. Such a view makes little contribution to the kind of measurement
clarity Glaser thought would lead to
better instruction.
Whats the Optimal Grain Size?
We can properly consider tests that
provide criterion-referenced interpretations as ways of operationalizing the
curricular aims being measured. Thats
where grain-sizethe breadth of a criterion domaincomes in. If the grain
size of whats to be measured is either
too narrow or too broad, instructional
dividends disappear.
If each curricular domain is too
narrow, a teacher may be overwhelmed by too many domains. We
saw this clearly when the behavioral
objectives movement of the late 1960s
and early 1970s foundered because it
sought students mastery of literally
hundreds of behaviorally stated objectives (Popham, 2009). Sadly, that
same mistake was reenacted in recent
years when state education officials
adopted far too many state curriculum
standards. Moreover, the federal
government (in an effort to dissuade
states from aiming only at easy-toachieve curriculum targets) insisted
that each states annual accountability
tests measure students mastery of all
of that states standards. The result
was an excessive number of curricular
targetsfar too many for teachers

to use in day-to-day instructional


decision making
On the other hand, now that
so many states have adopted the
Common Core State Standards,
the assessment pendulum may be
swinging too far in the opposite
direction. At last report, the two
federally funded state assessment
consortia charged with creating assessments to measure students mastery of

that yields instructionally actionable


reports, then the architects of the
Common Core curricular aims are
likely to see their lofty education
aspirations realized. If, however, the
grain size of the Common Core assessments is too broad to guide teachers
in making sensible instructional
moves, then our optimism regarding
the Common Core initiative should
diminish.

the Common Core standards appear


intent on reporting a students performance on their tests at a remarkably
general level. In the case of reading,
for example, this is the assessment
claim one assessment consortium
plans to use to report a students performance: Students can read closely
and analytically to comprehend a
range of increasingly complex literary
and informational texts (Smarter Balanced Assessment Consortium, 2012).
Teachers are certain to be baffled
about what such a broad domain is
actually intended to measure. If the
Common Corefocused assessment
domains remain too broad, the
criterion-referenced inferences about
students performances that these tests
yield may be instructionally useless.
And therein lies the dilemma that
determines the promiseor the
impotenceof criterion-referenced
assessment. If students mastery of
the Common Core State Standards is
measured with criterion referencing

How Much Descriptive


Detail Should We Provide?
Criterion-referenced measurement
revolves around clear descriptions of
what a test is measuring. If teachers
possess a clear picture of what their
students are supposed to be able to
do when instruction is over, those
teachers will be more likely to
design and deliver properly focused
instruction. And if the test shows that
an instructional sequence has failed
to work satisfactorily, a clear criterion
domain description can help isolate
needed adjustments so that the teacher
can make the next version of the
instructional sequence more effective.
Its just common sense: Clarified
descriptions of curricular ends permit
teachers to more accurately select and
refine their instructional means.
However, we need to include the
right amount of detail when describing
curricular targets, or few educators
will actually employ such descriptions.
Too brief or too detailed descriptions
ASCD /

Popham.indd 65

w w w . ascd . o r g

65

1/31/14 10:18 PM

of criterion domains can erode the


instructional dividends of criterionreferenced measurement.
Over the years, particularly since the
mid-1960s, U.S. educators have often
made these two opposite but equally
serious mistakes when describing the
criterion domains to be taught and
measured. Initially, educators tried to
describe what tests ought to measure
by using extremely abbreviated statements of instructional objectives. But
such abbreviated statements frequently
led to misinterpretation. To avoid this
problem, certain assessment specialists
tried to describe the nature of criterion

test takers performances in relative


terms (that is, by referencing these
performances to the performances of
other test takers), criterion-referenced
measurement is an absolute interpretive strategy in which students
performances are referenced to clearly
explicated domains of knowledge
or skills. This fundamental relativeversus-absolute distinction continues to
be important.
However, in our attempts to
implement criterion-referenced
measurement, we have sometimes
made four serious mistakes that have
robbed it of its instructional potential.

and demand that any instructionally


oriented assessments avoid the four
implementation errors identified here,
Glasers assessment gift to education
will fulfill its promise and foster the
improved instruction its early advocates foresaw.
But please, lets get this done
without waiting another 50 years. EL
Authors note: This article is based on
a presentation made at the Teach Your
Children Well conference honoring Professor Ronald K. Hambleton, held on
November 912, 2012, at the University of
Massachusetts, Amherst.

References

Criterion-referenced testing as Glaser


conceptualized it represented an important
departure from traditional thinking.
domains in great detail (Hively, 1974).
Sometimes the description of a single
domain consumed 35 single-spaced
pages. Unfortunately, the longer and
more detailed these descriptions were,
the less likely it was that busy educators possessed the patience to use
themor even read them.
Clearly, we need Goldilocks
domain descriptions, in which the
level of descriptive detail is neither too
brief nor too elaborate, but just right.
Promises Fulfilled?
Looking back on 50 years of criterionreferenced measurement, what can we
conclude? Has Glasers concept lived
up to his vision?
Criterion-referenced testing as
Glaser conceptualized it represented
an important departure from traditional thinking. Instead of interpreting
66

We have been sloppy in the way


we think and talk about criterionreferenced measurement, often
slapping the label criterion-referenced
on tests rather than on test-based
interpretations. Weve also sometimes subscribed to a dysfunctional
criterion-as-level definition of this
approach to testing, beclouding our
measurement picture even more. We
have failed to focus our tests on a
reasonable number of instructionally
digestible assessment targets. And we
often havent described the domains of
skills or knowledge being assessed in
practical language. These four implementation mistakes have distorted
the instructional use of criterion-
referenced measurement.
Happily, all of these mistakes are
rectifiable. If educators understand the
basics of criterion-referenced testing

Glaser, R. (1963). Instructional technology


and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519521.
Hambleton, R. K., Swaminathan, H.,
Algina, J., & Coulson, D. B. (1978).
Criterion-referenced testing and measurement: A review of technical issues
and developments. Review of Educational
Research, 48(1), 147.
Hively, W. (1974). Introduction to
domain-referenced testing. Educational
Technology, 14, 510.
Lumsdaine, A. A., & Glaser, R. (Eds.).
(1990). Teaching machines and programmed learning: A source book.
Washington, DC: National Education
Association.
Popham, W. J. (2009). Unlearned lessons:
Six stumbling blocks to our schools
success. Cambridge, MA: Harvard Education Press.
Popham, W. J., & Husek, T. (1969).
Implications of criterion-referenced
measurement. Journal of Educational
Measurement, 6(1), 19.
Smarter Balanced Assessment Consortium.
(2012, March 1). Claims for the English
language arts/literacy summative
assessment. Retrieved from www.smarter
balanced.org/wordpress/wp-content/
uploads/2012/09/Smarter-BalancedELA-Literacy-Claims.pdf
W. James Popham (popham@ucla.edu)
is emeritus professor at the University of
CaliforniaLos Angeles. His most recent
book is Evaluating Americas Teachers:
Mission Possible? (Corwin, 2013).

Educational Leadership / March 2014

Popham.indd 66

1/31/14 10:15 PM

Copyright of Educational Leadership is the property of Association for Supervision &


Curriculum Development and its content may not be copied or emailed to multiple sites or
posted to a listserv without the copyright holder's express written permission. However, users
may print, download, or email articles for individual use.

You might also like