Frederic M. Lord - Applications of Item Response Theory To Practical Testing Problems (1980) - 1-25
FREDERIC M. LORD
Educational Testing Service
Routledge
Taylor & Francis Group
270 Madison Avenue, New York, NY 10016
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
Publisher's Note
The publisher has gone to great lengths to ensure the quality of this reprint
but points out that some imperfections in the original may be apparent.
Contents
Preface
The topics, organization, and presentation are those used in a 4-week seminar
held each summer for the past several years. The material is organized primarily
to maintain the reader's interest and to facilitate understanding; thus all related
topics are not always packed into the same chapter. Some knowledge of classical
test theory, mathematical statistics, and calculus is helpful in reading this
material.
Chapter 1, a perspective on classical test theory, is perhaps not essential for
FREDERIC M. LORD
I INTRODUCTION TO ITEM RESPONSE THEORY
1 Classical Test Theory—
Summary and Perspective
1.1. INTRODUCTION
This chapter is not a substitute for a course in classical test theory. On the
contrary, some knowledge of classical theory is presumed. The purpose of this
chapter is to provide some perspective on basic ideas that are fundamental to all
subsequent work.
A psychological or educational test is a device for obtaining a sample of
behavior. Usually the behavior is quantified in some way to obtain a numerical
score. Such scores are tabulated and counted. Their relations to other variables of
interest are studied empirically.
If the necessary relationships can be established empirically, the scores may
then be used to predict some future behavior of the individuals tested. This is
actuarial science. It can all be done without any special theory. On this basis, it is
sometimes asserted from an operationalist viewpoint that there is no need for any
deeper theory of test scores.
Two or more "parallel" forms of a published test are commonly produced.
We usually find that a person obtains different scores on different test forms.
How shall these be viewed?
Differences between scores on parallel forms administered at about the same
time are usually not of much use for describing the individual tested. If we want a
single score to describe his test performance, it is natural to average his scores
across the test forms taken. For usual scoring methods, the result is effectively
the same as if all forms administered had been combined and treated as a single
test.
The individual's average score across test forms will usually be a better
measurement than his score on any single form, because the average score is
based on a larger sample of behavior. Already we see that there is something of
deeper significance than the individual's score on a particular test form.
$$
\rho_{XT}^2 \equiv \frac{\sigma_{XT}^2}{\sigma_X^2 \sigma_T^2} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}. \tag{1-6}
$$
If $\rho_{XT}$ were nearly 1.00, we could safely substitute the available test score X for
the unknown measurement of interest T.
Equations (1-2) through (1-6) are tautologies that follow automatically from
the definition of T and E.
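
The chain of equalities in (1-6) can be checked by hand on a tiny population. The sketch below is illustrative only: the four true scores and the two equally likely error values are hypothetical numbers, chosen so that E has mean zero and is exactly uncorrelated with T, as the definitions of T and E require.

```python
from itertools import product
from math import isclose


def pmean(values):
    """Population mean."""
    return sum(values) / len(values)


def pcov(a, b):
    """Population covariance (divides by N, not N - 1)."""
    mean_a, mean_b = pmean(a), pmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


# Hypothetical population: every true score occurs once with each error value,
# so the errors average to zero at every level of T.
true_values = [10.0, 12.0, 15.0, 20.0]
error_values = [-2.0, 2.0]
pairs = list(product(true_values, error_values))

T = [t for t, _ in pairs]
E = [e for _, e in pairs]
X = [t + e for t, e in pairs]  # observed score = true score + error

var_T, var_E, var_X = pcov(T, T), pcov(E, E), pcov(X, X)

rho2_definition = pcov(X, T) ** 2 / (var_X * var_T)  # squared correlation of X and T
rho2_variance_ratio = var_T / var_X                  # true variance over observed variance
rho2_from_error = 1 - var_E / var_X                  # one minus error variance over observed

print(rho2_definition, rho2_variance_ratio, rho2_from_error)
```

All three printed values agree, which is what it means for these relations to follow from the definitions rather than from any assumption about data.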
What has our deeper theory gained for us? The theory arises from the realization
that T, not X, is the quantity of real interest. When a job applicant leaves
the room where he was tested, it is T, not X, that determines his capacity for
future performance.
We cannot observe T, but we can make useful inferences about it. How this is
done becomes apparent in subsequent sections (also, see Section 4.2).
An example will illustrate how true-score theory leads to different conclusions
than would be reached by a simple consideration of observed scores. An
achievement test is administered to a large group of children. The lowest scoring
children are selected for special training. A week later the specially trained
children are retested to determine the effect of the training.
True-score theory shows that a person may receive a very low test score either
because his true score is low or because his error score E is low (he was
unlucky), or both. The lowest scoring children in a large group most likely have
not only low T but also low E. If they are retested, the odds are against their being
so unlucky a second time. Thus, even if their true scores have not increased, their
observed scores will probably be higher on the second testing. Without true-score
theory, the probable observed-score increase would be credited to the special
training. This effect has caused many educational innovations to be mistakenly
labeled "successful."
It is true that repeated observations of test scores and retest scores could lead
the actuarial scientist to the observation that in practice, other things being equal,
initially low-scoring children tend to score higher on retesting. The important
point is that true-score theory predicts this conclusion before any tests are given
and also explains the reason for this odd occurrence. For further theoretical
discussion, see Linn and Slinde (1977) and Lord (1963). In practical applications,
we can determine the effects of special training for the low-scoring children
by splitting them at random into two groups, comparing the experimental
group that received the training with the control group that did not.
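
The retesting effect described above can be reproduced in a small simulation. This is a sketch under stated assumptions, not a model of any real test: the score scale (true scores with mean 50 and standard deviation 10, errors with standard deviation 5), the 10% selection rule, and the normal distributions are all hypothetical. No training effect is simulated, so any gain comes from the errors alone.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

N_CHILDREN = 10_000
# Hypothetical scale: T ~ N(50, 10); E ~ N(0, 5), drawn afresh at each testing.
true_scores = [rng.gauss(50, 10) for _ in range(N_CHILDREN)]
test_1 = [t + rng.gauss(0, 5) for t in true_scores]
test_2 = [t + rng.gauss(0, 5) for t in true_scores]  # retest: same T, new E

# Select the lowest-scoring 10% on the first testing.
cutoff = sorted(test_1)[N_CHILDREN // 10]
selected = [i for i in range(N_CHILDREN) if test_1[i] < cutoff]

# The selected children were, on average, unlucky the first time (negative E).
mean_error_1 = mean(test_1[i] - true_scores[i] for i in selected)
# Their observed scores rise on retest although no true score has changed.
gain = mean(test_2[i] - test_1[i] for i in selected)

# Random split into experimental and control halves. Because no training
# effect is simulated, both halves should show about the same gain.
rng.shuffle(selected)
half = len(selected) // 2
experimental, control = selected[:half], selected[half:]
gain_experimental = mean(test_2[i] - test_1[i] for i in experimental)
gain_control = mean(test_2[i] - test_1[i] for i in control)

print(f"mean first-test error of selected group: {mean_error_1:.2f}")
print(f"mean gain on retest:                     {gain:.2f}")
print(f"experimental minus control gain:         {gain_experimental - gain_control:.2f}")
```

Comparing the two randomly formed halves, rather than comparing retest with first test, is what separates a real training effect from the gain that selection on a low observed score produces by itself.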
Note that we do not define true score as the limit of some (operationally
impossible) process. The true score is a mathematical abstraction. A statistician
doing an analysis of variance components does not try to define the model
Equations (1-1) through (1-6) cannot be disproved by any set of data. These
equations do not enable us to estimate $\sigma_T^2$, $\sigma_E^2$, or $\rho_{XT}$, however. To estimate
these important quantities, we need to make some assumptions. Note that no
assumption about the real world has been made up to this point.
It is usual to assume that errors of measurement are uncorrelated with true
scores on different tests and with each other: For tests X and Y,
$$
\rho(E_X, E_Y) = 0, \qquad \rho(E_X, T_Y) = 0 \qquad (X \neq Y). \tag{1-7}
$$
Exceptions to these assumptions are considered in path analysis (Hauser &
Goldberger, 1971; Milliken, 1971; Werts, Linn, & Jöreskog, 1974; Werts,
Rock, Linn, & Jöreskog, 1977).
1.5. ENVOI
In item response theory (as discussed in the remaining chapters of this book) the
expected value of the observed score is still called the true score. The discrepancy
between observed score and true score is still called the error of measurement.
The errors of measurement are thus necessarily unbiased and uncorrelated
with true score. The assumptions of (1-7) will be satisfied also; thus all the
remaining equations in this chapter, including those in the Appendix, will hold.
Nothing in this book will contradict either the assumptions or the basic conclusions
of classical test theory. Additional assumptions will be made; these will
allow us to answer questions that classical theory cannot answer. Although we
will supplement rather than contradict classical theory, it is surprising how little
we will use classical theory explicitly.
Further basic ideas and formulas of classical test theory are summarized for
easy reference in an appendix to this chapter. The reader may skip to Chapter 2.
APPENDIX
Composite Tests
Up to this point, there has been no assumption that our test is composed of
subtests or of test items. If the test score X is a sum of subtest or item scores Yi,
so that
$$
X = \sum_{i=1}^{n} Y_i,
$$
where $i'$ indexes the items in test $X'$. If all subtests are parallel,
$$
\rho_{XX'} = \frac{n\rho_{YY'}}{1 + (n-1)\rho_{YY'}}, \tag{1-17}
$$
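
Relation (1-17), commonly known as the Spearman-Brown formula, is easy to evaluate directly. The sketch below is a plain transcription of it; the function name and the example values are hypothetical and are not taken from the text.

```python
from math import isclose


def spearman_brown(rho_yy: float, n: float) -> float:
    """Reliability of a test built from n parallel subtests, per (1-17).

    rho_yy is the correlation between two parallel subtests (the reliability
    of one subtest); n is the number of parallel subtests combined.
    """
    return n * rho_yy / (1 + (n - 1) * rho_yy)


# Hypothetical subtest reliability of 0.50, combined in increasing numbers.
for n_subtests in (1, 2, 4, 8):
    print(n_subtests, round(spearman_brown(0.50, n_subtests), 4))
```

With n = 1 the formula returns the subtest reliability unchanged, and the value rises toward 1 as more parallel subtests are added.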
$$
\rho_{XT}^2 = \rho_{XX'} \geq \frac{n}{n-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_X^2}\right) \equiv \alpha. \tag{1-18}
$$
Alpha is not a reliability coefficient; it is a lower bound.
If items are scored either 0 or 1, α becomes the Kuder-Richardson formula-20
coefficient ρ20: from (1-18) and (1-23),
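$$
\rho_{XT}^2 = \rho_{XX'} \geq \frac{n}{n-1}\left[1 - \frac{\sum_i \pi_i(1-\pi_i)}{\sigma_X^2}\right] = \rho_{20}, \tag{1-19}
$$
$$
\rho_{20} \geq \frac{n}{n-1}\left[1 - \frac{\mu_X(n-\mu_X)}{n\sigma_X^2}\right] = \rho_{21}, \tag{1-20}
$$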
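
The lower bounds in (1-18) through (1-20) can be computed from a matrix of item scores. The following is a minimal sketch: the 6-by-4 matrix of 0/1 responses is made up for illustration, and sample statistics (with divisor N) are substituted for the population parameters in the formulas.

```python
from math import isclose


def pvar(values):
    """Variance with divisor N, standing in for the population variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def coefficient_alpha(rows):
    """Alpha as in (1-18): rows are examinees, columns are item scores."""
    n = len(rows[0])
    totals = [sum(row) for row in rows]
    item_variances = [pvar([row[i] for row in rows]) for i in range(n)]
    return n / (n - 1) * (1 - sum(item_variances) / pvar(totals))


def kr20(rows):
    """Kuder-Richardson formula 20 as in (1-19), for items scored 0 or 1."""
    n = len(rows[0])
    totals = [sum(row) for row in rows]
    pi = [sum(row[i] for row in rows) / len(rows) for i in range(n)]
    return n / (n - 1) * (1 - sum(p * (1 - p) for p in pi) / pvar(totals))


def kr21(rows):
    """Kuder-Richardson formula 21 as in (1-20): needs only mean and variance of X."""
    n = len(rows[0])
    totals = [sum(row) for row in rows]
    mu = sum(totals) / len(totals)
    return n / (n - 1) * (1 - mu * (n - mu) / (n * pvar(totals)))


# Hypothetical responses of six examinees to four dichotomous items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = coefficient_alpha(responses)
rho_20 = kr20(responses)
rho_21 = kr21(responses)
print(alpha, rho_20, rho_21)
```

For 0/1 items the item variance equals the product of the proportion correct and the proportion incorrect, so alpha and formula 20 coincide here, while formula 21 is smaller because the item difficulties differ.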
Item Theory
Denote the score on item i by Yi. Classical item analysis provides various
tautologies. The variance of the test scores is
where $\rho_{ij}$ and $\rho_{iX}$ are Pearson product moment correlation coefficients. If $Y_i$ is
always 0 or 1, then X is the number-right score, the interitem correlation $\rho_{ij}$ is a
phi coefficient, and $\rho_{iX}$ is an item-test point biserial correlation. Classical item
analysis theory may deal also with the biserial correlation between item score and
test score and with the tetrachoric correlations between items (see Lord &
Novick, 1968, Chapter 15). In the case of dichotomously scored items ($Y_i$ = 0 or
1), we have
$$
\mu_X = \sum_{i=1}^{n} \pi_i, \tag{1-22}
$$
$$
\rho_{XC} = \frac{\sum_i \sigma_i \rho_{iC}}{\sqrt{\sum_i \sum_j \sigma_i \sigma_j \rho_{ij}}}. \tag{1-25}
$$
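
Formula (1-25) expresses the correlation of the test score X with a criterion C entirely in terms of item statistics. As a hedged numerical check, the sketch below compares the formula with the correlation computed directly from total scores. The item responses and criterion values are invented for illustration, and all statistics use divisor N.

```python
from math import isclose, sqrt


def pmean(values):
    return sum(values) / len(values)


def pcov(a, b):
    """Covariance with divisor N."""
    mean_a, mean_b = pmean(a), pmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def psd(a):
    return sqrt(pcov(a, a))


def corr(a, b):
    """Pearson product moment correlation."""
    return pcov(a, b) / (psd(a) * psd(b))


# Hypothetical data: six examinees, three items, one external criterion score.
item_scores = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
]
criterion = [3.0, 2.0, 1.0, 5.0, 2.5, 4.0]

n_items = len(item_scores[0])
items = [[row[i] for row in item_scores] for i in range(n_items)]
X = [sum(row) for row in item_scores]  # test score is the sum of item scores

sigma = [psd(y) for y in items]
numerator = sum(sigma[i] * corr(items[i], criterion) for i in range(n_items))
denominator = sqrt(
    sum(
        sigma[i] * sigma[j] * corr(items[i], items[j])
        for i in range(n_items)
        for j in range(n_items)
    )
)

rho_formula = numerator / denominator  # right-hand side of (1-25)
rho_direct = corr(X, criterion)        # correlation computed from total scores
print(rho_formula, rho_direct)
```

The two values agree to rounding error. The numerator grows with the item-criterion correlations, while the denominator grows with the interitem correlations.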
These two formulas provide the two paradoxical classical rules for building a
test:
Overview
Classical test theory is based on the weak assumptions (1-7) plus the assumption
that we can build strictly parallel tests. Most of its equations are unlikely to be
contradicted by data. Equations (1-1) through (1-13) are unlikely to be falsified,
since they involve the unobservable variables T and E. Equations (1-15), (1-16),
and (1-20) through (1-25) cannot be falsified because they are tautologies.
The only remaining equations of those listed are (1-14) and (1-17) through (1-19).
These are the best known and most widely used practical outcomes of classical
test theory. Suppose when we substitute sample statistics for parameters in
(1-17), the equality is not satisfied. We are likely to conclude that the discrepancies
are due to sampling fluctuations or else that the subtests are not really
strictly parallel.
The assumption (1-7) of uncorrelated errors is also open to question, however.
Equations (1-7) can sometimes be disproved by path analysis methods. Similar
comments apply to (1-14), (1-18), and (1-19).
Note that classical test theory deals exclusively with first and second moments:
with means, variances, and covariances. An extension of classical test theory
to higher-order moments is given in Lord and Novick (1968, Chapter 10). Without
such extension, classical test theory cannot investigate the linearity or nonlinearity
of a regression, nor the normality or nonnormality of a frequency
distribution.
REFERENCES
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability of behavioral
measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.
Hauser, R. M., & Goldberger, A. S. The treatment of unobservable variables in path analysis. In H.
L. Costner (Ed.), Sociological methodology, 1971. San Francisco: Jossey-Bass, 1971.
Linn, R. L., & Slinde, J. A. The determination of the significance of change between pre- and
posttesting periods. Review of Educational Research, 1977, 47, 121-150.
Lord, F. M. Elementary models for measuring change. In C. W. Harris (Ed.), Problems in
measuring change. Madison: University of Wisconsin Press, 1963.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.:
Addison-Wesley, 1968.
Milliken, G. A. New criteria for estimability for linear models. The Annals of Mathematical
Statistics, 1971, 42, 1588-1594.
Werts, C. E., Linn, R. L., & Jöreskog, K. G. Intraclass reliability estimates: Testing structural
assumptions. Educational and Psychological Measurement, 1974, 34, 25-33.
Werts, C. E., Rock, D. A., Linn, R. L., & Jöreskog, K. G. Validating psychometric assumptions
within and between several populations. Educational and Psychological Measurement, 1977,
37, 863-872.