Journal of Educational and Behavioral Statistics

Summer 2004, Vol. 29, No. 2, pp. 245–249

The Meaning and Consequences of “Reliability”


Pamela A. Moss
University of Michigan

The concern behind my question, “Can there be validity without reliability?” (Moss, 1994), was about the influence of measurement practices on the quality of education. I argued that conventional operationalizations of reliability in the measurement literature, which I summarized as “consistency, quantitatively defined, among independent observations or sets of observations that are intended as interchangeable” (Moss, 1994, p. 6), unnecessarily privileged certain assessment practices over others. I argued that consideration of alternative epistemologies, in this case hermeneutics, might (a) expand the range of assessment practices considered sound and, more importantly, (b) illuminate taken-for-granted theories and practices of psychometrics for critical review.
I characterized hermeneutics as involving “a holistic and integrative approach
to interpretation of human phenomena that seeks to understand the whole in light
of its parts, repeatedly testing interpretations against the available evidence until
each of the parts can be accounted for in a coherent interpretation of the whole”
(Moss, 1994, p. 7). I cited familiar examples from higher education of practices
consistent with hermeneutics, including how we grant tenure, confer doctoral degrees,
or hire new colleagues, and I pointed to published examples of similar practices in
K-12 education. These practices all involve consensus-seeking dialogue among
knowledgeable evaluators, representing different perspectives, who integrate
multiple pieces of evidence to make a decision about an individual.¹
Both Li (2003) and Mislevy (2004) maintain that hermeneutics is not in conflict with their conceptions of psychometrics. While there are commonalities between these disciplines, there are also disjunctions (Moss & Schutz, 2001). Unless we also seek out the disjunctions, we risk what Bernstein (1992) calls “flabby pluralism,” that is, “assimilating what others are saying to our own categories and language without doing justice to what is genuinely different” (Bernstein, 1992, p. 66), and we diminish the opportunity for critical reflection that encounters with other disciplines can provide.
Li (2003) asserts that it was incorrect for me to focus my question about reliability on operational (as distinct from theoretical) definitions. Noting that operational definitions “are valid only for certain measurement models” (Li, 2003, p. 90), Li cited what he described as a “general model free definition” (Li, 2003, p. 91) based on equivalent theoretical definitions provided by Guilford (1954) and Lord and Novick (1968) (hereafter L&N). Against this assertion, I would argue, first, that it is the operational definitions of reliability that shape the assessments teachers and students experience. If our concern is about the influence of measurement practices on education, then it is the commonly used operational definitions of reliability that must be confronted. Second, I would argue that few if any theoretical definitions are completely “model-free,” and certainly not the definitions of L&N and Guilford. While these theoretical definitions have broader reference than the particular operational definitions through which they are instantiated, they nevertheless separate the set of (potential) operational definitions that are consistent with the theoretical definitions from those that are not. As L&N note, the constructs of classical measurement theory lead to “specific practical restrictions on the experimental design used for estimating” their values (Lord & Novick, 1968, p. 129). For instance, assumptions of independence and reliance on replications have implications for any operationalization of L&N’s definition of reliability.
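
In symbols, the classical true-score formulation underlying L&N’s theoretical definition can be sketched as follows; this is a standard textbook rendering, not a quotation from either source:

\begin{align*}
  X &= T + E, \qquad \operatorname{Cov}(T, E) = 0
      && \text{(observed score as true score plus error)} \\
  \rho^{2}_{XT} &= \frac{\sigma^{2}_{T}}{\sigma^{2}_{X}}
                 = \frac{\sigma^{2}_{T}}{\sigma^{2}_{T} + \sigma^{2}_{E}}
      && \text{(theoretical definition: a ratio of variances)} \\
  \rho_{XX'} &= \rho^{2}_{XT}
      && \text{(operational route: correlation between parallel forms $X$ and $X'$)}
\end{align*}

Because the true-score variance is not directly observable, any estimate depends on replications assumed to be parallel and to have independent errors, which is exactly the kind of practical restriction on experimental design noted above.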
This concern about the limitations of conventional practice in psychometrics is also raised by Mislevy (2004). He argues that “the challenge to test theory specialists is to continually broaden their methodology, in order to extend the toolkit of data-gathering and interpretation methods available to deal with increasingly richer sources of evidence and more complex arguments for making sense of that evidence” (Mislevy, 2004, p. 239). In an earlier response to my article, Mislevy (1994) characterized four different senses of “reliability”: (a) true score reliability, as reflected in classical test theory; (b) reproducibility, as reflected, for instance, in “proportions of agreement among raters, decision-consistency coefficients, and generalizability coefficients” (Mislevy, 1994, p. 6); (c) differential likelihood, as reflected, for instance, in item response theory and, more generally, in probability-based reasoning, “where the relative likelihood of an observation under alternative ‘true states’ is the weight of evidence it provides for each” (Mislevy, 1994, p. 6)²; and, finally, (d) credibility, as used in “common parlance, where ‘reliability’ simply means the extent to which information can be trusted” (Mislevy, 1994, p. 8). He concludes, “if by reliability we mean credibility of evidence, where credibility is defined as appropriate to the inference, the answer is no, we cannot have validity without reliability” (Mislevy, 1994, p. 11). I have no quarrel with this conclusion: his definition of “reliability” is sufficiently broad to encompass a wide range of validity practices within and beyond psychometrics. If there is such a thing as a model-free definition of reliability, this is surely it. What this generalized definition fails to illuminate, however, is that whenever we reason from evidence to inference, we engage in particular practices that can be located within particular theories of “credibility.” These theories and the practices through which they are instantiated do indeed constrain what can count as evidence and what kinds of inferences can be supported. This is not a problem that can be overcome. It is as true of hermeneutics as it is of psychometrics or of any other approach to “credibility.” The obligation for the theorist is to illuminate those consequences for critical review.
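
To give sense (c) a concrete form: the “weight of evidence” idea is commonly written in likelihood terms, as below; this is a standard rendering, not Mislevy’s own notation.

\[
  \mathrm{LR}(x) = \frac{P(x \mid \theta_{1})}{P(x \mid \theta_{2})},
  \qquad
  W(x) = \log \mathrm{LR}(x)
\]

Here θ₁ and θ₂ are two candidate “true states” (for instance, two ability levels under an item response model), x is an observation, and the larger the ratio, the more strongly the observation favors θ₁ over θ₂.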

Mislevy turns to a generalized conception of evidence-based probabilistic reasoning that he draws, in part, from the work of Schum (1994) and Kadane and Schum (1996) (hereafter K&S). As envisioned in K&S, evidence-based probabilistic reasoning can encompass both (a) frequentistic probabilities that rely on replicability and the enumeration of outcomes and (b) “personal, subjective, judgmental, or epistemic probability judgments” (Kadane & Schum, 1996, p. 118). Schum’s approach allows, as Mislevy notes, the analysis of “more complex interrelationships, multiple perspectives, and unique observations” (Mislevy, 2004, p. 239). Thus Mislevy asserts that “there is no inherent conflict” between evidence-based probabilistic reasoning and hermeneutics, a stance that is defensible. I can imagine ways in which the same kind of post hoc analysis K&S brought to bear on the decades-old Sacco and Vanzetti death penalty case could illuminate decision-making processes among evaluators in performance assessments. Consistent with K&S’s recommendation, probabilistic reasoning would play a subsidiary role as “one possible check” (Kadane & Schum, 1996, p. 263) on the conclusions developed.
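
As a minimal sketch (my gloss, not K&S’s notation) of how the two kinds of probability can sit together, Bayes’ theorem in odds form combines them directly:

\[
  \frac{P(H_{1} \mid E)}{P(H_{2} \mid E)}
  =
  \frac{P(H_{1})}{P(H_{2})}
  \times
  \frac{P(E \mid H_{1})}{P(E \mid H_{2})}
\]

The prior odds (the first factor on the right) can carry personal or epistemic judgments, while the likelihood ratio (the same quantity sketched earlier as a weight of evidence) can rest on replicable, enumerable outcomes. On K&S’s recommendation, the resulting posterior would still serve only as one check on, not a substitute for, the interpretive work.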
However, we should distinguish, as Mislevy does, between the theoretical potential of Schum’s work and the set of applications Mislevy develops in his program of “evidence-centered assessment design,” which is based on a far more circumscribed (and efficient) set of practices (Mislevy, Almond, & Steinberg, 2003). As Mislevy (2004) notes, “collecting predetermined items of evidence, to be evaluated along predetermined lines, is a strategy for obtaining at relatively low cost information that previous work suggests will be useful” (Mislevy, 2004, p. 239). This reliance on predetermined models is not consistent with hermeneutics. From the perspective of hermeneutics, the unique features of each case must be understood in context, and the meaning of the guiding principles or criteria is necessarily shaped by the particular case to which they are applied (much as precedents function in the law). This is not a criticism of Mislevy’s work. That a tension exists between Mislevy’s approach to evidence-based probabilistic reasoning and hermeneutics is not an issue that requires resolution. The tension is productive: it reminds us of the limitations and consequences of any given approach and the value of alternative perspectives in illuminating them.
We may choose to limit the work of psychometrics to developing increasingly sophisticated models that can be routinely applied to support large-scale assessments, and that is a worthy goal. However, these are not the only methods through which important decisions about individuals and institutions can be warranted (the law providing the most persuasive counterexample). In K-12 educational accountability, there has been a tendency to dismiss, or at least not to pursue, assessments that are not grounded within psychometrics. It is somewhat ironic that educational measurement specialists routinely recommend that test scores be interpreted in light of other information about an individual (e.g., AERA, APA, & NCME, 1999), yet have little theoretical or practical advice to offer about how to combine such disparate pieces of information to reach a well-warranted conclusion, a task to which hermeneutics is well suited.

Whenever we settle on a definition of “reliability” or other criteria of validity, we need to consider the consequences of that choice for the types of assessments that will be considered sound and, in turn, for the nature of intellectual work that students and teachers will be encouraged to undertake and the discourse about educational reform that will likely be fostered. One of the most productive ways to illuminate such limitations and consequences is to consider perspectives from outside our own discipline. As Messick (1989) argued:

The very recognition of alternative perspectives . . . should be salutary in its own right. This is so because to the extent that alternative perspectives are perceived as legitimate, it is less likely that any one of these perspectives will dominate our assumptions, our methodologies, or our thinking about the validation of test use. (p. 88)

Notes
I am grateful to Steve Schilling and Mark Wilson for comments on an earlier draft of this manuscript. My work on hermeneutics and assessment has been supported, in part, by grants from the Spencer Foundation. This article was accepted under the editorship of Larry Hedges.
1. Bernstein (1985) provides a useful comparative introduction to hermeneutics.
2. See Brennan (2001) for a critical review of conceptions of “reliability” in item response theory.

References
American Educational Research Association, American Psychological Association, and
National Council on Measurement in Education. (1999). Standards for educational and
psychological testing. Washington, DC: American Educational Research Association.
Bernstein, R. J. (1985). Beyond objectivism and relativism. Philadelphia, PA: University of Pennsylvania Press.
Bernstein, R. J. (1992). The new constellation. Cambridge, MA: MIT Press.
Brennan, R. L. (2001). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38(4), 295–317.
Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.
Kadane, J. B., & Schum, D. A. (1996). A probabilistic analysis of the Sacco and Vanzetti
evidence. New York: Wiley.
Li, H. (2003). The resolution of some paradoxes related to reliability and validity. Journal
of Educational and Behavioral Statistics, 28(2), 89–95.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Messick, S. (1989). Validity. In Educational measurement (3rd ed., pp. 13–103). Washington, DC: American Council on Education and National Council on Measurement in Education.
Mislevy, R. J. (1994). Can there be reliability without “reliability”? Princeton, NJ: Educa-
tional Testing Service.
Mislevy, R. J. (2004). Can there be reliability without “reliability”? Journal of Educational
and Behavioral Statistics, 29(2), 241–244.
Mislevy, R. J., Almond, R., & Steinberg, L. (2003). On the structure of educational assess-
ment. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62.

Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher,
23(2), 5–12.
Moss, P. A., & Schutz, A. (2001). Educational standards, assessment, and the search for consensus. American Educational Research Journal, 38(1), 37–70.
Schum, D. A. (1994). The evidential foundations of probabilistic reasoning. New York:
Wiley.

Author
PAMELA A. MOSS is Associate Professor, School of Education, University of Michigan, 610 East University, Ann Arbor, MI 48109; pamoss@umich.edu. Her areas of specialization are at the intersections of educational assessment, validity theory, and interpretive social science.

