
Review of Educational Research

Winter 1976, Vol. 46, No. 1, pp. 133-158

Evaluation Models for Criterion-Referenced Testing: Views Regarding Mastery and Standard-Setting

John A. Meskauskas

American Board of Internal Medicine

In the dozen years since Glaser's (1963) seminal article on criterion-referenced testing, the acceptance of the concept of mastery as an educational and, hence, evaluation goal has grown tremendously. A large number of articles have been published, curriculum programs have been devised that employ criterion-referenced testing, and yet writers still feel it necessary to define what a criterion-referenced test is. Furthermore, the various published definitions are by no means equivalent. One also observes a shift in the interests and background of the authors of papers over this period. In the Sixties, writers were primarily advocating the adoption of criterion-referenced testing from an educational or philosophical point of view in spite of the reservations of the classical measurement theorists, whereas in the Seventies a new generation of measurement specialists has become involved, and the papers are much more mathematical. A number of mathematically-based techniques for deciding on cutting scores and related issues such as test length have been published—for a particularly good review see Millman (1973)—but the authors' conceptualizations of the educational task facing the learners have not always been clear. The purpose of this paper is to investigate the mastery models underpinning the techniques being proposed and to review the procedures suggested for setting the pass-fail point.

The comments of Professor Frederick B. Davis, University of Pennsylvania Graduate School of Education, and Dean D. Dax Taylor, Southern Illinois University School of Medicine, on an earlier draft of this paper are gratefully acknowledged.
The adoption of the criterion-referenced approach to evalua-
tion very quickly raises two measurement issues that have
relatively less importance in norm-referenced testing. These can
be broadly stated as the issue of the definition of mastery, and
the issue of a priori standards—closely intertwined but different
problems. Rigorous exploration of these to date has been quite
minimal. Perhaps the development of content aspects has been of
greater urgency. However, this area is receiving increasing
attention. Measurement specialists have turned their attention
to criterion-referenced measurement, introducing the use of
decision theory and Bayesian statistics.
The evolving models are alike in requiring tight specification of
content areas. Objectives are to be written in sufficient detail so
that the form and content of the measurements are implicit in
the educational objectives. They are quite unlike in their view of
mastery, however. The positions can be divided into two broad
categories—mastery as an area on a continuum (Continuum
models) and mastery as all-or-none (State models). However,
several senses of the term mastery need to be distinguished. On
the one hand, the term has been used to refer to a theoretical
relationship linking performance and time spent in the particu-
lar educational environment. When used in this sense, the word
relates to achievement as a hypothetical, generalized
function—an abstraction that is descriptive of the performance
of groups of learners. On the other hand, mastery is also used as a
label characterizing an individual's achievement. Continuum
and State models refer to mastery in the former sense.

Continuum Models
The characteristics common to all Continuum criterion-
referenced models, as defined in this paper, are the following:
1. Mastery is viewed as a continuously-distributed ability or
set of abilities.
2. An area is identified at the upper end of this continuum, and
if an individual equals or exceeds the lower bound of this
area, he is termed a master.
3. The goal of measurement is to obtain information for the
purposes of educational decision-making, which explicitly
follows the classification decision.


Of course, there are variations on the above theme. Some writers have viewed mastery in terms of a continuum of skill ranging from none to perfection. A test based on this view seeks to describe the learner's position along a scale that parallels, in some meaningful way, the learner's achievement at a particular point in time. Ebel (1971) describes this type of scale as follows:

In criterion-referenced measurement the scale is usually anchored at the extremities, a score at the top of the scale indicating complete or perfect mastery of some defined abilities. The scale units consist of subdivisions of this total scale range. (p. 282)

The above definition does not exclude the possibility of some degree of heterogeneity in the mental function being measured (note the reference to abilities in the plural). Interpretation of performance on such a test proceeds from noting how close the individual examinee is to either an end or a subdivision of the scale, a concept that is not very different from those employed in norm-referenced testing. Instead of referring to an individual's achievement relative to others, this view compares an individual's performance relative to the end-points or regions on the scale. Since it is not clear that the subdivisions necessarily represent steps along even an equal-interval scale, interpretation of this type of score is somewhat handicapped.
At a higher level of complexity, a continuum model with a ratio-scale definition of the intervals has been proposed by Kriewall (1972):

. . . suppose that student a has completed some phase of work with respect to learning objective k (LO_k). Further imagine that we require the student to respond to all items in the population of items defined by LO_k. The proportion of items to which the student exhibits a correct response is a measure of his proficiency. (p. 10)

Subdivisions of this continuum defining mastery and nonmastery are established, along with an intermediate zone for those who are in neither classification.
These varying views of mastery produce, as one would expect,
varying approaches to standard-setting.

Nedelsky's Minimum Pass Level (MPL) Method


In the late 1940's Nedelsky (1954) developed an approach toward an "absolute standard" for the University of Chicago departmental physics course. The department, which taught the physics course by means of a common subject outline, generated a common departmental comprehensive examination consisting of well over 200 five-choice questions. Each of the approximately six instructors who were teaching sections at a given point in time was asked to look at the test prior to the candidates' taking it, and decide, for each question, which of the distractors the lowest passing student (in this instance a D student) should be able to identify as incorrect. The minimum passing level (MPL) for each item is the reciprocal of the number of remaining alternatives. If, in a five-choice question, only one of the distractors is marked as one that the lowest passing student should be able to eliminate, the minimum passing level for that item is 1/4, since there are four remaining alternatives.

Each question was rated by all the instructors in this manner, and the minimum passing level for the examination, as rated by each instructor, consisted of the summation of these individual item MPL values. For a five-choice item, the possible values are .20, .25, .33, .50, and 1.0. The instructors' total-test MPL values were averaged to arrive at the departmental minimum passing level average, M_FD. This procedure therefore assumes that the average borderline person will be successful on items classified as having N nonrejectable choices an average of 1/N times.
Nedelsky felt that the standard deviation of the instructor MPL-value distribution constituted a theoretical distribution of the scores of the borderline students. He stated that this could be interpreted in terms of a normal distribution, so that plus one standard deviation of the instructor ratings, which he called σ_FD, constituted a point where 86% of the students who were on the F-D borderline would fail. The department chose a value of the multiplier, K, which, if the assumptions were correct, would fail 95% of what they considered to be these borderline students. Thus, the full formula that Nedelsky reported was:

MPL = M_FD + Kσ_FD.    (1)

The value of K was agreed upon prior to the examination being given. There appears to be no good reason to expect that the mean and S.D. of instructors' MPL values should be related to anything other than instructor/instruction variables. That being the case, use of a Kσ term seems unjustified, although the basic concept of fine-tuning the setting of a minimum-passing point on the basis of a probability value may well have utility. A binomial or Bayesian approach would appear preferable to a normal model, however.
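As a concrete illustration of the arithmetic, the method can be sketched in a few lines of code. The judge ratings below are hypothetical, and the sketch simply implements formula (1) as given above, with K left to the user.

```python
import statistics

def item_mpl(num_choices, num_eliminable):
    """Minimum passing level for one item: the reciprocal of the number of
    alternatives remaining after the distractors the borderline (F-D) student
    should be able to reject are removed."""
    return 1.0 / (num_choices - num_eliminable)

def test_mpl(ratings_by_judge, k=0.0):
    """Test-level MPL: sum each judge's item MPLs, average the judges' totals
    (M_FD), and add K times the spread of the totals (sigma_FD), per formula (1).

    ratings_by_judge: one list per judge of (num_choices, num_eliminable)
                      pairs, one pair per item.
    """
    totals = [sum(item_mpl(c, e) for c, e in judge) for judge in ratings_by_judge]
    m_fd = statistics.mean(totals)
    sigma_fd = statistics.stdev(totals) if len(totals) > 1 else 0.0
    return m_fd + k * sigma_fd

# Hypothetical ratings from three judges on a four-item, five-choice test.
judges = [
    [(5, 1), (5, 2), (5, 3), (5, 4)],
    [(5, 1), (5, 1), (5, 3), (5, 4)],
    [(5, 2), (5, 2), (5, 2), (5, 4)],
]
print(test_mpl(judges, k=0.0))   # with K = 0, as in the applications noted below
```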
The net result of the Nedelsky method, although flawed by the questionable inclusion of K, represented an absolute standard in that the minimum-passing point was not related in a fixed way to any score distribution. Thus, no student had to fail just because of his relative group standing.
Although this writer is not aware of continuing use of this
method at the University of Chicago, a number of applications of
this method have been carried out in the area of medical educa-
tion. The University of Illinois Medical School (Note 1) uses this
method to set the minimum passing level on their comprehensive
examinations for the M.D. degree. The value of K is set at 0 at the
University of Illinois, which seems to be common usage among
those who use this method in medical education. Taylor, Reid,
Senhauser, & Shively (1971) described the experiences of the
Pathology Department at the University of Missouri with this
technique. These two applications appeared to yield a satisfactory result in that an apparently reasonable standard was set. However, the recent article by Levine and Forman (1973) presented an MPL study which failed an inordinately high number of students. This appeared to be a result of the fact that only one faculty member was involved in setting the MPL, whereas, in the Nedelsky paper, at least six instructors were charged with rating the questions, and the highly divergent instructors' ratings were dropped from consideration.
The present author carried out an MPL study as part of the
study of several alternative standard-setting methodologies for
use with a recertifying examination for physicians (Meskauskas
& Webster, 1975). The six judges were all members of the commit-
tee responsible for developing the examination. The range of the
MPL values was very large—36% to 80% for one-best-answer
questions and 48% to 89% for true-false items. Clearly it was not
feasible to use this method given the wide differences in ratings
found. The impact of item-by-item judgments on failure rate is very clear, and if there exists a desire to assure a particular outcome, the individual's judgments are different from what they would be if some other mental state pertained. This suggests that the method has usefulness, but the procedure used to arrive at the MPL must ensure a broad base of judgment and consensus so that a common standard is, in fact, achieved. The use of the Nedelsky technique by itself does not eliminate the differences of opinion, often seen, as to what represents adequate performance.

Ebel's Method of Passing Score Estimation


Ebel (1972) has developed a method for arriving at the minimum passing score by considering the characteristics of the items along two dimensions: relevance and difficulty. He presents an example where four relevance categories are used—Essential, Important, Acceptable, and Questionable. Three difficulty levels (easy, medium, and hard) are identified. This forms a 3 × 4 grid into which all questions are classified based on raters' judgments as to the relevance and difficulty of the questions for the minimally qualified examinee. Judgments are also made, for each cell in the table, regarding the percentage of items in the cell that the minimally-qualified candidate should be able to answer. The number of questions in each cell is multiplied by the appropriate percentage, and the sum of all cells is divided by the total number of questions to yield the lowest passing score. This is expressed in terms of percentage correct.
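The computation is a weighted average of the cell judgments. The following sketch assumes a hypothetical relevance-by-difficulty grid with made-up item counts and expected percentages; it is not Ebel's own worked example.

```python
def ebel_passing_score(grid):
    """Lowest passing score (percentage correct) from Ebel-style judgments.

    grid: mapping from a (relevance, difficulty) cell to a pair
          (number_of_items, expected_percent_correct_for_borderline_examinee).
    """
    total_items = sum(n for n, _ in grid.values())
    expected_correct = sum(n * pct / 100.0 for n, pct in grid.values())
    return 100.0 * expected_correct / total_items

# Hypothetical judgments for a 100-item test.
grid = {
    ("Essential", "easy"):    (30, 95),
    ("Important", "easy"):    (20, 85),
    ("Important", "medium"):  (25, 70),
    ("Acceptable", "medium"): (15, 60),
    ("Acceptable", "hard"):   (10, 40),
}
print(ebel_passing_score(grid))   # 76.0 percent correct
```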
One gathers the impression that Ebel is not committed to the particular descriptors used along the two dimensions. Many test constructors may wish to use somewhat different descriptors, as the inclusion, in a test, of a category of items judged of questionable relevance seems hard to defend. The Relevance dimension is
probably not independent of the Difficulty dimension. One would
think that, although Acceptable items can be of any degree of
difficulty, Essential items can only be judged Easy when used for
a postinstruction examination.
In Ebel's method, the judge must simulate the decision pro-
cess of the examinee to obtain an accurate judgment and thus set
an appropriate standard. Since the judge is more knowledgeable
than the minimally-qualified individual, and since he is not
forced to make a decision about each of the alternatives, it seems
likely that the judge would tend to systematically over-simplify
the examinee's task. Whereas the examinee has to choose among
a number of alternatives, the judge's tendency is to consider only
the correct answer in relation to the stem. Thus, the judge's
rating process is transformed from a consideration of how dif-
ficult a question is when considered in relation to its distractors
to merely the difficulty of the correct answer. Even if this occurs only occasionally, it appears likely that, in contrast to the Nedelsky method, the Ebel method would allow the rater to ignore some of the fine discriminations that an examinee needs to make and would result in a standard that is more difficult to reach. However, perhaps the most troublesome feature of Ebel's method is the requirement that a separate judgment be made about the percentage of items in each cell along the relevance/difficulty continuum that the minimally-qualified examinee should be required to answer. Unless there are external criteria upon which to base this judgment, it seems entirely arbitrary.

The Kriewall Binomial-based Model


The two methods of standard-setting reviewed above focused on decisions relating to the content of the test; the models to follow deal with approaches that start by assuming a standard of performance and then evaluate the classification errors resulting from its use. If the error rate is inappropriate, the decision-maker adjusts the standard a bit and tries his equations again. Thus the standard is set indirectly.
Kriewall (1972) developed a model that, although it uses the assumption of an underlying continuous distribution of proficiency, focuses on categorization of learners into several categories. Three classifications are noted: nonmaster, which is a lack of any skills in a class of tasks; master, which is the possession of skills that allow one to solve all the problems in a class; and an in-between state where the student may have developed some skills that allow for solution of some but not all of the problems in a task. This latter category is mentioned but ignored in the subsequent development of the model. The focus of Kriewall's model is a decision-theoretic application to a selection problem within the classroom—how to assign students into groups that are most like each other in terms of the three areas denoted along the mastery continuum. The distributions of scores estimating these mastery states are termed proficiency distributions. Proficiency distributions are not considered to be normal in shape, but are thought to be multimodal (bimodal or trimodal). Normal distributions occur about the modal points only because of random factors generally classed as "error."
The test characteristics assumed with this model are very similar to those that Hively, Maxwell, Rabehl, Sension, and Lundin (1973) have developed. Kriewall starts with the specification of a learning objective (LO), together with a set of questions ("replacement sets") that measure the contents of the learning objective. The individual items are thought to be homogeneous and equivalent in difficulty by definition—they are part of the replacement set of items associated with the LO. The likelihood that a given individual will achieve a correct answer is considered to be fixed across all items for a given learning objective. Any distribution of difficulty of questions for an individual within a test is attributed to the functioning of randomly occurring erroneous responses. However, in spite of the fact that the individual's proficiency score may be other than 100% or 0%, it is implicitly assumed that the function of measurement is to classify the student into one or the other of two categories—master or nonmaster.
The probability model used to develop the likelihood of classification error arises out of the Bernoulli (also called binomial) model. This model views criterion-referenced performance as ". . . a sequence of independent Bernoulli trials, each having the same probability of success, z" (Kriewall, 1972, p. 12). Given a total of n events (questions in the test in this case), the formula for the probability of the occurrence of an individual event is:

f(x) = C(n, x) p^x q^(n-x),    (2)

where
C(n, x) = n!/(x!(n-x)!), the binomial coefficient,
x = a test score,
n = the number of questions in the test,
p = z = proficiency of an individual, such that 0 ≤ z ≤ 1.00, and
q = z′ = error rate of an individual = 1 - z.
Kriewall's model states that the binomial distribution can be used to determine the probability of several events. One must decide upon values for ranges of performance on a test which one will accept as being indicative of mastery and nonmastery. Thus Z₁ is the lower bound of the mastery range (expressed as a proportion of errors, i.e., .2) and Z₂ the upper bound of the nonmastery range. One must also define C, the standard that will be used to determine pass-fail, which is the maximum number of allowable errors for masters. The recommended value of C is midway between Z₁ and Z₂. (Kriewall's emphasis on the errors that individuals make on a test is unusual. Most authors, and indeed most people, wish to know how many questions a person gets correct instead.) Given acceptable values for the above three variables, the binomial distribution can be used to determine the probability of several events. The following relation indicates the probability that an individual's score meets the error criterion, C, when the observed number of errors is less than C:

S = Prob(W < C) = Σ_{w=0}^{C-1} C(n, w) Z₁^(n-w) (1 - Z₁)^w,    (3)

where C(n, w) = n!/(w!(n-w)!), the binomial coefficient, and w = the observed number of errors.


The term to the right of the binomial coefficient is precisely the term found in formula (2), since a person's score, x, is equal to the total number of items, n, less the number of errors, w.

The same formula can be used to obtain the probability of a false positive result (a nonmaster who scores in the mastery range) and a false negative result (a master who scores in the nonmastery range). The probability of obtaining a false negative result is given as:

α = Σ_{w=C}^{n} C(n, w) Z₁^(n-w) (1 - Z₁)^w.    (4)

The probability of a false positive result is:

β = Σ_{w=0}^{C-1} C(n, w) Z₂^(n-w) (1 - Z₂)^w.    (5)

The result of formula (4) is equivalent to obtaining the probability that, given a large number of equivalent trials, a person whose true score is equal to the lowest score in the mastery range will fall in the nonmastery range. How large this value is depends on the difference between Z₁ and Z₂ for given values of n and w. Beta, in a similar fashion, gives us the probability of a true score at the highest level of nonmastery, Z₂, falling into the mastery range.
Kriewall suggests that the value of C be set midway between the limiting proficiency values for masters and nonmasters so as to minimize the error. When S (termed the success rate) is plotted against error rate, it becomes apparent that the optimal value for C is that where the shape of a step function is approximated most closely. This is most likely to occur when C is approximately midway between Z₁ and Z₂.
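To make the mechanics concrete, the three quantities in formulas (3)-(5) can be computed directly as binomial sums. The sketch below follows the reconstruction above, treating Z₁ and Z₂ as true proportions of correct responses; the test length, bounds, and criterion used are hypothetical.

```python
from math import comb

def kriewall_error_rates(n, c, z1, z2):
    """Classification probabilities from formulas (3)-(5).

    n:  number of items in the test
    c:  the error criterion C (an examinee passes when the observed number
        of errors W is less than C)
    z1: true proportion correct at the lower bound of the mastery range
    z2: true proportion correct at the upper bound of the nonmastery range
    """
    def prob_fewer_than_c_errors(z):
        # P(W < C) when each item is an independent Bernoulli trial with
        # success probability z
        return sum(comb(n, w) * z ** (n - w) * (1 - z) ** w for w in range(c))

    s = prob_fewer_than_c_errors(z1)        # formula (3): the success rate
    alpha = 1.0 - s                         # formula (4): false negative rate
    beta = prob_fewer_than_c_errors(z2)     # formula (5): false positive rate
    return s, alpha, beta

# A 20-item test with mastery bounded at 90% correct, nonmastery at 70%
# correct, and C set midway between the implied error counts (2 and 6).
print(kriewall_error_rates(n=20, c=4, z1=0.90, z2=0.70))
```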
The Kriewall model differs from the Nedelsky and Ebel models in very significant ways, but the most important is that the latter two models seek to make judgments about standards from a data base of information about the content of the test, whereas the former assumes control over content and seeks to determine a standard-setting decision from a study of its impact. This kind of process-product distinction has tremendous implications for the way standards are set, but since neither type of approach has been validated to any significant extent, it is hard to say at this juncture which is "better."


State Models
A widely-held view among educators conceptualizes mastery
as an all-or-none description of the student's learning state with
respect to a specified content domain. This is a natural out-
growth of the movement toward setting explicit educational
goals that should be met by all (or essentially all) students.
Bloom, Hastings, and Madaus (1971), for example, argue very
persuasively that the failure of students to achieve certain ob-
jectives may be the result of an educational process that is in-
sensitive to the educational needs and styles of the students.
The argument is advanced that an individualized educational
program will bring substantially all students (Bloom et al. feel
it may be 90%) to a mastery state.
There are several intertwined issues in the state-model posi-
tion which must be separated out for clarity. The first is that
the model implicitly demands 100% performance. Davis and
Diamond (1974, p. 133) note that:
Strictly speaking, mastery is defined as complete know-
ledge, skill, or control; so "partial mastery" is as self-
contradictory a phrase as "partial uniqueness." The
term "mastery," therefore, should be used to describe the
status of only those examinees who, it may be inferred,
can mark correctly all the items in the population of
which the subset that makes up a criterion-referenced
test is a representative sample. (p. 133)

The substitution of the concept "behaviors" for "items" would


make the above definition very generally applicable. In any case,
some may doubt that mastery, as defined above, is a realizable
educational goal. Although all-or-none standards are unreason-
able when goals have not been defined and operationalized, the
educational system has been using an implicit state model all
along for many goals. The features common to state models are:
1. Criterion-referenced test (CRT) true-score performance is
viewed as an all-or-none dichotomous task.
2. The standard or cutting score that should be used in an
error-free situation is implied as part of the model.
3. Considerations of measurement error essentially always re-
sult in the adoption of standards that demand less than the
model seeks.
The theoretical shape of the acquisition "curve" for the behavior being measured by state models is probably that of a step function. With such a distribution, one cannot meaningfully speak of the rate of acquisition of proficiency as a function of time, since the function is not continuous and has a slope of zero at all but one point. The only logical question to be asked is whether or not an individual has achieved mastery. But this introduces a paradox. The model is seeking "perfection" in an imperfect world with imperfect measuring instruments. How are we to deal with this problem?
The most common answer has undoubtedly been an intuitive one. The decision-maker chooses a level of performance that "seems right," so that perfect performance is not demanded, but yet there is an intuitively reasonable assurance that those who reach the chosen level have, in fact, achieved a state of mastery. This has not only been true for individual teachers, but for large curriculum programs as well. Hambleton (1973) reviewed Glaser's decision rules used for Individually Prescribed Instruction (IPI), Flanagan's Program for Learning in Accordance with Needs (PLAN), and those of Carroll, Bloom, and Block concerning Mastery Learning and found that either program-wide decisions were made on the standard to be used (80-85% for IPI) or the decision was left to each individual teacher.

The alternative approach is to construct a decision model which takes into account factors that introduce measurement error—the deficiencies of the examination and the functioning of random processes of various types. The models to follow are of this type.

Emrick's Mastery Testing Evaluation Model


The Emrick (1971) model assumes that the measurement device being used is composed of homogeneous questions that are measuring a homogeneous content area. Besel (1971), in comparing Emrick's mastery learning model with Kriewall's model, came to the conclusion that Emrick's work was most applicable to very short tests of five items or less that measure very specifically stated instructional objectives. Thus, a summative examination designed to conform to this model would cover many tightly defined skill areas and be composed of many subtests that would result in a diagnostic profile of the individual's skills. State-model tests are to be used as pre- or posttests only, since it doesn't make much sense to test except for diagnostic purposes (pretest) or learning-unit mastery confirmation (posttest) purposes.

Emrick draws upon decision theory to provide the best possible solution to the paradox noted above—although one wishes to see individuals perform at the level of perfection on the examination, neither the examination nor the examination conditions are perfect. Starting with the assumption that a person can be only in a mastery state or in a nonmastery state, the measurement problem becomes simplified because a person should answer all of the questions in a skill area correctly, or else he should get them wrong. When answering an item, if an individual is a master, he will either answer the question correctly, in which case his performance is typical of that to be expected of a master; or, as a result of error, he may answer the question incorrectly. If he answers it incorrectly, and the evaluator classifies him as a nonmaster, the evaluator commits a classification error. This is called β (Type 2) error, or false negative classification. If, on the other hand, the individual is a nonmaster and obtains a correct answer to the question, then another type of measurement error occurs—called α (Type 1) error. Classifying the individual as a master in this case constitutes a false positive error. It is, of course, obvious that the use of every test introduces a risk of a certain amount of α and β error.
Emrick notes that consideration should also be given to factors
outside the test itself. An examination operates in a milieu that
bears upon the decisions made from the use of the examination
results. Emrick points to three classes of what he terms decision
errors. He suggests a cost-benefit analysis of the variables that
belong in these three classes. The first class is statistical, such as
item reliability, test length, and other considerations of this
type. If the decisions to be made from an individual examination
are extremely important, then statistical considerations become
very important. If, on the other hand, the examination is being
used primarily for diagnostic purposes, then perhaps the statisti-
cal considerations are not as critical.
The centrality (importance) of content is the second class. Ob-
jectives that are central to the educational process should be
emphasized in mastery testing, whereas objectives that are
viewed as secondary receive less attention. Some learning objec-
tives have high inherent importance; others are important be-
cause they are prerequisite to other topics. If these are not
measured adequately and precisely, a large evaluative "regret"
(which may be viewed as a summation of errors within a type,
across classes) occurs.
The third class of factors is that of psychological costs that re-
sult from decision errors. Although this area should be con-
sidered in any evaluation procedure, whether this is of im-
portance depends on the use of the test. For children, this
factor can be important, but for adults it seems somewhat less so.
Yet another class needs to be added to the above three. In
certain cases, such as summative examinations, which, if passed,
serve as gatekeepers for the provision of professional services to
a public, societal costs need to be considered. Costs associated
with allowing incompetent or unready professionals to practice
need to be taken into account, along with the less dramatic but no less important risk of holding back people who are ready to supply a needed service. What these costs might be would be influenced by the potential harm an individual may cause. A jet aircraft mechanic's skills need to undergo (and are given) more careful scrutiny than an automobile mechanic's, for example.
Emrick presents a formula which requires a determination of the α and β errors, the Ratio of Regret, and the test length. The Ratio of Regret is obtained by evaluating the various classes of decision errors and noting the summed risks. This formula allows the calculation of the optimal value of the percentage of questions that should be answered correctly on a particular test in order for one to conclude that an individual is a master. Of course, in the absence of quantified values of the Type 1 and 2 errors, estimates of the various parameters can be used. The formula for K, the optimal cut point on a given test, is:

K = [log(β/(1 - α)) + (1/n)(log RR)] / log[αβ/((1 - α)(1 - β))],    (6)

where
K = the cut point expressed as a percentage score on the test,
α = estimated probability of Type 1 item error (false positive error),
β = estimated probability of Type 2 item error (false negative error),
RR = Ratio of Regret of Type 2 to Type 1 decision errors, and
n = test length (number of items).
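A minimal numerical sketch of formula (6) as reconstructed here; the parameter values are hypothetical and serve only to show how the cut point moves when the Ratio of Regret changes.

```python
import math

def emrick_cut_point(alpha, beta, regret_ratio, n):
    """Optimal mastery cut point (proportion correct) for an n-item test,
    per formula (6) as reconstructed above.

    alpha:        probability a nonmaster answers an item correctly (Type 1)
    beta:         probability a master answers an item incorrectly (Type 2)
    regret_ratio: regret of a Type 2 (false negative) decision error relative
                  to a Type 1 (false positive) decision error
    n:            test length
    """
    numerator = math.log(beta / (1 - alpha)) + (1 / n) * math.log(regret_ratio)
    denominator = math.log((alpha * beta) / ((1 - alpha) * (1 - beta)))
    return numerator / denominator

# With symmetric item errors and equal regret the cut falls at 50%; weighting
# false negatives more heavily (RR > 1) lowers the cut.
print(emrick_cut_point(alpha=0.20, beta=0.20, regret_ratio=1.0, n=10))  # 0.50
print(emrick_cut_point(alpha=0.20, beta=0.20, regret_ratio=3.0, n=10))  # about 0.46
```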
It is ironic that most of the published individualized instruction or other curriculum programs are strongly oriented toward the state model of evaluation, yet adopt an arbitrary approximation as to the level of performance that is indicative of mastery. Emrick's model provides a firm methodological case for this type of instructional approach and seems worthwhile to pursue. However, empirical quantification of the variables is likely to be a difficult and time-consuming matter.
Roudabush's Dichotomous True-Score Models
Two forms of state models were considered by Roudabush (Note 2). The first form involves a dichotomous measure of a dichotomous true score. This, of course, is the case of a one-item test. Although no decision rules are developed for this case, Roudabush makes the interesting point that error scores in such a case are negatively correlated with true scores—a clear violation of classical test theory.
The second form to be considered is that of a "pseudo-continuous" measure of a dichotomous true score. One surmises that the reason for the term "pseudo-continuous" relates to the fact that the measure is apparently continuous, but is not hypothesized to be such. A cut-point, or criterion score, is assumed as a boundary between mastery and nonmastery—hence the Roudabush state model is dichotomous, but not necessarily all-or-none. The model does not allow the determination of an optimal cutting score per se, but it does allow a determination of the extent to which a set of data agrees with the model. Suppose there exists a dichotomous criterion, which could be another test, teacher's ratings, or perhaps direct observation. The table of observed frequencies of these data could be displayed as follows:

Table 1
Observed Frequencies on Test and Criterion

                                 Performance on Criterion
                                    0        1
Performance on CRT      0          f₀₀      f₀₁      f₀.
                        1          f₁₀      f₁₁      f₁.
                                   f.₀      f.₁       N

Roudabush treats criterion-measure performance as if it were a true score. However, in this error-free measurement situation, there would be some number of cases, N₀, who had not achieved mastery, and another group, of size N₁, who had. This table of frequencies would be:

Table 2
Expected Frequencies for Error-Free CRT and Criterion Performance

                                 Performance on Criterion Measure
                                    0        1
Performance on CRT      0          N₀        0        N₀
                        1           0       N₁        N₁
                                   N₀       N₁         N

The real-life measurement situation yields a mixture of true scores and error. Four types of error are possible in this case:

let
α₁ = P(X ≥ X_c | T = 0) = the probability that nonmasters show mastery on the CRT,
α₂ = P(X ≥ X_c | T = 0) = the probability that nonmasters show mastery on the criterion,
β₁ = P(X < X_c | T = 1) = the probability that masters show nonmastery on the CRT, and
β₂ = P(X < X_c | T = 1) = the probability that masters show nonmastery on the criterion,
where
T = a true score,
X = an obtained score, and
X_c = the cutting score or standard.

Given the above definitions, Roudabush (Note 2) notes that the relations below can be written:

f₀₀ = N₀(1 - α₁)(1 - α₂) + N₁β₁β₂,
f₀₁ = N₀(1 - α₁)α₂ + N₁β₁(1 - β₂),
f₁₀ = N₀α₁(1 - α₂) + N₁(1 - β₁)β₂,                    (7)
f₁₁ = N₀α₁α₂ + N₁(1 - β₁)(1 - β₂).

Since N = N₀ + N₁, and N is known, there are five unknowns in the above set of relations, but there are only three independent equations. Once three equations have been solved, the fourth can be obtained by subtracting the sum of the results from N. If the assumption that the criterion is error-free is tenable, then α₂ and β₂ are zero, and the equations can be solved.
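Under the error-free-criterion assumption the algebra is immediate, as the following sketch shows; the cell frequencies are hypothetical.

```python
def crt_error_rates(f00, f01, f10, f11):
    """Solve relations (7) when alpha2 = beta2 = 0 (an error-free criterion).

    fij: observed frequency with i = CRT classification and j = criterion
         classification (0 = nonmastery, 1 = mastery).
    Returns (N0, N1, alpha1, beta1): the numbers of true nonmasters and
    masters and the CRT false positive and false negative rates.
    """
    n0 = f00 + f10          # examinees the criterion calls nonmasters
    n1 = f01 + f11          # examinees the criterion calls masters
    alpha1 = f10 / n0       # nonmasters whom the CRT classifies as masters
    beta1 = f01 / n1        # masters whom the CRT classifies as nonmasters
    return n0, n1, alpha1, beta1

# Hypothetical observed frequencies for 200 examinees.
print(crt_error_rates(f00=70, f01=15, f10=10, f11=105))
```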
Another possibility is to introduce a third measure, with its associated α and β errors. Seven equations can be written, and there are seven unknowns. Hence one could directly solve for each of the values without having to make an assumption (that the criterion is error-free) which is unlikely to hold in the average situation.

Roudabush (Note 2) presents some interesting data that follow the predictions to be made from this model for the two-measure case. Further work is needed to determine if this is a general finding or an isolated phenomenon and to extend the inquiry to the three-measure case.


The focus of this model is clearly oriented toward the determination of error rates and hence is essentially concerned with the reliability issue. Other models to be taken up later in this paper investigate the impact of test length on error rates. Both types can easily be adapted to an investigation of the impact of the use of different standards by systematically varying the cutting score and observing the predicted measurement errors.

Decision Methods Not Referenced to Mastery Models


The papers discussed in this section omit a statement as to the
type of mastery model that underlies the decision rules. One
suspects that the models may be "mixed," as features of both
continuum and state models are often combined, typically by
carrying out the instruction under an implicit state model, but
evaluating the results with a continuum model. The approaches
reviewed below have been targeted toward a solution of the test-
length problem, but this is inseparable from the standard-setting
problem.

Millman's Binomial-based Decision Model


Millman (1972, 1973) has developed a set of tables based on the binomial distribution that indicate the error rates to be expected for various combinations of true score, test length, and passing score. These are based upon the assumption that the test consists of a random set of 0-1 scored items from some defined universe. The binomial distribution can be used to obtain the dispersions around particular assumed values of true score for given test lengths for large tests, and the hypergeometric distribution for short tests. The familiar analogy of the urn containing two colors of balls applies. The assumed score, or true level of functioning, is analogous to the result that would be obtained if all the balls in the urn were counted. Since this is hypothesized to be impractical, we estimate the universe statistic by means of random samples of some size. The use of either the hypergeometric or binomial distribution allows us to predict the relative frequencies of various proportions of balls of one color were we to pick successive samples of a given size. This is equivalent to giving random samples of a fixed number of items to a student whose proficiency is known and who does not change during the experiment. This allows us to plot the relative frequency of the various percentages of questions answered correctly. A portion of such a table is shown below.
Table 3
Percentage of Students of Various True Levels of Functioning Expected to Be Misclassified at a Passing Score of 80%

                      Student's True Level of Functioning
No. of       (False positive error rates)     (False negative error rates)
Items          40    50    60    70    75        85    90    95
  1            40    50    60    70    75        15    10     5
  2            16    25    36    49    56        28    19    10
  3             6    13    22    34    42        39    27    14
  4             3     6    13    24    32        48    34    19
  5             9    19    34    53    63        16     8     2
  6             4    11    23    42    53        22    11     3
  7             2     6    16    33    44        28    15     4
  8             1     4    11    26    37        34    19     6
  9             —     2     7    20    30        40    23     7
 10             1     5    17    38    53        18     7     1

The error rates in this table are alarmingly high, but this is deceptive because the table starts with true scores and computes error rates by true score. A decision-maker will have observed scores that fall into some sort of distribution. Although procedures exist for the estimation of true scores (Davis, 1964; Hambleton & Novick, 1973), they require an estimate of the reliability of the test. This has been a hotly debated issue (see, e.g., Harris, 1972; Ivens, 1970; Livingston, 1972a, 1972b), mainly because classical test theory bases reliability, in part, on observed-score variance. If a teacher brings everybody up to criterion as measured by posttest, there may be no variance and hence the reliability would be zero. However, even if the decision-maker can get past that problem and achieve a true-score estimate, to evaluate his decision strategy he must multiply the proportion of people "at risk" of misclassification at each of the true-score levels by the number of people he actually has in that category. Thus, although it is disturbing that 53% of those with true-score proficiency of 75% would be misclassified (as false positives) at a cut-point of 80% on a 10-item test, if there are no individuals of this proficiency level in the distribution, the issue is not a problem.
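The entries of the table can be generated directly from the binomial model, as in the sketch below; the function is only an illustration of the computation underlying Millman's tables, not a reproduction of his procedure for the hypergeometric (short-test) case.

```python
from math import comb, ceil

def misclassification_rate(true_level, n_items, passing_score=0.80):
    """Probability that a random n-item test misclassifies a student whose
    true level of functioning is true_level, against the given passing score
    (both expressed as proportions correct), under the binomial model."""
    needed = ceil(n_items * passing_score)      # items required to pass
    p_pass = sum(comb(n_items, x) * true_level ** x * (1 - true_level) ** (n_items - x)
                 for x in range(needed, n_items + 1))
    if true_level < passing_score:
        return p_pass            # a false positive: the student passes anyway
    return 1.0 - p_pass          # a false negative: the student fails anyway

# Matches the table entries quoted in the text, e.g. a true level of 75% on a
# 10-item test is passed (falsely) about 53% of the time.
print(round(100 * misclassification_rate(0.75, 10)))   # 53
print(round(100 * misclassification_rate(0.90, 10)))   # 7
```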
Millman's work answers very nicely the question of error rates relative to various true-score points, but a more important question for the decision-maker concerns the error rates for the entire interval above or below the cutting score.

The Davis and Diamond Bayesian Method


An approach based on a Bayesian model developed by Davis
and Diamond (1974) provided an answer to the above criticism.
They prepared a table that estimated, based only on an observed
score, the probability that a student's true competency level was
at or above a number of chosen levels. Table 4 presents success-
ful classification rates, rather than error rates, for a situation
where no prior information is available.
From this table we see that, for a 5-item test, the probability
that a student with an obtained score of 5 has a true competency
level of .99 and higher is .1135, a rather low figure. The table
also shows the cutting scores that would be necessary at two
levels of correct classification—.50 and .85. The cutting score for
tests up to 20 items, if the decision-maker wished 85% correct
classifications of students with competence levels of 90% and
higher, should be set equal to the number of items in the test.
This goal would not be attained until a test length of more than
12 but less than 20 items were used.

The Work of Novick and Collaborators


A further development based on the Bayesian model has been
provided by Melvin Novick and co-workers (Hambleton & Novick,
1973; Novick, Lewis, & Jackson, 1973; and Novick & Lewis, 1974).
Their model combines features of a number of others discussed
above. Three types of information must be supplied—the cutting
score or criterion level, a prior distribution of performance
probabilities, and relative losses associated with false positive
and false negative errors. The choice of a cutting score based
solely on opinion is, of course, commonplace—the necessity of
supplying a prior distribution is not. The basic notion here is that information exists, in addition to the test score, that can be used to increase the accuracy of the decision process if such (prior) information were used in conjunction with the test score. This prior knowledge or expectation is stated in the form of a probability distribution. The prior distribution is combined with the test-score information to yield a posterior probability distribution which, because it is based on more information, provides more sensitive (and, hopefully, accurate) probability distributions.
The Novick model, in common with Emrick's (1971) state-model approach, is concerned not so much with the absolute probability of false positive and false negative error as with the proportion of the two. These errors occur around the criterion cutting score π₀ and are referred to as errors of "threshold loss." Given an individual's true mastery level, which is indicated as π, the question is to find out whether or not the value of π is greater than π₀ with the lowest acceptable amount of threshold loss. The estimated value of the student's true mastery level is obtained by sampling the items in the domain over which π indicates mastery.

Table 4
Probability of Correct Classification for Selected Levels of Competence, Obtained Scores, and Test Lengths
(The last two columns give the cutting score when the probability of correct categorization of the examinee is .5000 or .8500.)

5-Item Test
Competency Level              Obtained Score
at or Above            5       4       3       2            .5000   .8500
   .99               .1135   .0028   .0000   .0000             5       5
   .95               .3085   .0385   .0026   .0001             5       5
   .90               .5006   .1235   .0174   .0016             5       5
   .85               .6459   .2343   .0505   .0067             5       5
   .80               .7539   .3558   .0940   .0186             5       5

12-Item Test
Competency Level              Obtained Score
at or Above           12      11      10       9             .5000   .8500
   .99               .2300   .0140   .0008   .0000             12      12
   .95               .5509   .1568   .0312   .0043             12      12
   .90               .7784   .4021   .1493   .0398             12      12
   .85               .8950   .6204   .3273   .1292             11      12
   .80               .9524   .7790   .5173   .2660             10      12

20-Item Test
Competency Level              Obtained Score
at or Above           20      19      18      17      16     .5000   .8500
   .99               .3441   .0348   .0033   .0002   .0000     20      20
   .95               .7269   .3202   .1054   .0255   .0048     20      20
   .90               .9135   .6618   .3818   .1720   .0619     19      20
   .85               .9749   .8587   .6537   .4135   .2165     18      19
   .80               .9939   .9479   .8359   .6501   .4364     17      19

Classification of an individual is carried out by calculating the probability that the observed score exceeds the given requirement (Formula 9) and comparing this result to that of a similar function which indicates the probability that the score does not (Formula 8).

a[Prob(π < π₀ | data)].    (8)

b[Prob(π ≥ π₀ | data)].    (9)


In the above relations, a and b are a function of the relative
severity of the two types of errors.
Formulas 8 and 9 take into account the relative severity of
losses associated with false positives (a) and false negatives (b).
Depending on which result is greater, the individual is classified
as a master or a nonmaster. Although no practical methods are
suggested for determining the values of a and b to be used in the
loss function, an approach to the determination of the threshold
loss probabilities has been developed (Novick & Lewis, 1974).
This approach differs from the Davis and Diamond (1974) development in the respect that different prior distributions are used. The point is made that only infrequently would there be so little information available that a uniform prior, as was assumed in the development of Table 4, represents the best available choice. Particularly in the context of a curriculum-embedded test, considerable information is available that could be put to use. An example of the improvement in accuracy of estimation will be apparent upon examination of Tables 5 and 6. The difference between the two is due solely to the use of different assumptions about the prior distribution. Table 5 is based on a uniform prior, and Table 6 is based on the belief that approximately 75% of students will achieve scores equal to or greater than a chosen criterion level of 80%.

Inspection of Table 5 shows that, for a 12-item test and a cutting score of 10, the probability that a student with such a score will exceed a criterion level of 80% is only 50%, whereas the identical entry for Table 6 shows a 75% probability. Of course, the choice of an unreasonable prior distribution yields a misleading result—hence the decision-maker must choose with care. The work of Novick (1973), Novick and Jackson (1974), and Novick, Lewis, and Jackson (1973) can be consulted for help in choosing priors.
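The entries of Tables 5 and 6 follow from a beta prior combined with a binomial likelihood. The sketch below (using scipy, and assuming the β(10.254, 1.746) prior reconstructed for Table 6 below) shows the computation; it is an illustration of the model rather than the authors' own program.

```python
from scipy.stats import beta

def prob_exceeds_criterion(score, n_items, pi0, prior_a=1.0, prior_b=1.0):
    """Posterior probability that the true level of functioning exceeds the
    criterion level pi0, given the observed score and a beta(prior_a, prior_b)
    prior; the default arguments give the uniform prior of Table 5."""
    posterior = beta(prior_a + score, prior_b + n_items - score)
    return 1.0 - posterior.cdf(pi0)

# Ten correct out of twelve against a criterion level of .80: roughly .50
# under the uniform prior (Table 5) and roughly .75 under the informative
# prior (Table 6).
print(prob_exceeds_criterion(10, 12, 0.80))                     # about 0.50
print(prob_exceeds_criterion(10, 12, 0.80, 10.254, 1.746))      # about 0.75
```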
Table 5
Probability That a Student's True Level of Functioning Is Greater Than π₀ Given a Uniform Prior Distribution

Minimum                                         Minimum Criterion Level—π₀
Advancement  No. of      Posterior
Score        Test Items  Distribution     50   55   60   65   70   75   80   85   90   95
  6             8        β(7, 3)          91   85   77   66   54   40   26   14    5    1
  7             8        β(8, 2)          98   96   93   88   80   70   56   40   23    7
  8             8        β(9, 1)         100  100   99   98   96   92   87   77   61   37
  7             9        β(8, 3)          95   90   83   74   62   47   32   18    7    1
  8             9        β(9, 2)          99   98   95   91   85   76   62   46   26    9
  9             9        β(10, 1)        100  100   99   99   97   94   89   80   65   40
  7            10        β(8, 4)          89   81   70   57   43   29   16    7    2    —
  8            10        β(9, 3)          97   93   88   80   69   54   38   22    9    2
  9            10        β(10, 2)         99   99   97   94   89   80   68   51   30   10
  8            11        β(9, 4)          93   87   77   65   51   35   21    9    3    —
  9            11        β(10, 3)         98   96   92   85   75   61   44   26   11    2
 10            11        β(11, 2)        100   99   98   96   92   84   73   56   34   12
  9            12        β(10, 4)         95   91   83   72   58   42   25   12    3    —
 10            12        β(11, 3)         99   97   94   89   80   67   50   31   13    2
 11            12        β(12, 2)        100  100   99   97   94   87   77   60   38   14

Table 6
Probability That a Student's True Level of Functioning Is Greater Than π₀ Given a β(10.254, 1.746) Prior Distribution

Minimum                                                Minimum Criterion Level—π₀
Advancement  No. of      Posterior
Score        Test Items  Distribution            50   55   60   65   70   75   80   85   90   95
  6             8        β(16.254, 3.746)       100  100   98   96   90   78   60   37   15    2
  7             8        β(17.254, 2.746)       100  100  100   99   97   92   81   62   36   10
  8             8        β(18.254, 1.746)       100  100  100  100   99   98   94   85   66   32
  7             9        β(17.254, 3.746)       100  100   99   97   92   82   65   41   17    2
  8             9        β(18.254, 2.746)       100  100  100   99   98   93   84   66   39   11
  9             9        β(19.254, 1.746)       100  100  100  100  100   98   95   87   69   34
  7            10        β(17.254, 4.746)       100   99   97   93   84   68   47   24    7    1
  8            10        β(18.254, 3.746)       100  100   99   98   93   84   68   45   19    3
  9            10        β(19.254, 2.746)       100  100  100   99   98   95   86   69   42   12
  8            11        β(18.254, 4.746)       100   99   98   94   87   72   51   27    8    1
  9            11        β(19.254, 3.746)       100  100  100   98   95   87   72   48   22    3
 10            11        β(20.254, 2.746)       100  100  100  100   99   96   88   72   45   13
  9            12        β(19.254, 4.746)       100  100   99   96   89   76   55   30   10    1
 10            12        β(20.254, 3.746)       100  100  100   99   96   89   75   52   24    4
 11            12        β(21.254, 2.746)       100  100  100  100   99   96   90   75   48   14

To use these tables in a situation where the ratio of a to b is other than one, the results of the decision rules in formulas 8 and 9 would be compared. For example, suppose it were three times more costly to incorrectly advance a student than to incorrectly retain one. Here a is 3, and b is 1. In the example above, where the criterion level was 80%, the advancement score 10, and the test length 12, the false positive side would be 3 × (25); the false negative side would be 1 × (75). Since the two results are identical, one would probably examine the loss ratio a bit more closely. If one were not quite sure that it should be exactly three to one, the decision should be to advance the student.
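The loss-ratio comparison in the example can likewise be sketched; the code below simply evaluates the two sides of decision rules (8) and (9) for the figures used in the text.

```python
from scipy.stats import beta

def threshold_loss_comparison(score, n_items, pi0, a, b, prior_a=1.0, prior_b=1.0):
    """Evaluate a * P(pi < pi0 | data) and b * P(pi >= pi0 | data), the two
    sides of decision rules (8) and (9); the classification goes to the
    smaller expected loss.

    a: relative loss attached to advancing a student in error (false positive)
    b: relative loss attached to retaining a student in error (false negative)
    """
    posterior = beta(prior_a + score, prior_b + n_items - score)
    p_below = posterior.cdf(pi0)               # P(pi < pi0 | data)
    return a * p_below, b * (1.0 - p_below)

# Criterion level .80, score 10 of 12, the Table 6 prior, and losses in a 3:1
# ratio: the two expected losses come out essentially equal, mirroring the
# 3 x (25) versus 1 x (75) comparison in the text.
print(threshold_loss_comparison(10, 12, 0.80, a=3, b=1,
                                prior_a=10.254, prior_b=1.746))
```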

Discussion
This paper has been organized around the views of various
authors regarding the nature of mastery and the way it is ac-
quired because it is clear that differences in such conceptualiza-
tion result in very different approaches to evaluation. Those
utilizing the Continuum model feel that measurement occurs at
a point when the learner is in transition from a lower proficiency
level to a higher one. Depending on the task, it may not make
sense to conceptualize a 100% level of performance. The func-
tion of measurement is to estimate with the best possible ac-
curacy what the individual's proficiency level is and whether it
exceeds a given minimal level of competence or not. When an in-
dividual is identified as having exceeded this minimum level, he
will be shifted to another instructional sequence (unless the test
occurs at the end of a student's education). Since there will be
others who attain more, but are also put into the next sequence,
the learners will be bringing varying accomplishments to the
new task. This suggests that this learning model is not applicable
to hierarchical sets of learning tasks, since an incomplete grasp
at one level tends to insure failure at another.
The State model may appear, at first glance, to represent an
unreasonable approach to learning and evaluation. Perfection
often appears to be something to strive for, but not to reach.
And yet a great deal of what is learned, particularly in situa-
tions where errorless replication will be required, follows this
model. Both the Emrick and Roudabush approaches, though
they have some aspects that would be difficult to quantify,
should be pursued. These are currently the most developed
models known to this writer that are applicable to hierarchical
learning tasks and critical behaviors.
The existence of a model does not imply that the approach can be taken "off the shelf" and applied without further work. Of the approaches reviewed in this paper, only the Nedelsky method has received extensive practical use. The other methods need to be validated to provide users with data on which to make choices. In particular, it is important to know whether the acquisition-function assumptions of the evaluation models are supported by data, whether there exist practical, reliable ways to obtain the information required to use the model, and whether predictions are accurate.
Decision models that are not specifically designed for applica-
tion to either a Continuum or State mastery model are incom-
plete. There is a danger of misapplication of incompletely-stated
models, because a decision procedure may be applied to an essen-
tially incompatible mastery model, with a resulting erroneous
decision. Also, the acquisition function should be useable as part
of the prior knowledge available to the decision-maker. If this
were quantifiable, it could help to improve accuracy of classifi-
cation.
The complexity of the task facing the measurement specialist is hopefully commensurate with the potential gains. If advancement decisions in individualized curriculum programs took into account the various factors that introduce error, it seems reasonable to believe that the quality of the decisions would be increased, resulting in higher efficiency and greater learner satisfaction. This outcome is neither the sole responsibility nor the sole interest of measurement specialists. Just as Baker (1975) makes a convincing argument that the day of the solitary curriculum innovator is past, so is (or should be) the day of the solitary evaluator. To be fully effective, the evaluator needs to be part of the curriculum team from the start, to understand the processes, both theoretical and practical, that are operative. It is also clear that as evaluation models take into account, to increasing extents, the complexities of the educational process, the ability or willingness of local decision-makers to utilize the models adequately will be quickly strained. Decision procedures that are practical and easy to use by instructional personnel must be developed and made a formal part of curricula before the promise of greater accuracy of decisions can be fulfilled.

Reference Notes
1. Setting standard of competence—The minimum pass level, January 1967.
Chicago: University of Illinois, College of Medicine, the Evaluation Unit,
Center for the Study of Medical Education.
2. Roudabush, G. E. Models for a beginning theory of criterion-referenced tests.
Paper presented at the meeting of the National Council for Measurement in
Education, Chicago, April 1974.

References
Baker, R. F. Educational publishing and educational research and development: Selfsame, symbiosis, or separate. Educational Researcher, 1975, 4, 10-13.
Besel, R. A comparison of Emrick and Adam's mastery-learning test model with Kriewall's criterion-referenced test model (Technical Memorandum No. 5-71-04). Inglewood, Calif.: Southwest Regional Laboratory, 1971.
Bloom, B. S., Hastings, J. T., & Madaus, G. F. Handbook on formative and summative evaluation of student learning. New York: McGraw-Hill, 1971.


Cronbach, L. J., & Gleser, G. C. Psychological tests and personnel decisions. Urbana: University of Illinois Press, 1965.
Davis, F. B. Educational measurements and their interpretation. Belmont, Calif.:
Wadsworth, 1964.
Davis, F. B. Criterion-referenced tests: A critique. Paper presented at the meet-
ing of the American Educational Research Association, New York, February,
1971. (ERIC Document Reproduction Service No. ED 050 154, 11 pp.)
Davis, F. B., & Diamond, J. J. The preparation of criterion-referenced tests. In
C. W. Harris, M. C. Alkin, & W. J. Popham (Eds.), Problems in criterion-
referenced measurement. Los Angeles: UCLA Graduate School of Education,
Center for the Study of Evaluation, 1974.
Ebel, R. L. Criterion-referenced measurement: Limitations. School Review, 1971, 79, 282-288.
Ebel, R. L. Essentials of educational measurement. Englewood Cliffs, N. J.:
Prentice-Hall, 1972.
Emrick, J. A. An evaluation model for mastery testing. Journal of Educational
Measurement, 1971, 8, 321-326.
Glaser, R. Instructional technology and the measurement of learning outcomes. American Psychologist, 1963, 18, 519-521.
Glaser, R., & Nitko, A. J. Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971, 625-670.
Hambleton, R. K. A review of testing and decision-making procedures for selected individualized instructional programs (ACT Technical Bulletin No. 15). Iowa City, Iowa: The American College Testing Program, 1973.
Hambleton, R. K., & Novick, M. R. Toward an integration of theory and method
for criterion-referenced tests. Journal of Educational Measurement, 1973,
10, 159-170. (ERIC Document Reproduction Service No. ED 072 117, 15 pp.)
Harris, C. W. An interpretation of Livingston's coefficient for criterion-
referenced tests. Journal of Educational Measurement, 1972, 9, 27-29.
Hively, W., Maxwell, G., Rabehl, G., Sension, D., & Lundin, S. Domain-referenced
curriculum evaluation: A technical handbook and a case study from the
MINNEMAST project. Los Angeles: UCLA Graduate School of Education,
Center for the Study of Evaluation, 1973.
Ivens, S. H. A pragmatic approach to criterion-referenced measures, 1970. (ERIC Document Reproduction Service No. ED 064 406, 9 pp.)
Kriewall, T. E. Aspects and applications of criterion-referenced tests. Downers Grove, Ill.: Institute for Educational Research, April 1972. (ERIC Document Reproduction Service No. ED 063 333, 27 pp.)
Levine, H. G., & Forman, P. M. A study of retention of knowledge of neurosciences information. Journal of Medical Education, 1973, 48, 867-869.
Livingston, S. A. Criterion-referenced applications of classical test theory.
Journal of Educational Measurement, 1972, 9, 13-26. (a)
Livingston, S. A. A reply to Harris' "An interpretation of Livingston's reliability
coefficient for criterion referenced tests." Journal of Educational Measure-
ment, 1972, 9, 31. (b)
Meskauskas, J. A., & Webster, G. W. The American Board of Internal Medicine
recertification examination process and results. Annals of Internal Medicine,
1975, 82, 577-581.
Millman, J. Tables for determining number of items needed on domain-referenced
tests and number of students to be tested (Technical Paper No. 5). Los Angeles:
Instructional Objective Exchange, April, 1972.
Millman, J. Passing scores and test lengths for domain-referenced measures. Review of Educational Research, 1973, 43, 205-216. (ERIC Document Reproduction Service No. ED 065 555, 17 pp.)
Nedelsky, L. Absolute grading standards for objective tests. Educational and Psychological Measurement, 1954, 14, 3-19.


Nitko, A. J. Criterion-referenced testing in the context of instruction. In Testing


in turmoil: A conference on problems and issues in educational measure-
ment. The thirty-fifth annual conference of Educational Records Bureau,
October 1970. Framingham, Mass.: Educational Records Bureau, 1970.
Novick, M. R. High school attainment: An example of a computer-assisted
Bayesian approach to data analysis. International Statistical Review, 1973,
41, 264-271.
Novick, M. R., & Jackson, P. H. Statistical methods for educational and psycho-
logical research. New York: McGraw-Hill, 1974.
Novick, M. R., & Lewis, C. Prescribing test length for criterion-referenced
measurement. In C. W. Harris, M. C. Alkin, & W. J. Popham (Eds.), Problems
in criterion-referenced measurement. Los Angeles: UCLA Graduate School of
Education, Center for the Study of Evaluation, 1974.
Novick, M. R., Lewis, C., & Jackson, P. H. The estimation of proportions in m groups. Psychometrika, 1973, 38, 19-46.
Taylor, D. D., Reid, J. C., Senhauser, D. A., & Shively, J. A. Use of minimum pass levels on pathology examinations. Journal of Medical Education, 1971, 46, 876-881.

AUTHOR
JOHN A. MESKAUSKAS Address: American Board of Internal Medicine, 3930
Chestnut Street, Philadelphia, Pa. 19104. Title: Assistant Director of Re-
search and Development. Degrees: B.S., M.S., Illinois Institute of Technology.
Specialization: Learning and measurement.

