Evaluation Models For Criterion-Referenced Testing: Views Regarding Mastery and Standard-Setting
John A. Meskauskas
Continuum Models
The characteristics common to all Continuum criterion-
referenced models, as defined in this paper, are the following:
1. Mastery is viewed as a continuously-distributed ability or
set of abilities.
2. An area is identified at the upper end of this continuum, and
if an individual equals or exceeds the lower bound of this
area, he is termed a master.
3. The goal of measurement is to obtain information for the
purposes of educational decision-making, which explicitly
follows the classification decision.
Four relevance categories and three difficulty levels (easy, medium, and hard) are identified. This forms a
3 × 4 grid into which all questions are classified, based on raters'
judgments of the relevance and difficulty of the questions for
the minimally qualified examinee. Judgments are also made, for
each cell in the table, regarding the percentage of items in the
cell that the minimally-qualified candidate should be able to
answer. The number of questions in each cell is multiplied by the
appropriate percentage, and the sum of all cells is divided by the
total number of questions to yield the lowest passing score. This
is expressed in terms of percentage correct.
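The arithmetic can be sketched as follows. The cell counts and percentages here are hypothetical ratings invented for illustration, not values from Ebel or from this paper:

```python
# Sketch of Ebel's standard-setting computation with hypothetical ratings.
# Each (relevance, difficulty) cell holds the number of items so classified
# and the percentage of them a minimally qualified examinee should answer.
cells = {
    ("essential",  "easy"):   (10, 90),
    ("essential",  "medium"): (5,  80),
    ("important",  "easy"):   (8,  80),
    ("important",  "medium"): (7,  60),
    ("important",  "hard"):   (4,  40),
    ("acceptable", "medium"): (4,  50),
    ("acceptable", "hard"):   (2,  30),
}

total_items = sum(n for n, _ in cells.values())
expected_correct = sum(n * pct / 100 for n, pct in cells.values())
lowest_passing_score = 100 * expected_correct / total_items  # % correct
```

With these ratings, 27.8 of 40 items are expected correct, so the lowest passing score works out to 69.5 percent.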
One gathers the impression that Ebel is not committed to the
particular descriptors used along the two dimensions. Many test
constructors may wish to use somewhat different descriptors, as
the inclusion, in a test, of a category of items judged of question-
able relevance seems hard to defend. The Relevance dimension is
probably not independent of the Difficulty dimension. One would
think that, although Acceptable items can be of any degree of
difficulty, Essential items can only be judged Easy when used for
a postinstruction examination.
In Ebel's method, the judge must simulate the decision pro-
cess of the examinee to obtain an accurate judgment and thus set
an appropriate standard. Since the judge is more knowledgeable
than the minimally-qualified individual, and since he is not
forced to make a decision about each of the alternatives, it seems
likely that the judge would tend to systematically over-simplify
the examinee's task. Whereas the examinee has to choose among
a number of alternatives, the judge's tendency is to consider only
the correct answer in relation to the stem. Thus, the judge's
rating process is transformed from a consideration of how dif-
ficult a question is when considered in relation to its distractors
to merely the difficulty of the correct answer. Even if this occurs
only occasionally, it appears likely that, in contrast to the
Nedelsky method, the Ebel method would allow the rater to
ignore some of the fine discriminations that an examinee needs
to make and would result in a standard that is more difficult to
reach. However, perhaps the most troublesome feature of Ebel's
method is the requirement that a separate judgment be made
about the percentage of items in each cell along the relevance/
difficulty continuum that the minimally-qualified examinee
should be required to answer. Unless there are external criteria
upon which to base this judgment, it seems entirely arbitrary.
The binomial probability of a test score x is

b(x) = C(n, x) pˣ q⁽ⁿ⁻ˣ⁾,

where
x = a test score,
n = the number of questions in the test,
p = z = proficiency of an individual, such that 0 ≤ z ≤ 1.00, and
q = z′ = error rate of an individual = 1 − z.
Kriewall's model states that the binomial distribution can be
used to determine the probability of several events. One must
decide upon values for ranges of performance on a test which
one will accept as being indicative of mastery and nonmastery.
Thus Z₁ is the lower bound of the mastery range (expressed as a
proportion of errors, e.g., .2) and Z₂ the upper bound of the non-
mastery range. One must also define C, the standard that will be
used to determine pass-fail, which is the maximum number of
allowable errors for masters. The recommended value of C is
midway between Z₁ and Z₂. (Kriewall's emphasis on the errors
that individuals make on a test is unusual. Most authors, and in-
deed most people, wish to know how many questions a person
gets correct instead.) Given acceptable values for the above three
variables, the binomial distribution can be used to determine the
probability of several events. The following relation gives the
probability that an individual's observed number of errors, W,
falls below the error criterion C:

S = Prob(W < C) = Σ_{w=0}^{C−1} C(n, w) z⁽ⁿ⁻ʷ⁾ (1 − z)ʷ,   (3)
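Equation (3) can be evaluated directly. The test length and bounds below are illustrative assumptions, not values taken from Kriewall:

```python
from math import comb

def prob_meets_error_criterion(n, z, c):
    """P(W < c): probability that an examinee of proficiency z makes
    fewer than c errors on an n-item test, per the binomial relation (3)."""
    return sum(comb(n, w) * z**(n - w) * (1 - z)**w for w in range(c))

# Illustrative: Z1 = .10 and Z2 = .30, so C is set midway at .20 of a
# 20-item test, i.e., 4 allowable errors.
n, c = 20, 4
for z in (0.95, 0.85, 0.70):        # proficiency = 1 - error rate
    print(z, round(prob_meets_error_criterion(n, z, c), 3))
```

For these values an examinee at z = .95 meets the criterion with probability .984, while one at z = .70 does so only about 11 percent of the time.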
State Models
A widely-held view among educators conceptualizes mastery
as an all-or-none description of the student's learning state with
respect to a specified content domain. This is a natural out-
growth of the movement toward setting explicit educational
goals that should be met by all (or essentially all) students.
Bloom, Hastings, and Madaus (1971), for example, argue very
persuasively that the failure of students to achieve certain ob-
jectives may be the result of an educational process that is in-
sensitive to the educational needs and styles of the students.
The argument is advanced that an individualized educational
program will bring substantially all students (Bloom et al. feel
it may be 90%) to a mastery state.
There are several intertwined issues in the state-model posi-
tion which must be separated out for clarity. The first is that
the model implicitly demands 100% performance. Davis and
Diamond (1974, p. 133) note that:
Strictly speaking, mastery is defined as complete know-
ledge, skill, or control; so "partial mastery" is as self-
contradictory a phrase as "partial uniqueness." The
term "mastery," therefore, should be used to describe the
status of only those examinees who, it may be inferred,
can mark correctly all the items in the population of
which the subset that makes up a criterion-referenced
test is a representative sample. (p. 133)
time, since the function is not continuous and has a slope of zero
at all but one point. The only logical question to be asked is
whether or not an individual has achieved mastery. But this
introduces a paradox. The model is seeking "perfection" in an im-
perfect world with imperfect measuring instruments. How are
we to deal with this problem?
The most common answer has undoubtedly been an intuitive
one. The decision-maker chooses a level of performance that
"seems right," so that perfect performance is not demanded, yet
there is an intuitively reasonable assurance that those who
reach the chosen level have, in fact, achieved a state of mastery.
This has not only been true for individual teachers, but for large
curriculum programs as well. Hambleton (1973) reviewed
Glaser's decision rules used for Individually Prescribed Instruc-
tion (IPI), Flanagan's Program for Learning in Accordance with
Needs (PLAN), and those of Carroll, Bloom, and Block concern-
ing Mastery Learning and found that either program-wide deci-
sions were made on the standard to be used (80-85% for IPI) or
the decision was left to each individual teacher.
The alternative approach is to construct a decision model
which takes into account factors that introduce measurement
error—the deficiencies of the examination and the functioning
of random processes of various types. The models to follow are
of this type.
K = [log((1 − α)/β) + (1/n) log RR] / log[(1 − α)(1 − β)/(αβ)]   (6)

where
K = the cut point expressed as a percentage score on the test,
α = estimated probability of Type 1 item error (false positive error),
β = estimated probability of Type 2 item error (false negative error),
RR = ratio of regret of Type 2 to Type 1 decision errors, and
n = test length (number of items).
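In computational form, assuming Equation (6) reads K = [log((1 − α)/β) + (1/n) log RR] / log[(1 − α)(1 − β)/(αβ)], and with illustrative error rates and regret ratio not taken from Emrick:

```python
from math import log

def emrick_cut_score(alpha, beta, rr, n):
    """Cut point K (proportion correct) per Equation (6):
    K = [log((1-a)/b) + (1/n) log RR] / log[(1-a)(1-b)/(a*b)]."""
    numerator = log((1 - alpha) / beta) + log(rr) / n
    denominator = log((1 - alpha) * (1 - beta) / (alpha * beta))
    return numerator / denominator

# Illustrative: 10% item-error rates of both types and equal regret
# (RR = 1) place the cut point halfway between the expected proportions
# correct of nonmasters (alpha = .10) and masters (1 - beta = .90).
k = emrick_cut_score(0.10, 0.10, 1.0, n=20)   # ≈ 0.5
```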
It is ironic that most of the published individualized instruc-
tion or other curriculum programs are strongly oriented toward
the state-model of evaluation, yet adopt an arbitrary approxima-
tion as to the level of performance that is indicative of mastery.
Emrick's model provides a firm methodological case for this type
of instructional approach and seems worthwhile to pursue. How-
ever, empirical quantification of the variables is likely to be a
difficult and time-consuming matter.
Roudabush's Dichotomous True-Score Models
Two forms of state models were considered by Roudabush
(Note 2). The first form involves a dichotomous measure of a
dichotomous true score. This, of course, is the case of a one-item
test. Although no decision rules are developed for this case,
Table 2
Expected Frequencies for Error-Free CRT and Criterion Performance

                            Performance on Criterion
                                 0          1
Performance on CRT    0        NQ₀          0
                      1         0          NQ₁

(N examinees in all; Q₀ and Q₁ are the proportions of true nonmasters and masters.)
let
α₁ = P(X ≥ X_c | T = 0) = the probability that nonmasters show mastery on the CRT,
α₂ = P(X ≥ X_c | T = 0) = the probability that nonmasters show mastery on the criterion,
β₁ = P(X < X_c | T = 1) = the probability that masters show nonmastery on the CRT, and
β₂ = P(X < X_c | T = 1) = the probability that masters show nonmastery on the criterion,

where
T = a true score,
X = an obtained score (on the CRT or on the criterion, respectively), and
X_c = the cutting score or standard.
Given the above definitions, Roudabush (Note 2) notes that the
relations below can be written:
Table 3
Percentage of Students of Various True Levels of Functioning Expected to Be
Misclassified at a Passing Score of 80%

Passing  No. of        Student's True Level of Functioning (%)
Score    Items     40    50    60    70    75    85    90    95
  1        1       40    50    60    70    75    15    10     5
  2        2       16    25    36    49    56    28    19    10
  3        3        6    13    22    34    42    39    27    14
  4        4        3     6    13    24    32    48    34    19
  4        5        9    19    34    53    63    16     8     2
  5        6        4    11    23    42    53    22    11     3
  6        7        2     6    16    33    44    28    15     4
  7        8        1     4    11    26    37    34    19     6
  8        9        0     2     7    20    30    40    23     7
  8       10        1     5    17    38    53    18     7     1
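The entries of Table 3 follow from the binomial model, and a short sketch reproduces them; rounding to the nearest percent is assumed:

```python
from math import comb, ceil

def misclassification_pct(n_items, true_level, passing_pct=0.80):
    """Chance (in percent) that an examinee is classified on the wrong side
    of the standard, given a true proportion-correct level of functioning."""
    cut = ceil(passing_pct * n_items)            # items needed to pass
    p_pass = sum(comb(n_items, k) * true_level**k
                 * (1 - true_level)**(n_items - k)
                 for k in range(cut, n_items + 1))
    # Below the standard, passing is the error; at or above it, failing is.
    return 100 * (p_pass if true_level < passing_pct else 1 - p_pass)
```

For example, a student whose true level is 40% has about a 9% chance of passing a 5-item test, and one at 85% has about a 16% chance of failing it, matching the tabled values.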
tion for the decision-maker concerns the error rates for the
entire interval above or below the cutting score.
Table 4
Probability of Correct Classification for Selected Levels of
Competence, Obtained Scores, and Test Lengths
[Table body not legible in this copy. The columns gave, for obtained
scores on 5-item (5-2), 12-item (12-9), and 20-item (20-16) tests, the
cutting score when the probability of correct categorization of an
examinee at or above the obtained score is .5000 or .8500, by competency
level.]
Minimum
Advancement  No. of       Posterior
Score        Test Items   Distribution    50   55   60   65   70   75   80   85   90   95
  6            8          β(7, 3)         91   85   77   66   54   40   26   14    5    1
  7            8          β(8, 2)         98   96   93   88   80   70   56   40   23    7
  8            8          β(9, 1)        100  100   99   98   96   92   87   77   61   37
  7            9          β(8, 3)         95   90   83   74   62   47   32   18    7    1
  8            9          β(9, 2)         99   98   95   91   85   76   62   46   26    9
  9            9          β(10, 1)       100  100   99   99   97   94   89   80   65   40
  7           10          β(8, 4)         89   81   70   57   43   29   16    7    2    —
  8           10          β(9, 3)         97   93   88   80   69   54   38   22    9    2
  9           10          β(10, 2)        99   99   97   94   89   80   68   51   30   10
  8           11          β(9, 4)         93   87   77   65   51   35   21    9    3    —
  9           11          β(10, 3)        98   96   92   85   75   61   44   26   11    2
 10           11          β(11, 2)       100   99   98   96   92   84   73   56   34   12
  9           12          β(10, 4)        95   91   83   72   58   42   25   12    3    —
 10           12          β(11, 3)        99   97   94   89   80   67   50   31   13    2
 11           12          β(12, 2)       100  100   99   97   94   87   77   60   38   14
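The tabled values are tail areas of beta posterior distributions, and with integer parameters they can be checked exactly through the binomial identity P(p ≤ c) for Beta(a, b) = P(Bin(a + b − 1, c) ≥ a); the uniform-prior reading below is an inference from the β(x + 1, n − x + 1) pattern of the rows:

```python
from math import comb

def beta_tail(a, b, c):
    """P(p > c) for a Beta(a, b) distribution with integer a and b,
    via the identity P(p <= c) = P(Bin(a+b-1, c) >= a)."""
    n = a + b - 1
    return sum(comb(n, j) * c**j * (1 - c)**(n - j) for j in range(a))

# A score of 6 on an 8-item test under a uniform prior yields a Beta(7, 3)
# posterior; the chance that true competence exceeds 50%:
p = beta_tail(7, 3, 0.50)    # ≈ 0.91, matching the table's first entry
```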
Discussion
This paper has been organized around the views of various
authors regarding the nature of mastery and the way it is ac-
quired because it is clear that differences in such conceptualiza-
tion result in very different approaches to evaluation. Those
utilizing the Continuum model feel that measurement occurs at
a point when the learner is in transition from a lower proficiency
level to a higher one. Depending on the task, it may not make
sense to conceptualize a 100% level of performance. The func-
tion of measurement is to estimate with the best possible ac-
curacy what the individual's proficiency level is and whether it
exceeds a given minimal level of competence or not. When an in-
dividual is identified as having exceeded this minimum level, he
will be shifted to another instructional sequence (unless the test
occurs at the end of a student's education). Since there will be
others who attain more, but are also put into the next sequence,
the learners will be bringing varying accomplishments to the
new task. This suggests that this learning model is not applicable
to hierarchical sets of learning tasks, since an incomplete grasp
at one level tends to ensure failure at another.
The State model may appear, at first glance, to represent an
unreasonable approach to learning and evaluation. Perfection
often appears to be something to strive for, but not to reach.
And yet a great deal of what is learned, particularly in situa-
tions where errorless replication will be required, follows this
model. Both the Emrick and Roudabush approaches, though
they have some aspects that would be difficult to quantify,
should be pursued. These are currently the most developed
models known to this writer that are applicable to hierarchical
learning tasks and critical behaviors.
The existence of a model does not imply that the approach can
be taken "off-the-shelf" and applied without further work. Of
the approaches reviewed in this paper, only the Nedelsky method
has received extensive practical use. The other methods need to
be validated to provide users with data on which to make choices.
In particular, it is important to know whether the acquisition-
function assumptions of the evaluation models are supported by
data, whether there exist practical, reliable ways to obtain the
Reference Notes
1. Setting standard of competence—The minimum pass level, January 1967.
Chicago: University of Illinois, College of Medicine, the Evaluation Unit,
Center for the Study of Medical Education.
2. Roudabush, G. E. Models for a beginning theory of criterion-referenced tests.
Paper presented at the meeting of the National Council for Measurement in
Education, Chicago, April 1974.
References
Baker, R. F. Educational publishing and educational research and development:
Selfsame, symbiosis, or separate. Educational Researcher, 1975, 4, 10-13.
Besel, R. A comparison of Emrick and Adams' mastery-learning test model with
Kriewall's criterion-referenced test model (Technical Memorandum No.
5-71-04). Inglewood, Calif.: Southwest Regional Laboratory, 1971.
Bloom, B. S., Hastings, J. T., & Madaus, G. F. Handbook on formative and summa-
tive evaluation of student learning. New York: McGraw-Hill, 1971.
AUTHOR
JOHN A. MESKAUSKAS Address: American Board of Internal Medicine, 3930
Chestnut Street, Philadelphia, Pa. 19104. Title: Assistant Director of Re-
search and Development. Degrees: B.S., M.S., Illinois Institute of Technology.
Specialization: Learning and measurement.