Professional Documents
Culture Documents
Criterion Referenced Tests
Criterion Referenced Tests
Criterion Referenced Tests
OF RELIABILITY
Khairuddin
Department of Medical Record and Health Information
State Health Polytechnic of Malang
achievement tests will be directly based on and midpoints in a course as an achievement test
course objectives and will therefore be criterion- at the end.
referenced. Such tests will typically be Perhaps the most effective use of a
administered at the end of a course to determine diagnostic test is to report the performance level
how effectively students have mastered the on each objective (in a percentage) to each
instructional objectives. student so that they can decide how and where
Achievement tests must not only be to most profitably invest their time and energy.
very specifically designed to measure the For example, telling the student that she scored
objectives of a given course, but also must be 100% on the first objective (selecting the main
flexible enough to help teachers readily respond idea of a paragraph) but only 20% of the second
to what they learn from the tests about the objective (guessing vocabulary from context)
students’ abilities, the students’ needs, and the would tell that student that she is good at finding
students’ learning of the course objectives. In the main idea of a paragraph but needs to focus
other words, a good achievement test can tell her energy on guessing vocabulary from
teachers a great deal about their students’ context.
achievement and about the adequacy of the It would also be useful to report the
course. Hence, while achievement tests should average performance level for each class on each
definitely be used to make decisions about objective (in percentage terms) to the teacher(s)
students’ levels of learning, they can also be along with indications of which students have
used to affect curriculum changes. particular strengths or weaknesses on each
From time to time, teachers may also objective.
take an interest in assessing the strengths and
weaknesses of each individual student in terms Interpretation of CRT
of the instructional objectives for the purpose of The interpretation of scores on a CRT
correcting an individual’s deficiencies “before it is considered absolute. Each student’s score is
is too late.” To that end, classroom-level meaningful without reference to the other
diagnostic decisions are typically made at the student’s scores, like in NRT interpretation. In
beginning or middle of the term and are aimed other words, a student’s score in a particular
at fostering achievement by promoting strengths objective indicates the per cent of the knowledge
and eliminating the weaknesses of individual or skill in that objective that the student has
students. Naturally, the primary concern of the learned. Moreover, the distribution of scores on
teacher must be the entire student. Clearly, this a CRT need not necessarily be normal. If all the
last category of decision is concerned with students know 100% of the material on all the
diagnosing problems that students may be objectives, then all the students should receive
having in leaning process. While diagnostic the same score with no variation at all. The
decisions are definitely related to achievement, purpose of a CRT is to measure the amount of
diagnostic testing often requires more detailed learning that a student has accomplished on each
information about which specific objectives objective. In most cases, the students should
students can already do well and which they still know in advance what type of questions, tasks,
need to work on. The purpose is to help students and content to expect for each objective because
and their teachers to focus their efforts where the question content should be implied (if not
they will be most effective. explicitly stated) in the objectives of the course.
As with achievement tests, diagnostic In terms of the type of interpretation,
tests are designed to determine the degree to each student’s performance on a CRT is
which the specific instructional objectives of the compared to a particular criterion in absolute
course have already been accomplished. Hence, terms. Some confusion has developed over the
they should be criterion-referenced in nature. years about what criterion in a criterion-
While achievement decisions are usually referenced testing refers to. This confusion is
focused on the degree to which the objectives understandable because two definitions have
have been accomplished at the end of the evolved for criterion. For some authors, the
program or course, diagnostic decisions are material that the students are supposed to learn
normally made along the way as the students are in a particular course is the criterion against
learning the language. As a result, diagnostic which they are being measured. For other
tests are typically administered at the beginning authors, the term criterion refers to the standard,
or in the middle of a language course. In fact, if also called a criterion level or cut-point against
well constructed to reflect the instructional of which each student’s performance is judged. For
objectives, one CRT in three equivalent forms instance, if the cut-point for passing a CRT is set
could serve as a diagnostic tool at the beginning at 70 per cent, that is the criterion level.
students stay on track, and the test results should Mean score of odd-numbered items: 9.2
provide useful feedback to both groups on the Standard deviation: 2.6
effectiveness of the teaching and learning Mean score of even-numbered items: 8.1
[processes. In short, teaching to the test, if the Standard deviation: 2.8
test is a well developed CRT, should help the Correlation coefficient split-half reliability: 0.64
teachers and the students rather than constrain (Pearson-product moment)
them. Full-test reliability: 0.80
(Cronbach alpha)
Reliability of CRT Internal consistency reliability
As noted previously, CRTs will not estimates the consistency of a test using only
necessarily produce normal distributions, information internal to a test, that is available on
especially if they are functioning correctly. If all one administration of a single test. Split-half
the students have learned the material, the tester method is one way to estimate the consistency
would like them all to score 100 per cent on the or reliability of an NRT test. The procedure
end-of-course achievement CRT. Hence, a CRT requires a more or less normal distribution of
that produces little variance in scores is an ideal scores to the greatest extent possible, hence a
that test developers seek in developing CRTs. In large standard deviation. As the procedure, a
other words, a low standard deviation on the single test is divided into two parts on the basis
course post-test may actually be a positive of odd-numbered and even-numbered items.
byproduct of developing a sound CRT. This is The odd-numbered and even-numbered items
quite the opposite of the goals and results when are scored separately as though they were two
developing a good NRT, which ideally should different forms. A correlation coefficient is then
approximate a normal distribution of scores to calculated for the two sets of scores. This
the greatest extent possible (Brown 2005). coefficient gives the reliability for either the
As far back as Popham and Husek odd-numbered items or the even-numbered
(1969) the appropriateness of using correlational items—either half, but just half of the test. The
strategies for estimating the reliability of CRTs adjustment of the half-test correlation to
was questioned, because such analyses all estimate the full-test reliability is accomplished
depend in one way or another on normal by using the Cronbach Alpha coefficient, one of
distribution and a large standard deviation. the others.
Consider the test-retest and equivalent-forms The mean of the 30 students’ NRT-
strategies. In both cases, a correlation coefficient based scores in the odd category is 9.20, and in
is calculated. Since correlation coefficient are the even category 8.10. The standard deviation
designed to estimate the degree to which to sets for either group respectively is 2.66 and 2.80.
of numbers go together, scores that are tightly The split-half reliability is 0.66, and when
grouped, because of skewing or homogeneity of converted to full-test reliability using Cronbach
ability levels, will probably not line the students alpha coefficient, a reliability level of 0.80 is
up in similar ways. As that standards deviation achieved. 0.80 coefficient point is a high
approaches zero, so do any associated reliability level given the highest point of
correlation coefficients. Correlation coefficients correlation coefficient is 1. So interpretation can
used for estimating interrater and intrarater be made, that the test developed, by moving the
reliability will be similarly affected by such decimal two places to the right, is 80 per cent
circumstances. Therefore, reliability measure consistent or reliable with 20 per cent
strategies in NRT may be quite inappropriate for measurement error.
CRTs because CRTs are not developed for the Now look when the same procedure is
purpose of producing variance in scores. applied to CRT-based scores, where distribution
The following example split-half of the scores is not normal but in fact clustered
reliability case would show how the application around one extreme, to estimate the internal
of NRT-based reliability estimate strategy is consistency of the test through split-half
inappropriate when applied to measure the method.
reliability of a CRT. Odd-numbered items: 13, 13, 13, 12, 12, 13, 12,
Odd-numbered: 13, 13, 12, 12, 12, 11, 12, 11, 12, 12, 12, 12, 12, 12, 12,
12, 10, 9, 10, 11, 11, 9, 9, 8, 8, 12, 13, 13, 12, 13, 13, 13,
9, 9, , 5, 8, 8, 6, 6, 9, 7, 6, 3, 5 13, 13, 13, 13, 13, 13, 13,
Even-numbered: 14, 14, 14, 12, 12, 10, 9, 9, 7, 13, 12
8, 9, 8, 7, 7, 8, 8, 8, 8, 7, 6, 9, 6, Even-numbered items: 12, 12, 12, 12, 12, 12, 12,
6, 7, 7, 3, 5, 5, 7, 3 12, 12, 12, 13, 13, 13, 12,
Number of students: 30 12, 12, 12, 12, 14, 14, 13,
Number of items in each group: 15
13, 13, 13, 13, 13, 13, 13, Fortunately many other strategies have
13, 12 been worked out for investigating the
Number of students: 30 consistency of criterion-referenced test—
Number of items in each group: 15 strategies that do not depend on a high standard
Mean score of odd-numbered items: 12.56 deviation. In general these approaches fall into
Standard deviation: 0.50 three categories (Berk 1984, p. 235 in Brown
Mean score of even-numbered items: 12.53 2005): threshold loss agreement, squared-error
Standard diation: 0.56 loss agreement, and domain score
Correlation coefficient split-half reliability: 0.42 dependability. These three strategies have been
(Pearson-product moment) developed specifically for CRT consistency
Full-test reliability: 0.60 estimation. However, they are not described in
(Cronbach alpha) the present article for to attempt so requires at
The mean of the 30 students’ CRT- least as many pages as have been used for this
based scores in the odd-numbered category is article this far which is not affordable here. For
12.56, and in the even category 12.53. Looking the present, suffice it to be aware of the
at the superficial odd-numbered and even- problematic application of NRT-related
numbered paired scores and at the mean value of reliability estimates for estimating CRT-related
both scores groups, the numbers impress us of a reliability.
very high consistency between the two groups of
scores. In other words, when we look at the Conclusion
distribution of scores in both score groups we CRT has very different use compared
find very slim difference between the students’ with NRT that Classroom language teachers
performance as shown in the odd-numbered should be very aware of. Its main purpose is to
group and in the even-numbered group. measure instructional objectives achievement by
Therefore, we would expect to get a high students. The criterion set up as reference to
reliability level of the test. However, because the judge students’ performance should be specified
scores are too far from normal distribution and in details before translated or transformed into
are spread too largely, as indicated by the numbers or points as representing minimum
standard deviation values of 0.53 for odd- criteria to base the students’ performance
numbered items, and 0.56 for even-numbered judgment so that achievement decisions such, as
items. The split-half reliability derived from passing or failing the test, can be convincingly
applying pearson correlation formula is 0.42, made. Reliability estimate of a test developed
and when converted to full-test reliability using should be afforded by applying appropriate
Cronbach alpha coefficient, a reliability level of formula. CRTs have very different
0.60 is achieved. 0.6 coefficient point is a low characteristics from NRTs in terms of the
reliability level given the highest point of distribution of scores and their dispersion, where
correlation coefficient is 1. So interpretation can CRTs do not expect the test results to
be made that the test developed, by moving approximate normal distribution of scores nor
decimal two places to the right, is only 60 per great extent of scores dispersion or a large
cent consistent or reliable with 40 per cent standard deviation value. Therefore applying
measurement error. This result of calculation inappropriate procedures or formulas to measure
make us ask ourselves a question: How could a the reliability of a CRT may tell wrong
test which seemingly has a very high internal information, and thus raise a question about the
consistency through looking at the slim quality of the test developed.
difference between the two scores distributions
and through the close mean value of the two
score groups, but have a very low level of
reliability? This unexpected result is due to the
problematic use of reliability measure of NRT
when applied in CRT.
References:
Brown, Douglass. 2004. Language Assessment, Principles and Classroom Practice. San Francisco:
Longman
Cohen, Andrew. 1996. Assessing Language Ability in the Classroom. Boston: Heinle & Heinle
Publishers.
Djiwandono, Soenardi. 2007. Tes Bahasa: Pegangan Bagi Pengajar Bahasa. Universitas Negeri
Malang: Draft.
Khaeruddin is a lecturer at Department of Medical Record and Health Information, State Health
Polytechnic of Malang, Indonesia. He earned his Bachelor in Syiah Kuala University Aceh in and
Master degree in State University of Malang. In 2008-2009 he joined the Visiting Scholar Program at
Indiana University USA. His scholarly interest covers the areas of psycholinguistics, language
testing, and translation studies. He can be contacted at fadilkhairuddin@yahoo.com.