
WEEK 3 – RELIABILITY

PAGE (176-192)

Kuder-Richardson reliability coefficients

There is a way of estimating the average of all the possible split-half coefficients on the
basis of the statistical characteristics of the test items. This approach, which was developed
by Kuder and Richardson (1937), involves computing the means and variances of the items
that constitute the test. The mean of a dichotomous item (one that is scored as either right or
wrong) is the proportion, symbolized as p, of individuals who answer the item correctly. The
proportion of individuals who answer the item incorrectly is equal to 1 - p, and is
symbolized as q. The variance of a dichotomous item is the product of these two proportions,
or pq. The reliability coefficient provided by Kuder-Richardson formula 20 (KR-20) is based
on the ratio of the sum of the item variances to the total test score variance:
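KR-20 = [k / (k - 1)] × [1 - (Σpq / s²x)]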

k : the number of items on the test


Σpq : the sum of the item variances
s²x : the total test score variance
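
As a brief illustration, the KR-20 computation can be sketched in Python; the item response matrix below is invented purely for the example (rows are test takers, columns are dichotomously scored items):

```python
import numpy as np

# Hypothetical 0/1 item responses: rows = test takers, columns = items.
items = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
])

k = items.shape[1]                        # number of items on the test
p = items.mean(axis=0)                    # proportion answering each item correctly
q = 1 - p                                 # proportion answering each item incorrectly
sum_pq = (p * q).sum()                    # Σpq, the sum of the item variances
total_score_variance = items.sum(axis=1).var()   # s²x, variance of total scores

kr20 = (k / (k - 1)) * (1 - sum_pq / total_score_variance)
print(f"KR-20 = {kr20:.3f}")
```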

The Kuder-Richardson formulae make certain assumptions, specifically that the items
are of nearly equal difficulty and are independent of each other. If the items are of equal
difficulty, the reliability coefficient can be computed with a formula that is easier to apply
and requires less information. This formula, Kuder-Richardson formula 21 (KR-21), is as
follows:
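KR-21 = [k / (k - 1)] × [1 - (x̄ (k - x̄) / (k × s²x))]

where x̄ is the mean of the total test scores, k is the number of items, and s²x is the total test score variance.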

This formula will generally yield a reliability coefficient that is lower than that
given by KR-20. Like the Guttman split-half coefficient, the Kuder-Richardson formulae are
based on total score variance, and thus they do not require any correction for length.

Coefficient alpha
Cronbach (1951) developed a general formula for estimating internal consistency,
which he called ‘coefficient alpha’ and which is often referred to as ‘Cronbach’s alpha’:
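α = [k / (k - 1)] × [1 - (Σs² / s²x)]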

k : the number of items on the test


Σs² : the sum of the variances of the different parts of the test
s²x : the variance of the test scores
In the case of dichotomous items, Σs² = Σpq (the sum of the item variances).

The reliability of a set of test scores can be estimated on the basis of a single test
administration only if certain assumptions about the characteristics of the parts of the test are
satisfied. In using the Spearman-Brown split-half approach, we must assume that the two
halves are equivalent and independent of each other, while with the Guttman split-half, we
need only assume that the halves are independent of each other. In the Spearman-Brown
approach the reliability is underestimated to the extent that the halves are not equivalent. For
both, the reliability is overestimated to the extent that the halves are not independent of each
other. The Kuder-Richardson estimates based on item variances assume that the items in the
test are equivalent and independent of each other. As with the split-half estimates, the Kuder-
Richardson formulae underestimate reliability when the items are not equivalent, and
overestimate it when the items are not independent of each other. The assumptions that must
be satisfied in applying these estimates of internal consistency, and the effects of violating
these assumptions, are summarized in Table 6.1. In cases where either of these assumptions is
violated, as may be true for cloze and dictation tests, one of the other approaches to
reliability discussed below (parallel forms or test-retest) should be used.

Two approaches to estimating internal consistency have been discussed:


1. An estimate based on correlation (the Spearman-Brown split-half estimate)
2. Estimates which are based on ratios of the variances of parts of the test (halves or items)
to total test score variance (the Guttman split-half, the Kuder-Richardson formulae, and
coefficient alpha).
Furthermore, these two approaches generalize to the other methods for estimating reliability
within the classical model discussed below. In general, we will see that the variables that are
correlated (different ratings, separate administrations of the same test, parallel forms) in the
first approach also provide the basis for examining variance ratios in the second approach.

Rater consistency
In the case of a single rater, we need to be concerned about the consistency within that
individual’s ratings, or with intra-rater reliability. When there are several different raters, we
want to examine the consistency across raters, or inter-rater reliability. In both cases, the
primary causes of inconsistency will be either the application of different rating criteria to
different samples or the inconsistent application of the rating criteria to different samples.
A. Intra-rater reliability
In order to examine the reliability of ratings of a single rater, we need to obtain at least
two independent ratings from this rater for each individual language sample. This is typically
accomplished by rating the individual samples once and then re-rating them at a later time in a
different, random order. Once the two sets of ratings have been obtained, the reliability
between them can be estimated in two ways. One way is to treat the two sets of ratings as
scores from parallel tests and compute the appropriate correlation coefficient (commonly the
Spearman rank-order coefficient) between the two sets of ratings, interpreting this as an
estimate of reliability. In this approach it is essential that the test user determine the extent to
which the two sets of ratings constitute parallel tests. This is because a high positive
correlation can be obtained even though the two sets of ratings are quite different, as long as
the rank orders of the ratings are similar.
Another approach to examining the consistency of multiple ratings is to compute a
coefficient alpha, treating the independent ratings as different parts. If one rater assigns two
independent ratings to every individual in a group, a combined rating for each individual can
be obtained by adding, or summing, the two ratings for that individual. To estimate the
reliability of such ratings, we then compute coefficient alpha:
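α = 2 × [1 - ((s²r1 + s²r2) / s²r1+r2)]   (the general alpha formula with k = 2 parts)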

s²r1 and s²r2 : the variances of two ratings


s²r1+r2 : the variance of the summed ratings
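
As an illustration, the same computation can be sketched in Python; the two columns of ratings are invented purely for the example:

```python
import numpy as np

# Hypothetical example: two independent ratings of the same ten language samples
# by a single rater (e.g. on a 0-5 scale); the values are invented for illustration.
r1 = np.array([3, 4, 2, 5, 4, 3, 2, 4, 5, 3])
r2 = np.array([3, 5, 2, 4, 4, 3, 3, 4, 5, 2])

k = 2                                                        # number of parts (the two ratings)
sum_of_rating_variances = r1.var(ddof=1) + r2.var(ddof=1)    # s²r1 + s²r2
variance_of_summed_ratings = (r1 + r2).var(ddof=1)           # s²r1+r2

alpha = (k / (k - 1)) * (1 - sum_of_rating_variances / variance_of_summed_ratings)
print(f"coefficient alpha = {alpha:.3f}")
```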

B. Inter-rater reliability
In examining inter-rater consistency, we can use essentially the same approaches as were
discussed above with intra-rater consistency. We can compute the correlation between two
different raters and interpret this as an estimate of reliability. When more than two raters are
involved, however, rather than computing correlations for all different pairs, a preferable
approach is that recommended by Ebel (1979), in which we sum the ratings of the different
raters and then estimate the reliability of these summed ratings by computing a coefficient
alpha.
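With k raters, the same formula for coefficient alpha applies, treating each rater's set of ratings as one part of the test:

α = [k / (k - 1)] × [1 - (Σs²ri / s²sum)]

where Σs²ri is the sum of the variances of the individual raters' ratings and s²sum is the variance of the summed ratings (this notation simply extends the two-rating case above).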
A major practical advantage of internal consistency estimates of reliability is that they
can be made on the basis of a single test administration. In many testing situations there are
sources of error that are related to factors outside the test itself, such as differences in the
conditions of test administration, or in the test takers themselves, that are not addressed by
internal consistency estimates. Such is the case in situations where the same test may be
given more than once or where alternate forms of a test are used. For these reasons, other
ways of estimating reliability are needed.

Stability (test-retest reliability)


As indicated above, for tests such as cloze and dictation we cannot appropriately estimate
the internal consistency of the scores because of the interdependence of the parts of the test.
There are also testing situations in which it may be necessary to administer a test more than
once. For example, if a researcher were interested in measuring subjects’ language ability at
several different points in time, as part of a time-series design, she would like to rule out the
possibility that changes in observed test scores were a result of increasing familiarity with the
test. This might also be the concern of a language program evaluator who is interested in
relating changes in language ability to teaching and learning activities in the program. In
situations such as these, reliability can be estimated by giving the test more than once to the
same group of individuals. This approach to reliability is called the ‘test-retest’ approach,
and it provides an estimate of the stability of the test scores over time.

Equivalence (parallel forms reliability)


Another approach to estimating the reliability of a test is to examine the equivalence of
scores obtained from alternate forms of a test. Like the test-retest approach, this is an
appropriate means of estimating the reliability of tests for which internal consistency
estimates are either inappropriate or not possible. It is of particular interest in testing
situations where alternate forms of the test may be actually used, either for security reasons,
or to minimize the practice effect.
In order to estimate the reliability of alternate forms of a given test, the procedure used is
to administer both forms to a group of individuals. Since administering two alternate forms to
the same group of individuals will necessarily involve giving one form first and the other
second, there is a possibility that individuals may perform differently because of the order in
which they take the two forms. One way to minimize the possibility of an ordering effect is to
use a ‘counterbalanced’ design, in which half of the individuals take one form first and the
other half take the other form first.
The means and standard deviations for each of the two forms can then be computed and
compared to determine their equivalence, after which the correlation between the two sets of
scores can be computed. This correlation is then interpreted as an indicator of the equivalence
of the two tests, or as an estimate of the reliability of either one.

Summary of classical true score approaches to reliability


The particular approach that we use will depend on what we believe the sources of error are
in our measures, given the particular type of test, administrative procedures, types of test
takers, and the use of the test. Internal consistency estimates are concerned with sources of
error such as differences in test tasks and item formats, and inconsistencies within and among
scorers. Estimates of equivalence also examine such inconsistencies, not within a single form, but
across different forms of a test. Finally, stability estimates are concerned with inconsistencies
that may arise as a function of time, such as random changes in individuals’ health or state of
mind, or changes in administration procedures, such as differences in temperature, lighting,
timing, and audibility of administrators’ voices.
Suppose, for example, that we need to develop an oral interview that will measure individuals’
listening comprehension and speaking skills for use in admitting and placing students into a
language program. This test will also be given periodically to determine if students are ready
to begin job training that will be conducted in the target language. This job training will
involve listening to descriptions and explanations, and following instructions given by several
different trainers, as well as responding verbally to their questions. We must interview new
students for admission and placement every two weeks, as well as at the end of each unit of
instruction to determine if they are ready to begin job training. In order to minimize practice
effects, we will need several alternate sets of interview questions. We will also need to use
several different interviewers. In this situation, therefore, we need to be concerned with three
possible sources of error - inconsistency of questions, lack of equivalence of different sets of
questions, and lack of consistency among interviewer/raters.

Problems with the classical true score model


In many testing situations these apparently straightforward procedures for estimating
the effects of different sources of error are complicated by the fact that the different sources
of error may interact with each other, even when we carefully design our reliability study. In
the previous example, distinguishing lack of equivalence from interviewer inconsistency may
be problematic. Suppose we had four sets of questions. In an ideal situation, we might have
each interviewer administer all four sets of questions and give four separate ratings based on
these sets to each test taker. With six interviewers this would give us 24 ratings for each test
taker. We could compute a coefficient alpha for the ratings based on the four sets of questions
and interpret this as an estimate of equivalence reliability. Similarly, we could interpret the
coefficient alpha for the ratings of the different raters as indicators of inter-rater reliability.
There are two kinds of problems with the CTS model:
1. It treats error variance as homogeneous in origin. Each of the estimates of reliability
discussed above addresses one specific source of error, and treats other potential
sources either as part of that source, or as true score.
For example:
Equivalent forms estimates do not distinguish inconsistencies between the forms
from inconsistencies within the forms. In estimating the equivalence of forms,
therefore, any inconsistencies among the items in the forms themselves will lower our
estimate of equivalence. Thus, if two tests turn out to have poor equivalence, the test
user will not know whether this is because they are not equivalent or whether they are
not internally consistent, unless he examines internal consistency separately.
2. The CTS model considers all error to be random, and consequently fails to distinguish
systematic error from random error. Factors other than the ability being measured that
regularly affect the performance of some individuals and not others can be regarded
as sources of systematic error or test bias. The subject of test bias in general has been
dealt with extensively in the measurement literature, and several sources of bias in
language tests have also been investigated.

Generalizability Theory
Generalizability theory (G-theory) is grounded in the framework of factorial design
and the analysis of variance as a broad model for investigating the relative effects of different
sources of variance in test scores. It constitutes a theory and set of procedures for specifying
and estimating the relative effects of different factors on observed test scores, and thus
provides a means for relating the uses or interpretations to be made of test scores to the way
test users specify and interpret different factors as either abilities or sources of error.
The more reliable a sample of performance, or test score, is, the more generalizable
it is. Reliability, then, is a matter of generalizability, and the extent to which we can
generalize from a given score is a function of how we define the universe of measures. How
we define a given universe of measures will depend, in turn, upon the universe of
generalization - the decisions or inferences we expect to make on the basis of the test results.
There are two stages in the application of G-theory:
1. Considering the uses that will be made of the test scores, the test developer designs
and conducts a study to investigate the sources of variance that are of concern or
interest. This involves:
 Identifying the relevant sources of variance
 Designing procedures for collecting data
 Administering the test according to this design
 Conducting the appropriate analyses
2. On the basis of the results of this study, the test developer or test user then interprets
and uses the test scores, taking the estimated sources of variance into account. The
application of G-theory thus enables test developers and test users to specify the
different sources of variance that are of concern for a given test use, to estimate the
relative importance of these different sources simultaneously, and to employ these
estimates in the interpretation and use of test scores.
G-theory can thus be seen as an extension of CTS theory, since it can examine the different
sources of error variance discussed above. The CTS model is a special case of G-theory in
which there are only two sources of variance: a single ability and a single source of error.
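In CTS terms this corresponds to the decomposition s²x = s²t + s²e (observed score variance equals true score variance plus error variance); G-theory, by contrast, partitions observed score variance into components for persons and for each of the facets specified in the G-study.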

Universes of Generalization and Universes of Measures


When we want to develop or select a test, we generally know the use or uses for which it
is intended, and may also have an idea of what abilities we want to measure. This domain of
uses or abilities (or both) to which we want test scores to generalize is called the universe of
generalization. Within a given universe of generalization there is likely to be a wide range of
possible tests that could be used. In developing a test for a given situation, therefore, we need
to ask what types of test scores we would be willing to accept as indicators of the ability to be
measured for the purpose intended. That is, we need to define the universe of possible
measures that would be acceptable for our purposes.
We begin by identifying the various samples within a universe of possible measures that are
relevant to this particular testing context - the universe of generalization. This universe can be
defined in terms of specific characteristics, or facets, and the different conditions that may vary
within each facet. Thus, the facets and conditions we specify define a given universe, or set, of
possible measures. The facets and conditions for the example above are illustrated below.
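In outline form, the facets and conditions are:
Facet: testing technique - conditions: multiple-choice, dictation, oral interview
Facet: passage content (within multiple-choice and dictation) - conditions: economics, history, physics
Facet: rater (within the oral interview) - conditions: Wayne, Susan, Molly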

In this example, testing technique is a facet that has three conditions: multiple-choice,
dictation, and oral interview. Within the multiple-choice and dictation tests there is the facet of
passage content, with economics, history, and physics as conditions, while within the oral
interview there is the facet of rater, with three conditions: Wayne, Susan, and Molly. In order to
treat scores from these different techniques as comparable, we would use the information
obtained from our G-study to determine the comparability of scores obtained from the
different techniques, or the extent to which we can generalize
across techniques. Within the multiple-choice and dictation tests, we would need to determine
that passage content does not affect test takers' scores differentially. Or to put it another way,
we would have to be sure that a test score based on the economics passage would generalize
to performance on other passages. Finally, we would want to assure generalizability of ratings
across raters, which is a facet within the oral interview technique. In developing the test and
testing procedures, therefore, we would conduct a G-study to investigate the extent to which
the test score is generalizable across the different conditions within each of the facets that
define the possible measures for this particular testing context.

Populations of persons
In addition to defining the universe of possible measures, we must also define the group, or
population, of persons about whom we are going to make decisions or inferences. The way
in which we define this population will be determined by the degree of generalizability we
need for the given testing situation. The characteristics of the population of persons need to be
defined so that test users will know to which groups the results of the test will generalize.
Some typical characteristics are age, level of language ability (for example, beginning,
intermediate, advanced), and language use characteristics (for example, academic, vocational,
professional).

Universe score
A universe score xp is defined as the mean of a person's scores on all measures
from the universe of possible measures (this universe of possible measures being defined by
the facets and conditions of concern for a given test use). The universe score is thus the G-
theory analog of the CTS-theory true score. The variance of a group of persons' scores on all
measures would be equal to the universe score variance s²p, which is similar to CTS true
score variance in the sense that it represents that proportion of observed score variance that
remains constant across different individuals and different measurement facets and
conditions.
The universe score is different from the CTS true score, however, in that an individual
is likely to have different universe scores for different universes of measures. This
conceptualization of generalizability makes explicit the fact that a given estimate of
generalizability is limited to the specific universe of measures and population of persons
within which it is defined, and that a test score that is ‘true’ for all persons, times, and places
simply does not exist.

Generalizability coefficients
The G-theory analog of the CTS-theory reliability coefficient is the generalizability
coefficient, which is defined as the proportion of observed score variance that is universe
score variance:
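Generalizability coefficient = s²p / s²x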

s²p : universe score variance


s²x : observed score variance, which includes both universe score and error variance
