Pg. 179-195 Rahma
There is a way of estimating the average of all the possible split-half coefficients on the
basis of the statistical characteristics of the test items. This approach, which was developed
by Kuder and Richardson (1937), involves computing the means and variances of the items
that constitute the test. The mean of a dichotomous item (one that is scored as either right or
wrong) is the proportion, symbolized as p, of individuals who answer the item correctly. The
proportion of individuals who answer the item incorrectly is equal to 1 - p, and is
symbolized as q. The variance of a dichotomous item is the product of these two proportions,
or pq. The reliability coefficient provided by Kuder-Richardson formula 20 (KR-20) is based
on the ratio of the sum of the item variances to the total test score variance:

r = (k / (k - 1)) (1 - Σpq / s²_x)

where k is the number of items, Σpq is the sum of the item variances, and s²_x is the total
test score variance.
The Kuder-Richardson formula makes certain assumptions, specifically, that the items
are of nearly equal difficulty and independent of each other. If it happens that the items are of
equal difficulty, the reliability coefficient can be computed with a formula that is both
simpler and requires less information. This formula, Kuder-Richardson formula 21 (KR-21),
is as follows:

r = (k / (k - 1)) (1 - x̄(k - x̄) / (k s²_x))

where x̄ is the mean of the total test scores.
KR-21 will generally yield a reliability coefficient that is lower than that given by KR-20.
Like the Guttman split-half coefficient, the Kuder-Richardson formulae are based on total
score variance, and thus they do not require any correction for length.
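As an illustration (the data and function names below are invented, not from the text), both formulas can be computed directly from a matrix of dichotomous item scores:

```python
# Hypothetical sketch: KR-20 and KR-21 for dichotomously scored items.
# Rows are test takers, columns are items (1 = correct, 0 = incorrect).

def kr20(responses):
    """KR-20: k/(k-1) * (1 - sum of item variances pq / total score variance)."""
    k = len(responses[0])                      # number of items
    n = len(responses)                         # number of test takers
    totals = [sum(row) for row in responses]   # total score per person
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n   # proportion answering item i correctly
        sum_pq += p * (1 - p)                      # item variance pq
    return (k / (k - 1)) * (1 - sum_pq / var_t)

def kr21(responses):
    """KR-21: simpler formula, assumes all items are of equal difficulty."""
    k = len(responses[0])
    n = len(responses)
    totals = [sum(row) for row in responses]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - mean_t * (k - mean_t) / (k * var_t))

data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(kr20(data))
print(kr21(data))
```

For these items, KR-21 comes out lower than KR-20, as the text notes it generally will, because the items are not of equal difficulty.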
Coefficient alpha
Cronbach (1951) developed a general formula for estimating internal consistency,
which he called ‘coefficient alpha’, and which is often referred to as ‘Cronbach’s alpha’.
The reliability of a set of test scores can be estimated on the basis of a single test
administration only if certain assumptions about the characteristics of the parts of the test are
satisfied. In using the Spearman-Brown split-half approach, we must assume that the two
halves are equivalent and independent of each other, while with the Guttman split-half, we
need only assume that the halves are independent of each other. In the Spearman-Brown
approach the reliability is underestimated to the extent that the halves are not equivalent. For
both, the reliability is overestimated to the extent that the halves are not independent of each
other. The Kuder-Richardson estimates based on item variances assume that the items in the
test are equivalent and independent of each other. As with the split-half estimates, the Kuder-
Richardson formulae underestimate reliability when the items are not equivalent, and
overestimate it when the items are not independent of each other. The assumptions that must
be satisfied in applying these
estimates of internal consistency and the effects of violating these assumptions are
summarized in Table 6.1. In cases where either of the assumptions is violated, as may be true
for cloze and dictation tests, one of the other approaches (parallel forms or test-retest) to
reliability discussed below should be used.
Rater consistency
In the case of a single rater, we need to be concerned about the consistency within that
individual’s ratings, or with intra-rater reliability. When there are several different raters, we
want to examine the consistency across raters, or inter-rater reliability. In both cases, the
primary causes of inconsistency will be either the application of different rating criteria to
different samples or the inconsistent application of the rating criteria to different samples.
A. Intra-rater reliability
In order to examine the reliability of ratings of a single rater, we need to obtain at least
two independent ratings from this rater for each individual language sample. This is typically
accomplished by rating the individual samples once and then re-rating them at a later time in a
different, random order. Once the two sets of ratings have been obtained, the reliability
between them can be estimated in two ways. One way is to treat the two sets of ratings as
scores from parallel tests and compute the appropriate correlation coefficient (commonly the
Spearman rank-order coefficient) between the two sets of ratings, interpreting this as an
estimate of reliability. In this approach it is essential that the test user determine the extent to
which the two sets of ratings constitute parallel tests. This is because a high positive
correlation can be obtained even though the two sets of ratings are quite different, as long as
the rank orders of the ratings are similar.
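As a sketch of this first approach (the ratings below are invented), the Spearman rank-order coefficient can be computed with nothing more than ranks and a Pearson correlation on those ranks:

```python
# Hypothetical sketch: intra-rater reliability as the Spearman rank-order
# correlation between two sets of ratings by the same rater.

def ranks(values):
    """Assign ranks (1 = lowest), giving tied values their average rank."""
    ordered = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0.0] * len(values)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average rank for the tie group
        for idx in ordered[i:j + 1]:
            rank[idx] = avg
        i = j + 1
    return rank

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

first = [4, 3, 5, 2, 4, 1]    # rater's first pass over six language samples
second = [5, 3, 5, 2, 4, 2]   # same rater's second, independent pass
print(spearman(first, second))
```

Note that this coefficient depends only on rank order, which is exactly the caveat raised above: two quite different sets of ratings can still correlate highly.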
Another approach to examining the consistency of multiple ratings is to compute a
coefficient alpha, treating the independent ratings as different parts. If one rater assigns two
independent ratings to every individual in a group, a combined rating for each individual can
be obtained by adding, or summing, the two ratings for that individual. To estimate the
reliability of such ratings, we then compute coefficient alpha:

α = (k / (k - 1)) (1 - Σs²_i / s²_x)

where k is the number of ratings (here, two), Σs²_i is the sum of the variances of the separate
ratings, and s²_x is the variance of the summed ratings.
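A minimal sketch of this computation, with invented ratings (two independent ratings of five language samples by one rater):

```python
# Hypothetical sketch: coefficient alpha treating two independent ratings
# by the same rater as the two "parts" of a summed composite score.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def coefficient_alpha(parts):
    """alpha = k/(k-1) * (1 - sum of part variances / composite variance)."""
    k = len(parts)
    composites = [sum(scores) for scores in zip(*parts)]   # summed rating per person
    part_var = sum(variance(p) for p in parts)
    return (k / (k - 1)) * (1 - part_var / variance(composites))

rating_1 = [4, 3, 5, 2, 4]   # first rating of five language samples
rating_2 = [5, 3, 4, 2, 4]   # second, independent rating of the same samples
print(coefficient_alpha([rating_1, rating_2]))
```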
B. Inter-rater reliability
In examining inter-rater consistency, we can use essentially the same approaches as were
discussed above for intra-rater consistency. We can compute the correlation between two
different raters and interpret this as an estimate of reliability. When more than two raters are
involved, however, rather than computing correlations for all different pairs, a preferable
approach is that recommended by Ebel (1979), in which we sum the ratings of the different
raters and then estimate the reliability of these summed ratings by computing a coefficient
alpha.
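A sketch of this summed-ratings approach, using invented scores from three hypothetical raters:

```python
# Hypothetical sketch of the Ebel-style approach: sum the raters' ratings and
# estimate the reliability of the summed ratings with coefficient alpha,
# treating each rater as one part of the composite.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def alpha_across_raters(ratings_by_rater):
    k = len(ratings_by_rater)                          # number of raters
    summed = [sum(r) for r in zip(*ratings_by_rater)]  # summed rating per person
    rater_var = sum(variance(r) for r in ratings_by_rater)
    return (k / (k - 1)) * (1 - rater_var / variance(summed))

rater_a = [4, 3, 5, 2, 4, 3]   # six language samples, three raters
rater_b = [5, 3, 4, 2, 5, 3]
rater_c = [4, 4, 5, 1, 4, 2]
print(alpha_across_raters([rater_a, rater_b, rater_c]))
```

This avoids computing and somehow averaging a separate correlation for every pair of raters.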
A major practical advantage of internal consistency estimates of reliability is that they
can be made on the basis of a single test administration. In many testing situations there are
sources of error that are related to factors outside the test itself, such as differences in the
conditions of test administration, or in the test takers themselves, that are not addressed by
internal consistency estimates. Such is the case in situations where the same test may be
given more than once or where alternate forms of a test are used. For these reasons, other
ways of estimating reliability are needed.
Generalizability Theory
Generalizability theory (G-theory) is grounded in the framework of factorial design
and the analysis of variance as a broad model for investigating the relative effects of different
sources of variance in test scores. It constitutes a theory and set of procedures for specifying
and estimating the relative effects of different factors on observed test scores, and thus
provides a means for relating the uses or interpretations to be made of test scores to the way
test users specify and interpret different factors as either abilities or sources of error.
The more reliable the sample of performance, or test score, is, the more generalizable
it is. Reliability, then, is a matter of generalizability, and the extent to which we can
generalize from a given score is a function of how we define the universe of measures. How
we define a given universe of measures will depend upon the universe of generalization - the
decisions or inferences we expect to make on the basis of the test results.
There are two stages in the application of G-theory:
1. Considering the uses that will be made of the test scores, the test developer designs
and conducts a study to investigate the sources of variance that are of concern or
interest. It involves:
Identifying the relevant sources of variance
Designing procedures for collecting data
Administering the test according to this design
Conducting the appropriate analyses
2. The estimates obtained from this study are then applied to the given test use: they enable
test developers and test users to specify the different sources of variance that are of
concern for that use, to estimate the relative importance of these different sources
simultaneously, and to employ these estimates in the interpretation and use of test scores.
G-theory can thus be seen as an extension of CTS theory, since it can examine the different
sources of error variance discussed above. The CTS model is a special case of G-theory in
which there are only two sources of variance: a single ability and a single source of error.
In this example, testing technique is a facet that has three conditions: multiple-choice,
dictation, oral interview. Within the multiple-choice and dictation there is the facet of
passage content, with economics, history, and physics as conditions, while within the oral
interview is the facet of rater, with three conditions, Wayne, Susan, and Molly. In order to do
this we would use the information obtained from our G-study to determine the comparability
of scores obtained from the different techniques, or the extent to which we can generalize
across techniques. Within the multiple-choice and dictation tests, we would need to determine
that passage content does not affect test takers' scores differentially. Or to put it another way,
we would have to be sure that a test score based on the economics passage would generalize
to performance on other passages. Finally, we would want to assure generalizability of ratings
across raters, which is a facet within the oral interview technique. In developing the test and
testing procedures, therefore, we would conduct a G-study to investigate the extent to which
the test score is generalizable across the different conditions within each of the facets that
define the possible measures for this particular testing context.
Populations of persons
In defining the universe of possible measures, we must define the group, or
population of persons about whom we are going to make decisions or inferences. The way
in which we define this population will be determined by the degree of generalizability we
need for the given testing situation. The characteristics of the population of persons need to
be defined so that test users will know to which groups the results of the test will generalize.
Some typical characteristics are age, level of language ability (for example, beginning,
intermediate, advanced), and language use characteristics (for example, academic, vocational,
professional).
Universe score
A universe score x_p is thus defined as the mean of a person's scores on all measures
from the universe of possible measures (this universe of possible measures being defined by
the facets and conditions of concern for a given test use). The universe score is thus the G-
theory analog of the CTS-theory true score. The variance of a group of persons' scores on all
measures would be equal to the universe score variance s²p, which is similar to CTS true
score variance in the sense that it represents that proportion of observed score variance that
remains constant across different individuals and different measurement facets and
conditions.
The universe score is different from the CTS true score, however, in that an individual
is likely to have different universe scores for different universes of measures. This
conceptualization of generalizability makes explicit the fact that a given estimate of
generalizability is limited to the specific universe of measures and population of persons
within which it is defined, and that a test score that is ‘true’ for all persons, times, and places
simply does not exist.
Generalizability coefficients
The G-theory analog of the CTS-theory reliability coefficient is the generalizability
coefficient, which is defined as the proportion of observed score variance that is universe
score variance:

Eρ² = s²_p / s²_x

where s²_p is the universe score variance and s²_x is the observed score variance (universe
score variance plus error variance).
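As a concrete sketch (with invented scores), a generalizability coefficient for a one-facet, fully crossed persons-by-raters design can be estimated from the usual ANOVA mean squares:

```python
# Hypothetical sketch: generalizability coefficient for a one-facet design in
# which every person is rated by every rater. Variance components are estimated
# from ANOVA mean squares in the standard way.

def g_coefficient(scores):
    """scores[p][r] = rating of person p by rater r."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]

    # Mean square for persons and for the person-by-rater residual.
    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ss_res = sum((scores[p][r] - p_means[p] - r_means[r] + grand) ** 2
                 for p in range(n_p) for r in range(n_r))
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    var_p = max((ms_p - ms_res) / n_r, 0.0)   # universe score variance
    var_delta = ms_res / n_r                  # relative error variance
    return var_p / (var_p + var_delta)

ratings = [          # five persons rated by three raters
    [4, 5, 4],
    [3, 3, 4],
    [5, 4, 5],
    [2, 2, 1],
    [4, 5, 4],
]
print(g_coefficient(ratings))
```

The coefficient answers the question posed above: how far a score obtained under these raters generalizes across the rater facet.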