
Language Test Reliability

A test should have three qualities:

Reliability: the same results under the same conditions, no matter who takes the test or where and when it is taken
Validity: the test measures what it is intended to measure
Usability or Practicality: not too difficult to construct, administer, or score; practical to use

Variance measures how far a set of numbers is spread out.

Variance of Zero: all values are identical

Small Variance: values lie close to the mean (the expected value)
High Variance: values are spread out, far from the mean

Sources of Variance

 Meaningful Variance

Variance that is related to the purposes of the test

To achieve this, the items should be related to the purpose of the designed test and to the students’ knowledge of the topic; meaningful variance on a test will therefore be defined here as the variance that is directly attributable to the testing purposes.

 Measurement error, or Error Variance

Variance that is generated by other, extraneous sources

Potential sources of meaningful test variance for communicative competence

COMPONENTS OF LANGUAGE COMPETENCE:

Organizational Competence

Grammatical Competence

Vocabulary

Morphology

Syntax

Phonology/Graphology

Textual Competence

Cohesion

Rhetorical organization

Pragmatic Competence

Illocutionary Competence

Ideational functions

Manipulative functions

Heuristic functions

Imaginative functions

Sociolinguistic Competence

Sensitivity to differences in dialect or variety

Sensitivity to differences in register

Sensitivity to naturalness

Ability to interpret cultural references and figures of speech

Types of issues related to error variance

1. Variance due to the environment: The first potential source of measurement error shown in Table 8.2 is the environment in which the test is administered. The very location of the test administration can be one source of measurement error if it affects the performance of the students. Consider, for instance, the possible effects of administering a test to a group of students in a library with people quietly talking nearby, as opposed to administering it in a quiet auditorium that contains only examinees and proctors. Indeed, lighting, ventilation, weather, or any other environmental factors can serve as potential sources of measurement error if they affect the students’ performances on a test.
2. Variance due to the administration procedure: Another potential source of measurement error involves the procedures that are used to administer the test. For instance, if the directions for filling out the answer sheets or for doing the actual test are not clear, score variance may be created that has nothing to do with the purpose of the test. If the results from several administrations are to be combined and the directions are inconsistent from administration to administration, another source of measurement error will exist. Likewise, if the quality of the equipment and the timing are not the same each time a test is administered, sources of measurement error are being created. Indeed, any issues related to the mechanics of testing may inadvertently become sources of measurement error. Again, careful attention to the checklist shown in Table 2.5 should help to minimize the effects of administration procedures as a source of error variance.
3. Variance due to examinees: A large number of potential sources of error variance are directly related to the condition of the students when they take the test. The sources include physical characteristics like differences among students in their fatigue, health, hearing, or vision. For example, if five students in a class are coming down with the flu at the time that they are taking a test, their poor physical health may be a variable that should be considered as a potential source of measurement error. In addition, just by chance, through classes or life experience, some of the students may have topic knowledge that will help them with certain of the questions on a test in a way that is not related to the purpose of the test.
4. Variance due to the scoring procedure: Errors made in scoring, as well as the subjective nature of some scoring procedures (for example, rating essays or interviews), can create score variance that is unrelated to the purpose of the test.
5. Variance due to the test and test items: The last general source of measurement error is the test itself and its items. The type of items chosen can also be an issue if that type is new to some of the students or is a mismatch with the purpose of the test. The number of items used on a test is also a potential source of measurement error: if only a small number of items is used, the measurement will not be as accurate as it would be with a larger number of items. Once that premise is accepted, differences in the accuracy of measurement for other numbers of items simply become a matter of degree. The quality of the items can also become a source of measurement error if that quality is poor or uneven. To minimize the effects of the test itself and the test items on measurement error, testers should use Tables 2.4, 2.5, and 3.1 to 3.3 as carefully as possible.

All the foregoing sources of measurement error may be affecting students’ scores on any given test. Such effects are undesirable because they create variance in the students’ scores that is unrelated to the purpose of the test. In the remainder of this chapter, I will cover ways of estimating the effects of error variance on the overall variance in a set of test scores. Knowing about the relative reliability of a test can help me decide the degree to which I should be concerned about all the potential sources of measurement error presented in Table 8.2.

Reliability of NRTs

In general, test reliability is defined as the extent to which the results can be considered
consistent or stable. For example, if language teachers administer a placement test to
their students on one occasion, they would like the scores to be very much the same if
they were to administer the same test again one week later. Since most language
teachers are responsible language professionals, they want the placement of their
students to be as accurate and consistent as possible so they can responsibly serve their
students’ language learning needs. The degree to which a test is consistent, or reliable,
can be estimated by calculating a reliability coefficient .

A reliability coefficient is like a correlation coefficient in that it can go as high as +1.00 for a perfectly reliable test. But the reliability coefficient differs from a correlation coefficient in that it can only go as low as 0.00, because a test cannot logically have less than zero reliability. A reliability coefficient of .91, for example, indicates that a test is 91% consistent, or reliable, with 9% measurement error, or random variance.

1. Test-retest Reliability

Of the three basic reliability strategies, test-retest reliability is the one most appropriate for estimating the stability of a test over time. The first step in this strategy is to administer whatever test is involved twice to the same group of students. The testing sessions should be far enough apart in time that students are not likely to remember the items on the test, yet close enough together that the students have not changed in any fundamental way. The scores from the two administrations are then correlated, and the resulting coefficient can be interpreted as the percent of reliable variance on the test.

How is reliability measured?

Reliability is measured by comparing two sets of scores for a single assessment (such as two rater scores for the same person, or scores from two administrations of the same test). Once we have two sets of scores for a group of students, we can determine how similar they are by computing a statistic known as the reliability coefficient.
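As a minimal sketch of this computation, the following Python lines correlate two hypothetical sets of scores for the same six students (for example, scores from two administrations, or from two raters); the score values are invented purely for illustration.

import numpy as np

# Hypothetical scores for the same six students from two administrations (or two raters)
scores_first = [78, 65, 90, 55, 82, 70]
scores_second = [75, 68, 88, 58, 85, 72]

# The Pearson correlation between the two sets of scores serves as the reliability estimate
reliability = np.corrcoef(scores_first, scores_second)[0, 1]
print(round(reliability, 2))

A value near +1.00 would indicate that the two sets of scores rank the students in nearly the same way.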
2. Equivalent-Forms Reliability:

Situation: Testing of same people on different but comparable forms of the test. (Forms
A & B)

Procedure: Correlate the scores from the two forms, which yields a coefficient of equivalence.

Meaning: the consistency of response to different item samples (where testing is immediate) and across occasions (where testing is delayed).

Appropriate use: to provide information about the equivalence of forms.

3. Internal Consistency Reliability:

This approach is very similar to the equivalent-forms technique except that, in this case,
the equivalent forms are created from the single test being analyzed by dividing it into
two equal parts. The test is usually split on the basis of odd- and even-numbered items.
The odd-numbered and even-numbered items are scored separately as though they were
two different forms. A correlation coefficient is then calculated for the two sets of scores.

If all other things are held constant, a longer test will usually be more reliable than a
short one, and the correlation calculated between the odd-numbered and even-numbered items must therefore be adjusted to provide a coefficient that represents the
full-test reliability. This adjustment of the half-test correlation to estimate the full-test
reliability is accomplished by using the Spearman-Brown prophecy formula:
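In its usual split-half form, the Spearman-Brown prophecy formula is

r_{full} = \frac{2\, r_{half}}{1 + r_{half}}

where r_{half} is the correlation between the two half-test scores. A half-test correlation of .60, for example, prophesies a full-test reliability of 2(.60)/(1 + .60) = .75.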

Internal Consistency Strategies

All items in the test should be homogeneous, and there should be a relationship among them.
 Split – Half Reliability

In split-half reliability, we randomly divide all the items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate is simply the correlation between these two total scores.
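The following Python sketch illustrates the odd/even split-half procedure together with the Spearman-Brown adjustment described above; the small item-response matrix is invented purely for illustration (rows are examinees, columns are items scored 1 for correct and 0 for incorrect).

import numpy as np

# Hypothetical item responses: 5 examinees x 6 items (1 = correct, 0 = incorrect)
responses = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1],
])

odd_half = responses[:, 0::2].sum(axis=1)    # items 1, 3, 5 scored as one half
even_half = responses[:, 1::2].sum(axis=1)   # items 2, 4, 6 scored as the other half

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # half-test correlation
r_full = (2 * r_half) / (1 + r_half)              # Spearman-Brown adjustment to full-test length
print(round(r_full, 2))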

 Cronbach Alpha

k = the number of items for which we want to estimate the reliability, divided by the number of items for which we already have a reliability estimate.

Cronbach’s alpha is used when the item scores are other than 0 and 1 (for example, Likert-scale items). It is advisable for essay items, problem-solving tasks, and 5-point scaled items. It is based on two or more parts of the test and requires only one administration of the test.
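For reference, the standard statement of Cronbach's alpha (in which k is simply the number of items) is

\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} s_i^2}{s_X^2}\right)

where s_i^2 is the variance of scores on item i and s_X^2 is the variance of the total test scores.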

 Kuder-Richardson Formulas

Kuder and Richardson believed that all items in a test are designed to measure a single trait. K-R21 is the most practical, most frequently used, and most convenient method of estimating reliability.

K-R20: most advisable if the p values vary a lot.

K-R21: most advisable if the items do not vary much in difficulty, i.e., the p values are more or less similar.
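For reference, the two formulas are commonly written as

K\text{-}R20 = \frac{k}{k - 1}\left(1 - \frac{\sum p_i q_i}{s^2}\right) \qquad K\text{-}R21 = \frac{k}{k - 1}\left(1 - \frac{M(k - M)}{k\, s^2}\right)

where k is the number of items, p_i is the proportion of examinees answering item i correctly, q_i = 1 - p_i, M is the mean of the total scores, and s^2 is the variance of the total scores. K-R21 avoids item-level statistics by assuming that the items are of roughly equal difficulty, which is why it is advisable only when the p values are similar.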

Reliability of Rater Judgments

Two other types of reliability may be necessary in language testing situations where raters make judgments and give scores for the language produced by students. Raters usually are necessary when testing students’ productive skills (speaking and writing), as in composition, oral interviews, role plays, etc. Testers most often rely on inter-rater and intra-rater reliabilities in such situations.

Inter-rater Reliability

Having a sample of test papers (essays) scored independently by two examiners.

Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or materials demonstrate knowledge of the construct or skill being assessed.

Intra-rater Reliability

The degree of stability observed when a measurement is repeated under identical conditions by the same rater.

NB: Intra-rater reliability makes it possible to determine the degree to which the results
obtained by a measurement procedure can be replicated.

Standard Error of Measurement

For any test, the higher the reliability estimate, the lower the error.
The standard error of measurement is the average standard deviation of the error variance over the number of people in the sample.
We never know the true score.
All test scores contain some error.
The SEM can be used to estimate a range within which a true score would likely fall.
The higher the reliability, the lower the standard error of measurement, and hence the greater the confidence that we can put in the accuracy of an individual’s test score.
We may determine the likelihood of the true score being within those limits by knowing the S.E.M. and by understanding the normal curve.
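These points are consistent with the usual formula for the standard error of measurement,

SEM = S\sqrt{1 - r_{xx'}}

where S is the standard deviation of the observed scores and r_{xx'} is the reliability estimate. For example, if S = 10 and the reliability is .91, then SEM = 10 \times \sqrt{.09} = 3; a student who scores 70 would have a true score somewhere between 67 and 73 (plus or minus one SEM) roughly 68% of the time, by the logic of the normal curve.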
FACTORS AFFECTING THE RELIABILITY OF NRTS

A number of factors affect the reliability of any norm-referenced test (see Tables 8.1 and 8.2). Some of these factors are more directly within the control of testers than are other factors. However, language test developers and users must realize that, if all other factors are held constant, the following statements are usually true:
1. A longer test tends to be more reliable than a short one;
2. A well-designed and carefully-written test tends to be more reliable
than a shoddy one;
3. A test made up of items that assess similar language material tends to
be more reliable than a test that assesses a wide variety of material;
4. A test with items that discriminate well tends to be more reliable than a
test with items that do not discriminate well;
5. A test that is well-centered and disperses the scores efficiently (that is,
a test that produces normally-distributed scores) tends to be more
reliable than a test that has a skewed distribution;
6. A test that is administered to a group of students with a wide range of
abilities tends to be more reliable than a test administered to a group
with a narrow range of abilities.
As noted previously, CRTs will not necessarily produce normal distributions, especially
if they are functioning correctly. On some occasions, such as at the beginning of
instruction, CRTs may produce normal distributions, but the tester cannot count on the
normal distribution as part of the strategy for demonstrating the reliability of a CRT.
This is quite the opposite of the goals and results when developing a good NRT, which
ideally should approximate a normal distribution of scores to the greatest extent
possible. Popham and Husek questioned the appropriateness of using correlational strategies for estimating the reliability of CRTs, because such analyses all depend in one way or another on a normal distribution and a large standard deviation.

Consider the test-retest and equivalent-forms strategies: both depend on correlation coefficients, which in turn require a reasonable spread of scores to be meaningful. A quick glance back at the K-R20 and K-R21 formulas will also indicate that, as the standard deviation goes down relative to all other factors, so do these internal-consistency estimates. In short, all the
strategies for reliability discussed in Chapter 8 are fine for NRTs because they are very
sensitive to the magnitude of the standard deviation, and a relatively high standard
deviation is one result of developing a norm-referenced test that effectively spreads
students out into a normal distribution. However, those same reliability strategies may
be quite inappropriate for CRTs because CRTs are not developed for the purpose of
producing variance in scores.

Note that the terms agreement and dependability are used with reference to CRTs in lieu of the term reliability. In this book, the terms agreement
and dependability are used exclusively for estimates of the consistency of CRTs, while
the term reliability is reserved for NRT consistency estimates. This distinction helps
teachers and testers keep the notions of NRT reliability separate from the ideas of CRT
agreement and dependability. The agreement coefficient provides an estimate of the
proportion of students who have been consistently classified as masters and non-
masters on two administrations of a CRT. To apply this approach, the test should be
administered twice, such that enough time has been allowed between the
administrations for students to forget the test, but not so much time that they have
learned any substantial amount.
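For a master/non-master decision made on two administrations, the agreement coefficient is commonly written as

\hat{p}_o = \frac{A + D}{N}

where A is the number of students classified as masters on both administrations, D is the number classified as non-masters on both, and N is the total number of students; the labels A, D, and N here are illustrative rather than the source's own notation.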

Kappa Coefficient

The kappa coefficient (κ) was developed to adjust for this problem of a chance lower limit by estimating the proportion of consistency in classifications beyond that which would occur by chance alone. The adjustment is given in the following formula:
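In its usual notation (the symbols here follow the description above rather than the source), the adjustment is

\kappa = \frac{p_o - p_{chance}}{1 - p_{chance}}

where p_o is the observed proportion of agreement (the agreement coefficient) and p_chance is the proportion of agreement that would be expected by chance alone.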
The kappa coefficient is an estimate of the classification agreement that occurred
beyond what would be expected by chance alone and can be interpreted as a percentage
of agreement by moving the decimal two places to the right. Since kappa represents the
percentage of classification agreement beyond chance, it is usually lower than the
agreement coefficient. Like the agreement coefficient, it has an upper limit of 1.00, but
unlike the agreement coefficient with its chance lower limit, the kappa coefficient has
the more familiar lower limit of .00.

Estimating threshold loss agreement from a single test administration

Once the tester has the standardized cutpoint score and an internal-consistency
reliability estimate in hand, it is just a matter of checking the appropriate table. In either
table, you can find the value of the respective coefficient by looking in the first column
for the z value closest to the obtained value, and scanning across that row until reaching
the column headed by the reliability coefficient closest to the observed reliability value.
Where the row for the z value meets the column for the reliability coefficient, an
approximate value is given for the threshold agreement of the CRT in question.

Squared-error Loss Agreement Approaches

Only the phi dependability index is presented here because it is the only squared-
error loss agreement index that can be estimated using a single test administration, and
because Brennan has provided a short-cut formula for calculating this index that can be
based on raw score test statistics:
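One common statement of that short-cut formula, consistent with the step-by-step spreadsheet calculation given later in this section, is

\Phi(\lambda) = 1 - \frac{1}{k - 1} \cdot \frac{M_p(1 - M_p) - S_p^2}{(M_p - \lambda)^2 + S_p^2}

where k is the number of items, M_p and S_p^2 are the mean and variance of the examinees' proportion scores, and \lambda is the cut-point expressed as a proportion.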

Domain Score Dependability

All the threshold loss and squared-error loss agreement coefficients described
previously have been criticized because they are dependent in one way or another on the
cut-score. Alternative approaches, called domain score estimates of dependability, have
the advantage of being independent of the cut-score. However, in principle, they apply
to domain-referenced interpretations rather than to all criterion-referenced
interpretations. Domain-referenced tests (DRTs) are defined here as a type of CRT that
is distinguished primarily by the ways in which items are sampled. For DRTs, the items
are sampled from a general, but well-defined, domain of behaviors (e.g., overall business
English ability), rather than from individual course objectives (e.g., the course objectives
of a specific intermediate level business English class), as is often the case in what might
be called objectives-referenced tests (ORTs). The results on a DRT can therefore be used
to describe a student’s status with regard to the domain in a manner similar to the way
in which ORT results are used to describe the student’s status on small subtests for each
course objective.

CONFIDENCE INTERVALS

One last statistic in this section on CRT dependability is the confidence interval (CI). The
CI functions for CRTs in a manner analogous to the standard error of measurement
(SEM) that I described in Chapter 8 for NRTs.
The Phi(lambda) Coefficient

Step-by-step, the spreadsheet formula for the Phi(lambda) coefficient says:

1. Begin those calculations working to the right of the first parenthesis by dividing 1 by
the isolated result of the number of items (AF36) minus one, and isolate that result in
parentheses.

2. Then multiply the mean of the proportion scores (AH32) times the isolated result of 1
minus the mean of the proportion score (AH32) and isolate the result in parentheses.

3. Subtract the result of Step 2 minus the variance of the proportion scores (AH34) and
isolate that result in parentheses.

4. Then subtract the mean of the proportion scores (AH32) minus the cut-point (AH40)
and isolate the result in parentheses.

5. Multiply the result of Step 4 times itself and isolate the result in parentheses.

6. Add the result of Step 5 to the variance of the proportion scores (AH34) and isolate
the result in parentheses.

7. Divide Step 3 by Step 6 and isolate the result in parentheses.

8. Multiply the result of Step 1 times the result of Step 7, and isolate the result in
parentheses.

9. Subtract 1 minus the result of Step 8 and hit enter.

10. The final result of .8247101 shown in Cell AH42 of Screen 9.3 can now be rounded to .82. Naturally, you will want to save these results, probably under a new file name, so you don’t lose them if something goes wrong with your computer.
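Those ten steps can also be expressed as a short Python sketch; the function and variable names (phi_lambda, proportion_scores, cut_point) are illustrative rather than taken from the source, and the variance is computed here as a population variance (substitute statistics.variance if a sample variance is wanted instead).

import statistics

def phi_lambda(proportion_scores, k, cut_point):
    # proportion_scores: each examinee's raw score divided by the number of items (k)
    m = statistics.mean(proportion_scores)        # mean of the proportion scores (AH32)
    s2 = statistics.pvariance(proportion_scores)  # variance of the proportion scores (AH34)
    numerator = m * (1 - m) - s2                  # Steps 2-3
    denominator = (m - cut_point) ** 2 + s2       # Steps 4-6
    return 1 - (1 / (k - 1)) * (numerator / denominator)  # Steps 1, 7-9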

The Phi Coefficient and Confidence Interval

Step-by-step, the calculations for the Phi(top) value are:

1. Begin by multiplying the number of examinees, 30 in this case, times the variance of
the proportion scores (AH34), and isolate the result in parentheses.

2. Then subtract 1 from the number of examinees and isolate the result in parentheses.

3. Divide the result of Step 1 by the result of Step 2 and isolate the result in parentheses.
4. Multiply the result of Step 3 times the K-R20 (with seven places to the right of the
decimal in AH38) and hit the enter key to get the Phi(top) result in Cell AH43.

Next, I calculate the Phi(error) in its linear algebra equivalent as shown in Cell AH44 of
Screen 9.2, which shows =((AH32*(1-AH32))-AH34)/(AF36-1). Step-by-step, the
calculations for the Phi(error) value are:

1. Multiply the mean of the proportion scores (AH32) times the isolated result of 1
minus the mean of the proportion score (AH32).

2. Subtract the result of Step 1 minus the variance of the proportion scores (AH34) and
isolate that result in parentheses.

3. Then divide the result of Step 2 by the isolated result of the number of items (AF36)
minus one and hit the enter key to get the value of the Phi(error).

To calculate the phi coefficient and CI, I begin by labeling the two in Cells AG45 and
AG46, respectively. I calculate the phi coefficient in Cell AH45 by dividing the Phi(top)
(AH43) by the isolated result of the Phi(top) (AH43) plus the Phi(error) (AH44) and
hitting the enter key using the following: =AH43/(AH43+AH44). As shown in Screen 9.2, to calculate the CI in Cell AH46, I simply take the square root of the Phi(error) and hit the enter key using the following: =SQRT(AH44). The results of all these calculations for Phi(top), Phi(error), phi, and CI are shown in Cells AH43 to AH46 in Screen 9.3, which are very similar to the results obtained from the formulas in the text above. The slight differences are very minor and are due to rounding.
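The same calculations can be sketched in Python as follows; the function and argument names are illustrative, and the inputs correspond to the quantities in the cells named above (number of examinees, number of items, mean and variance of the proportion scores, and the K-R20 estimate).

import math

def phi_and_ci(n_examinees, k_items, mean_p, var_p, kr20):
    phi_top = (n_examinees * var_p / (n_examinees - 1)) * kr20    # Phi(top), Steps 1-4 above
    phi_error = (mean_p * (1 - mean_p) - var_p) / (k_items - 1)   # =((AH32*(1-AH32))-AH34)/(AF36-1)
    phi = phi_top / (phi_top + phi_error)                         # =AH43/(AH43+AH44)
    ci = math.sqrt(phi_error)                                     # =SQRT(AH44)
    return phi, ci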

FACTORS AFFECTING THE CONSISTENCY OF CRTS

1. A longer test will tend to be more consistent than a short one;

2. A well-designed and carefully-written test will tend to be more consistent than a shoddy one;

3. A test made up of items that test similar language material will tend to be more
consistent than a test assessing a wide variety of material;

4. A test with items that have relatively high difference indexes, or B-indexes, will tend
to be more consistent than a test with items that have low ones;

5. A test that is clearly related to the objectives of instruction will tend to be more
consistent than a test that is not obviously related to what the students have learned.
