
CHAPTER 9.

RELIABILITY
Reliability – refers to the consistency of scores obtained by the same
person when re-examined with the same test on different occasions, with
different sets of equivalent items, or under other variable examining
conditions.
This mainly refers to the attribute of consistency in measurement.
Remember…
 Measurement error is common in all fields of science.
 Tests that are relatively free of measurement error are considered to be
reliable while tests that contain relatively large measurement error are
considered to be unreliable.

Classical Test Score Theory


This assumes that each person has a true score that would be obtained
if there were no errors in measurement.
Measuring instruments are imperfect; therefore, the observed score for
each person almost always differs from the person’s true ability or
characteristic.
Measurement error – the difference between the observed score and the
true score.

E = X - T
where E is the measurement error, X is the observed score, and T is the true score.
Standard error of measurement – the standard deviation of the
distribution of errors for each repeated application of the same test on
an individual.
Error (E) can either be positive or negative. If E is positive, the Obtained
Score (X) will be higher than the True Score (T); if E is negative, then X
will be lower than T.
 Although it is impossible to eliminate all measurement error, test
developers do strive to minimize psychometric nuisance through careful
attention to the sources of measurement error.
 It is important to stress that true score is never known.
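As a rough illustration of these ideas, the sketch below simulates many hypothetical re-administrations of a test to one examinee under classical test theory, with the observed score X equal to the true score T plus a random error E. The true score of 100 and the standard error of measurement of 5 are invented for the example; in real testing, T is never observed.

```python
import random

random.seed(1)

T = 100    # the examinee's (unknowable) true score, set here only for simulation
SEM = 5    # assumed standard error of measurement

# Simulate 1,000 hypothetical administrations: X = T + E, with E drawn at random
observed = [T + random.gauss(0, SEM) for _ in range(1000)]
errors = [x - T for x in observed]          # E = X - T

mean_x = sum(observed) / len(observed)
sd_errors = (sum(e ** 2 for e in errors) / len(errors)) ** 0.5

print(f"Mean observed score: {mean_x:.1f}")      # hovers around the true score of 100
print(f"SD of the errors:    {sd_errors:.1f}")   # approximates the SEM of 5
```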
Factors that contribute to consistency:

 These consist entirely of those stable attributes of the individual which
the examiner is trying to measure.
Factors that contribute to inconsistency:
 These include characteristics of the individual, test, or situation,
which have nothing to do with the attribute being measured, but
which nonetheless affect test scores.
Domain Sampling Model

 There is a problem in the use of a limited number of items to represent a
larger and more complicated construct.
 A sample of items is utilized instead of the infinite pool of items of the
construct.
 The greater the number of items, the higher the reliability.
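This last point is usually quantified with the general Spearman-Brown prophecy formula, r_new = n·r / (1 + (n − 1)·r), which estimates the reliability of a test lengthened to n times its original size with comparable items. A minimal sketch, using an invented starting reliability of .60:

```python
def lengthened_reliability(r_old: float, n: float) -> float:
    """Spearman-Brown prophecy: estimated reliability of a test made
    n times as long, assuming the added items are comparable."""
    return n * r_old / (1 + (n - 1) * r_old)

# Hypothetical test with reliability .60, then doubled and tripled in length
for n in (1, 2, 3):
    print(f"{n}x length -> estimated reliability {lengthened_reliability(0.60, n):.2f}")
```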
Sources of Measurement Error
A. Item selection
One source of measurement error is the instrument itself. A test developer
must settle on a finite number of items from a potentially infinite pool of
test questions.
Which items should be included? How should they be worded?
Although psychometricians strive to obtain representative test items, the
particular set of questions chosen for a test might not be equally fair to all
persons.
B. Test Administration
General environmental conditions may exert an untoward influence on the
accuracy of measurement, such as uncomfortable room temperature, dim
lighting, and excessive noise.
Momentary fluctuations in the test taker's anxiety, motivation, attention, and
fatigue may also introduce measurement error.
The examiner may also contribute to the measurement error in the process of
test administration.
C. Test Scoring
Whenever a psychological test uses a format other than machine-scored
multiple-choice items, some degree of judgment is required to assign points
to answers.
Most tests have well-defined criteria for answers to each question. These
guidelines help minimize the impact of subjective judgment in scoring.
Tests that yield consistent and reliable scores have reliability coefficients near
1.0; conversely, tests that contain a large amount of measurement error produce
inconsistent, unreliable scores and have reliability coefficients close to 0.

Item Response Theory (IRT)


 With the help of a computer, the item difficulty is calibrated to the
mental ability of the test taker.
 If you got several easy items correct, the computer will then move to
more difficult items.
 If you get several difficult items wrong, the computer moves back to
average items.
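The adaptive logic can be pictured with the toy rule below. Real IRT programs estimate ability and item difficulty statistically, so the fixed step-up/step-down rule and the 1–10 difficulty scale here are invented purely to illustrate the idea.

```python
def next_difficulty(current: int, answered_correctly: bool) -> int:
    """Toy adaptive rule: move to a harder item after a correct answer,
    an easier one after an incorrect answer, within a 1-10 difficulty scale."""
    step = 1 if answered_correctly else -1
    return max(1, min(10, current + step))

difficulty = 5                                     # begin with an average item
for correct in (True, True, False, True, False):   # hypothetical response pattern
    difficulty = next_difficulty(difficulty, correct)
    print(f"Answer {'correct' if correct else 'wrong'} -> next item difficulty {difficulty}")
```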
The Correlation Coefficient
A correlation coefficient (r) expresses the direction and magnitude of the linear
relationship between two sets of scores obtained from the same persons.
It can take on values ranging from -1.00 to +1.00.
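All of the reliability coefficients discussed below are correlations of this kind. A minimal sketch of the Pearson correlation between two sets of scores (the five pairs of scores are invented; in practice a statistics package would be used):

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two paired lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical scores of five examinees on two administrations of the same test
first_testing = [105, 98, 112, 90, 101]
second_testing = [107, 95, 110, 93, 100]
print(f"r = {pearson_r(first_testing, second_testing):.2f}")  # falls between -1.00 and +1.00
```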
FORMS OF RELIABILITY
A. Test-Retest Reliability
It is established by comparing the scores obtained from two successive
measurements of the same individuals and calculating a correlation between
the two sets of scores.
It is also known as time sampling reliability since it measures the error
associated with administering a test at two different times.
This is used only when we measure traits or characteristics that do not change
over time (e.g., IQ).
 Example: You took an IQ test today and you will take it again after
exactly a year. If your scores are almost the same (e.g. 105 and 107),
then the measure has a good test-retest reliability.
 Error variance – corresponds to the random fluctuations of
performance from one test session to the other.
 Clearly, this type of reliability is only applicable to stable traits.
Limitations of Test-Retest Reliability
Carryover effect – occurs when the first testing session influences the results
of the second session and this can affect the test-retest reliability of a
psychological measure.
Practice effect – a type of carryover effect wherein the scores on the second
test administration are higher than they were on the first.
A low correlation between the first and second administrations might mean any
of the following:
 The test has poor reliability.
 A major change has occurred in the subjects between the first and second
administrations.
 A combination of low reliability and major change has occurred.
Sometimes, a poor test-retest correlation does not mean that the test is
unreliable. It might instead mean that the variable under study has changed.

B. Parallel Forms Reliability


It is established when at least two different versions of the test yield almost
the same scores.
It is also known as item sampling reliability or alternate forms reliability since
it compares two equivalent forms of a test that measure the same attribute to
make sure that the items indeed assess a specific characteristic.
The correlation between the scores obtained on the two forms represents the
reliability coefficient of the test.
 Examples:
The Purdue Non-Language Test (PNLT) has Forms A and B, and both
yield nearly identical scores for the test taker.
The SRA Verbal Form has parallel Forms A and B, and both yield nearly
identical scores for the test taker.
 The error variance in this case represents fluctuations in
performance from one set of items to another, but not fluctuations over
time.
 Tests should contain the same number of items, and the items should
be expressed in the same form and should cover the same type of
content.
 The range and level of difficulty of the items should also be equal.
 Instructions, time limits, illustrative examples, format and all other
aspects of the test must likewise be checked for equivalence.
Limitations of Parallel Forms Reliability
This is one of the most rigorous and burdensome assessments of reliability,
since test developers have to create two forms of the same test.
Practical constraints make it difficult to retest the same group of individuals.
C. Inter-rater Reliability
It is the degree of agreement between two observers who simultaneously
record measurements of the same behavior.
 Examples:
Two psychologists observe the aggressive behavior of elementary school
children. If their individual records of the construct are almost the
same, then the measure has a good inter-rater reliability.
Two parents rate the ADHD symptoms of their child. If they yield nearly
identical ratings, then the measure has good inter-rater reliability.
This uses the kappa statistic in order to assess the level of agreement among
several raters using nominal scales.

Kappa Coefficient   Qualitative Interpretation
> 0.75              Excellent agreement
0.40 – 0.75         Satisfactory agreement
< 0.40              Poor agreement
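A minimal sketch of Cohen's kappa for two raters using nominal categories; the behavior codes below are invented, and for more than two raters an extension such as Fleiss' kappa is normally used instead.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: proportion of agreement between two raters, corrected
    for the agreement expected by chance alone."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical behavior codes assigned to ten children by two observers
obs_1 = ["aggressive", "calm", "calm", "aggressive", "calm",
         "calm", "aggressive", "calm", "calm", "calm"]
obs_2 = ["aggressive", "calm", "calm", "calm", "calm",
         "calm", "aggressive", "calm", "calm", "calm"]
print(f"kappa = {cohens_kappa(obs_1, obs_2):.2f}")  # about .74: satisfactory agreement
```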

D. Split-Half Reliability
It is obtained by splitting the items on a questionnaire or test in half,
computing a separate score for each half, and then calculating the degree of
consistency between the two scores for a group of participants.
The test can be divided according to the odd and even numbers of the items
(odd-even system).
This model of reliability measures the internal consistency of the test which
is the degree to which each test item measures the same construct. It is simply
the intercorrelations among the items.
If all items on a test measure the same construct, then it has a good internal
consistency.
Spearman-Brown, Kuder-Richardson, and Cronbach’s alpha are the formulas
used to measure the internal consistency of a test.
Spearman-Brown Formula
A statistic that allows a test developer to estimate what the correlation
between the two halves would have been if each half had been the length of
the whole test, assuming the two halves have equal variances.
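In the split-half case this is the n = 2 version of the prophecy formula shown earlier: r_full = 2·r_half / (1 + r_half). A minimal sketch, with an invented odd-even half correlation of .70:

```python
def spearman_brown(r_half: float) -> float:
    """Estimate full-test reliability from the correlation between two halves,
    assuming the halves are parallel (equal variances)."""
    return 2 * r_half / (1 + r_half)

# Hypothetical correlation of .70 between odd-item and even-item half scores
print(f"Estimated full-test reliability: {spearman_brown(0.70):.2f}")  # about .82
```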

Cronbach’s coefficient alpha


A statistic that allows the test developer to confirm that a test has
substantial reliability even when the two halves of the test have unequal variances.
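Coefficient alpha is usually written as alpha = (k / (k − 1)) × (1 − Σ item variances / variance of total scores), where k is the number of items. A minimal sketch on an invented data set of five examinees and four Likert-type items:

```python
def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def cronbach_alpha(scores):
    """Coefficient alpha; scores holds one list of item responses per examinee."""
    k = len(scores[0])                                            # number of items
    items = [[person[i] for person in scores] for i in range(k)]  # column view
    totals = [sum(person) for person in scores]
    return (k / (k - 1)) * (1 - sum(variance(item) for item in items) / variance(totals))

# Hypothetical Likert-type responses (rows = examinees, columns = items)
data = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
]
print(f"alpha = {cronbach_alpha(data):.2f}")  # about .93 for this made-up data
```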

Kuder-Richardson 20 (KR20) Formula


The statistic used for calculating the reliability of a test in which the items
are dichotomous, scored as 0 or 1.
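KR-20 follows the same pattern as alpha, but each item's variance reduces to p·q, the proportion of examinees passing the item times the proportion failing it. A minimal sketch on invented right/wrong data:

```python
def kr20(scores):
    """KR-20 reliability; scores holds one list of 0/1 item scores per examinee."""
    k = len(scores[0])                                    # number of items
    n = len(scores)
    totals = [sum(person) for person in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    pq_sum = 0.0
    for i in range(k):
        p = sum(person[i] for person in scores) / n       # proportion passing item i
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)

# Hypothetical right (1) / wrong (0) answers of five examinees on five items
data = [
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1],
]
print(f"KR-20 = {kr20(data):.2f}")  # about .79 for this made-up data
```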

Brief Synopsis of Methods for Estimating Reliability

Method                                   No. of Forms   No. of Sessions   Sources of Error Variance
Test-Retest                              1              2                 Changes over time
Alternate Forms (Immediate)              2              1                 Item sampling
Alternate Forms (Delayed)                2              2                 Item sampling; changes over time
Split-Half (Spearman-Brown)              1              1                 Item sampling; nature of split
Coefficient Alpha & Kuder-Richardson     1              1                 Item sampling; test heterogeneity
Inter-Rater                              1              1                 Scorer differences

Which Type of Reliability is Appropriate?


• For tests that have two forms, use parallel forms reliability.
• For tests that are designed to be administered to an individual more
than once, use test-retest reliability.
• For tests with factorial purity, use Cronbach’s coefficient alpha.
• For tests that use a Likert-type (continuum) scale, use Cronbach’s
coefficient alpha.
• For tests with items carefully ordered according to difficulty, use split-
half reliability.
• For tests which involve some degree of subjective scoring, use inter-
rater reliability.
• For tests which involve dichotomous items or forced choice items, use
KR20.
Factors that influence Reliability
• Test length. The longer the test, the more reliable it is.
• Homogeneity of test items. The more homogeneous the test items, the
more reliable the test.
• Heterogeneity of the test group. The more diverse/heterogeneous the test
group, the more reliable the test.
What to do about Low Reliability
• Increase the number of items.
• Use factor analysis and item analysis.
• Use the correction for attenuation formula.
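The correction-for-attenuation formula estimates what the correlation between two measures would be if both were perfectly reliable: r_corrected = r_xy / √(r_xx · r_yy). A minimal sketch with invented values:

```python
def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated correlation between two variables if both tests were perfectly
    reliable, given their observed correlation and their reliabilities."""
    return r_xy / (r_xx * r_yy) ** 0.5

# Hypothetical observed correlation of .40 between tests with reliabilities .70 and .80
print(f"Corrected r = {correct_for_attenuation(0.40, 0.70, 0.80):.2f}")  # about .53
```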
