
Concepts of Reliability

University of Balamand
PSYC 243
Presentation Plan

I. Introduction
II. Theory and Assumptions
A. Classical Test Theory
B. Item Response Theory
III. Sources of measurement error
IV. Measurement error and reliability
V. Standard error of measurement
VI. Types of Reliability
I- Introduction

• Reliability is the attribute of consistency in measurement.
• Few measures of physical or psychological
characteristics are completely consistent, even
from one moment to the next.
I- Introduction

• The concept of reliability is best viewed as a continuum ranging from minimal consistency of measurement to near-perfect repeatability of results (simple reaction time; weight).
• Several statistical methods exist for estimating reliability.
• What are the sources of consistency and
inconsistency in psychological test results?
• Reliability is not absolute.

• Context
– What type of reliability?
– Reliable for what applications?

• Continuum
– Range of reliabilities
– Highly consistent ← → highly inconsistent
II- Theory and Assumptions

A. Classical Test Theory
B. Item Response Theory
Classical Test Theory

• Charles Spearman
• Psychometrician
• Developed factor analysis
• Developed classical test theory
A - Classical Test Theory

• Classical Test Theory (CTT) – often called the “true score model”
• Called classic:
– developed from simple assumptions made by test
theorists since the inception of testing.
– relative to Item Response Theory (IRT) which is a more
modern approach
• CTT describes a set of psychometric procedures used to assess the reliability, difficulty, discrimination, etc. of test items and scales
Classical Test Theory

• CTT analyses are the easiest and most widely used form of analysis. The statistics can be computed by readily available statistical packages (or even by hand)

• CTT analyses are performed on the test as a whole rather than on the item, and although item statistics can be generated, they apply only to that group of students on that collection of items
Classical Test Theory

• Test scores result from the influence of two factors:

1. Factors that contribute to consistency. These consist entirely of the stable attributes of the individual, which the examiner is trying to measure.

2. Factors that contribute to inconsistency. These include characteristics of the individual, test, or situation that have nothing to do with the attribute being measured, but that nonetheless affect test scores.
Classical Test Theory

• Assumes that every person has a true score on an item or a scale, which we could observe if only we could measure it directly without error

• CTT analyses assume that a person’s test score is composed of their “true” score plus some measurement error.
Classical Test Theory

• This is the common true score model

X = T + e
• X is the obtained score
• T is the true score
• e represents the error of measurement
Classical Test Theory

• Although it is impossible to eliminate all measurement error, test developers do strive to minimize this psychometric nuisance through careful attention to the sources of measurement error.
• The true score is never known. We can obtain a
probability that the true score resides within a
certain interval and we can derive a best estimate
of the true score.
[Figure: the true score model X = T + E. Distributions of obtained scores around true scores T1, T2, T3, without and with measurement error; the measurement error around a T can be large or small.]


Sources of Measurement Error

1. Item Selection
2. Test Administration
3. Test Scoring
4. Systematic Measurement Error
1 - Item Selection

• One source of measurement error is the instrument itself.


• A test developer must settle on a finite number of items
from a potentially infinite pool of test questions. Which
questions should be included? How should they be
worded? Item selection is crucial to the accuracy of
measurement.
• In a well-designed test the measurement error from item
sampling will be minimal.
• However, a test is always a sample and never the totality
of a person’s knowledge and behavior.
• Item selection is always a source of measurement error in
psychological testing.
2 - Test Administration

• Numerous sources of measurement error arise from the circumstances of administration.
• General environment conditions: uncomfortable
room temperature, dim lighting, and excessive
noise.
• Momentary fluctuations in anxiety, motivation,
attention, and fatigue level.
• The examiner may unconsciously convey that the test taker is on the right track, or may intimidate a test taker.
3 - Test Scoring

• Whenever a psychological test uses a format other than machine-scored multiple-choice items, some degree of judgment is required to assign points to answers.
• Fortunately, most tests have well-defined criteria for
answers to each question. These guidelines help
minimize the impact of subjective judgment in scoring.
• However, subjectivity of scoring as a source of
measurement error can be a serious problem in the
evaluation of projective tests or essay questions.
4 - Systematic Measurement Error

• The previous sources of error are referred to as unsystematic measurement error, meaning that their effects are unpredictable and inconsistent.
• A systematic measurement error arises when,
unknown to the test developer, a test consistently
measures something other than the trait for which it
was intended.
• Example: scale to measure social introversion also taps
anxiety in a consistent fashion.
• Systematic measurement errors serve as a reminder
that it is very difficult, if not impossible, to truly assess a
trait in pure isolation from other traits.
Measurement Error and
Reliability
• Measurement error reduces the reliability or
repeatability of psychological test results.
• How consistent is a psychological test?
• In classical test theory, unsystematic measurement
errors act as random influences; they behave like
random variables.
• The CTT accepts this essential randomness of
measurement error as an axiomatic (obvious)
assumption.
• CTT assumptions:

– Because they are random events, unsystematic measurement errors are equally likely to be positive or negative and will, therefore, average out to zero across a large group of subjects.
– The mean error of measurement is zero.
– No correlation with true scores.
– Not correlated with errors on other tests.
Main features of Classical Theory

1. Measurement errors are random
2. Mean error of measurement = 0
3. True scores and errors are uncorrelated: rTe = 0
4. Errors on different tests are uncorrelated: r12 = 0
The Reliability Coefficient
• Reliability is theoretically the squared correlation between the observed test score and the true score:

r_XX = ρ²(X, T) = σ²_T / σ²_X

• Essentially, the proportion of the variance in X that is due to T
The Reliability Coefficient

• Reliability can be viewed as a measure of consistency, or how well a test “holds together”
• Reliability is measured on a scale of 0 to 1. The greater the number, the higher the reliability.
• The approach to estimating reliability depends
on
– Estimation of “true” score
– Source of measurement error
The Correlation Coefficient

• A correlation coefficient (r) expresses the degree of linear relationship between two sets of scores obtained from the same persons.
• Can take values ranging from – 1.00 to + 1.00.
• + 1.00 signifies a perfect linear relationship
between the two sets of scores.
• - 1.00 signifies an equally strong relationship but
with inverse correspondence: the highest score on
one variable corresponding to the lowest score on
the other, and vice versa.
The Correlation Coefficient

• Correlations of + 1.00 or – 1.00 are extremely rare in psychological research.
• Example: the correlation coefficient between height
and weight = + .80
The Correlation Coefficient as a Reliability
Coefficient

• One use of the correlation coefficient is to determine the consistency of psychological test scores.
• If test results are highly consistent, then the scores
of persons taking the test on two occasions will be
strongly correlated, perhaps even approaching the
theoretical upper limit of + 1.00.
• In this context, the correlation coefficient is also a
reliability coefficient.
Methods for Estimating
Reliability
a) Temporal Stability Approaches
b) Internal Consistency Approaches
a) Temporal Stability Approaches

• Types of reliability
1) Test-retest
2) Alternate-Forms
1) Test-Retest Reliability

• Evaluates the error associated with administering a test at two different times.
• Time Sampling Error
• How-To:
– Give test at Time 1
– Give SAME TEST at Time 2
– Calculate r for the two scores
• Easy to do; one test does it all.
Test-Retest Reliability

• Sources of error
– random fluctuations in performance
– uncontrolled testing conditions
• extreme changes in weather
• sudden noises / chronic noise
• other distractions
– internal factors
• illness, fatigue, emotional strain, worry
• recent experiences
Test-Retest Reliability

• Generally used to evaluate constant traits.
– Intelligence, personality
• Not appropriate for qualities that change rapidly over time.
– Mood, hunger
• Problem: Carryover Effects
– Exposure to the test at time #1 influences scores on the test at
time #2
• Only a problem when the effects are random.
• If everybody goes up 5pts, you still have the same
variability
Test-Retest Reliability

• Practice effects
– Type of carryover effect
– Some skills improve with practice
• Manual dexterity, ingenuity or creativity
– Practice effects may not benefit everybody in the same way.
• Carryover & Practice effects more of a problem with
short inter-test intervals (ITI).
• But, longer ITIs have other problems
– developmental change, maturation, exposure to historical
events
2) Alternate-Forms Reliability

• Evaluates the error associated with selecting a particular set of items.
• Item Sampling Error
• How To:
– Develop a large pool of items (i.e. Domain) of varying
difficulty.
– Choose equal distributions of difficult / easy items to
produce multiple forms of the same test.
– Give both forms close in time.
– Calculate r for the two administrations.
Alternate-Forms Reliability

• Can give parallel forms at different points in time; produces error estimates of both time and item sampling.
• One of the most rigorous assessments of reliability
currently in use.
• Infrequently used in practice – too expensive to
develop two tests.
b) Reliability as Internal Consistency

1) Split-Half Reliability
2) The Spearman-Brown Formula
3) Coefficient Alpha
4) Interscorer Reliability
1) Split-Half Reliability

• What if we treat halves of one test as parallel forms? (Single test as whole domain)
• That’s what a split-half reliability does
• This is testing for Internal Consistency
– Scores on one half of a test are correlated with
scores on the second half of a test.
• Big question: “How to split?”
– First half vs. last half
– Odd vs Even
Split-Half Reliability

• How to:
– Compute scores for the two halves of a single test, calculate r.
• In addition to calculating a Pearson r between scores on the two equivalent halves of the test, the computation of a coefficient of split-half reliability entails an additional step: adjusting the half-test reliability using the Spearman-Brown formula.
2) The Spearman Brown Formula

• Estimates the reliability for the entire test based on the split-half
• Can also be used to estimate the effect that changing the number of items on a test has on its reliability
The Spearman Brown Formula

r* = jr / (1 + (j − 1)r)

• Where r* is the estimated reliability, r is the correlation between the halves, and j is the new length as a proportion of the old length
The Spearman Brown Formula

• For a split-half it would be

r* = 2r / (1 + r)

• since the full length of the test is twice the length of each half (j = 2)
The Spearman Brown Formula

• Example: a 30-item test with a split-half reliability of .65:

r* = 2(.65) / (1 + .65) = 1.30 / 1.65 ≈ .79

• The .79 is a much better reliability estimate than the .65
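The Spearman-Brown step-up can be written as a one-line function; the sketch below reproduces the slides' .65 split-half example:

```python
def spearman_brown(r: float, j: float) -> float:
    """Estimated reliability of a test whose length is changed by
    factor j, given current reliability r (j = 2 for a split-half)."""
    return (j * r) / (1 + (j - 1) * r)

# Split-half correlation of .65, stepped up to the full-length test.
print(round(spearman_brown(0.65, j=2), 2))  # 0.79
```

The same function projects the reliability of a lengthened or shortened test, e.g. j = 1.5 for a test made 50% longer.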
3) Coefficient Alpha

• If items are measuring the same construct, they should elicit similar, if not identical, responses
• Coefficient Alpha (Cronbach’s Alpha) is a widely used measure of internal consistency for continuous data
• Since the variance of a composite is the sum of the item variances and covariances, we can assess consistency by how much covariance exists between the items relative to the total variance
Coefficient Alpha

• Coefficient Alpha is considered a lower-bound estimate of the reliability of continuous items
• It was developed by Cronbach in the 1950s, but is based on an earlier formula by Kuder and Richardson from the 1930s that tackled internal consistency for dichotomous (yes/no, right/wrong) items
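Coefficient alpha can be computed from the item variances and the variance of the composite. The sketch below uses hypothetical item scores, not data from the slides:

```python
import statistics

def cronbach_alpha(items):
    """Coefficient alpha for a list of items, each a list holding one
    score per examinee: (k / (k - 1)) * (1 - sum of item variances
    divided by the variance of the composite)."""
    k = len(items)
    item_vars = sum(statistics.variance(item) for item in items)
    composites = [sum(scores) for scores in zip(*items)]  # per person
    total_var = statistics.variance(composites)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 3-item scale answered by five examinees.
items = [
    [3, 4, 2, 5, 1],  # item 1
    [3, 5, 2, 4, 2],  # item 2
    [2, 4, 3, 5, 1],  # item 3
]
print(round(cronbach_alpha(items), 2))  # 0.93
```

When the items covary strongly, the composite variance dwarfs the summed item variances and alpha approaches 1.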
4) Interscorer Reliability

• What if you’re not using a test but instead observing an individual’s behaviors as a psychological assessment tool?
• How about projective tests, tests of moral development and creativity?
• How can we tell if the judges (assessors) are reliable?
Interscorer Reliability

• Typically, a set of criteria is established for judging the behavior, and the judge is trained on those criteria
• Then, to establish the reliability of both the set of criteria and the judge, multiple judges rate the same series of behaviors
• The correlation between the judges is the typical measure of reliability
• But couldn’t they agree by accident? Especially on dichotomous or ordinal scales?
Interscorer Reliability

• Kappa is a measure of inter-rater reliability that controls for chance agreement
• Values range from -1 (less agreement than expected
by chance) to +1 (perfect agreement)
• Above .75: “excellent”
• .40 to .75: “fair to good”
• Below .40: “poor”
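Cohen's kappa for two raters can be sketched as follows; the dichotomous ratings are hypothetical:

```python
def cohens_kappa(rater1, rater2):
    """Agreement between two raters, corrected for chance:
    kappa = (p_observed - p_chance) / (1 - p_chance)."""
    n = len(rater1)
    # Observed agreement: proportion of cases coded the same.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over categories.
    categories = set(rater1) | set(rater2)
    p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical dichotomous codes (1 = behavior present, 0 = absent).
r1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
r2 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
print(round(cohens_kappa(r1, r2), 2))  # 0.58, "fair to good"
```

Here the raters agree on 8 of 10 cases (80%), but because 52% agreement is expected by chance alone, kappa credits only the agreement beyond that.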
B- Item Response Theory

• Beginning slowly in the 1960s.
• Item response theory (IRT), or latent trait theory.
• A collection of mathematical models and statistical
tools.
• The applications include:
– analyzing items and scales,
– developing homogeneous psychological measures,
– measuring individuals on psychological constructs,
– administering psychological tests by computer
IRT

• The foundational elements of IRT include:
1. Item response functions
2. Information functions
3. The assumption of invariance
1- Item Response Functions (IRF)

• Mathematical equation that describes the relation between the amount of a latent trait an individual possesses and the probability that he or she will give a designated response to a test item designed to measure that construct.
• It can be used to determine the difficulty level of test
items.
• In IRT, difficulty is indexed by how much of the trait is needed to answer the item correctly, while in CTT the difficulty level of an item is equivalent to the proportion of examinees in a standardization sample who pass the item.
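The slides do not name a specific model, but a common choice of IRF (assumed here for illustration) is the two-parameter logistic, in which difficulty b is the trait level at which the pass probability is .50 and a is the discrimination:

```python
import math

def irf_2pl(theta, a, b):
    """Two-parameter logistic IRF: probability of a correct response
    given trait level theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b an examinee has exactly a .50 chance of passing.
print(irf_2pl(theta=0.0, a=1.5, b=0.0))            # 0.5
# A harder item (b = 1.0) is passed less often at the same theta.
print(round(irf_2pl(theta=0.0, a=1.5, b=1.0), 2))  # 0.18
```

This is the sense in which IRT indexes difficulty by the amount of the trait needed: raising b shifts the whole curve toward higher trait levels.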
IRF

• The item discrimination parameter is a gauge of how well the item differentiates among individuals at a specific level of the trait in question.
• The appealing advantage of the IRT approach to measurement is that the probability of a respondent answering a particular question correctly can be expressed as a precise mathematical equation.
2 - Information Functions

• In psychological measurement, information represents the capacity of a test item to differentiate among people.
• Using a simple mathematical conversion, an item
information function can be derived from the IRF
for each item. This function portrays graphically the
relationship between the trait level of examinees
and the information provided by the test item.
Information Function

CTT: A single reliability (precision of measurement) is typically calculated for the entire test.
IRT: Precision of measurement can vary, depending on where an individual falls in the trait range.

CTT: A single set of items is typically administered to all examinees.
IRT: A different collection of test items might be used for each examinee to obtain a predetermined precision of measurement.
3 – Invariance in IRT

• Invariance means that an examinee’s position on a latent-trait continuum (his or her score) can be estimated from the responses to any set of test items with known IRFs.
• IRFs do not depend on the characteristics of a
particular population.
• The scale of the trait exists independently of any
set of items and independently of any particular
population.
IRT Approach

• Although IRT analyses typically require large samples – several hundred or thousands of respondents – the necessary software is straightforward and commonly available.

• IRT approaches to test development likely will become increasingly prominent in the years ahead.
Standard Error of Measurement

• So far we’ve talked about the standard error of measurement as the error associated with trying to estimate a true score from a specific test
• This error can come from many sources
Standard Error of Measurement
• We can calculate its size by

SEM = s√(1 − r)

• s is the standard deviation;
• r is the reliability
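The standard error of measurement, SEM = s√(1 − r), can be sketched with hypothetical values (an IQ-style scale with s = 15 and reliability .89; the numbers are illustrative, not from the slides):

```python
import math

def sem(s, r):
    """Standard error of measurement: s * sqrt(1 - r), where s is the
    test's standard deviation and r its reliability."""
    return s * math.sqrt(1 - r)

# Hypothetical scale: standard deviation 15, reliability .89.
print(round(sem(15, 0.89), 1))  # 5.0
```

As reliability approaches 1, the SEM shrinks toward 0 and the obtained score becomes a tighter estimate of the true score.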
