RELIABILITY


PSYCHOLOGICAL ASSESSMENT

RELIABILITY

Prepared & Presented By:


Ms. Giselle C. Honrado
LEARNING OBJECTIVES

1. Define reliability, reliability coefficient, and reliability estimates.
2. Discuss the sources of error variance, reliability estimates, and how to use and interpret a coefficient of reliability.
THE CONCEPT OF RELIABILITY

Synonym for dependability or consistency
In the language of psychometrics, reliability refers to consistency in measurement
A reliable test is consistent in its measurements
Reliability Coefficient - an index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance
CONCEPTUALIZATION OF ERROR

Error does not imply that a mistake has been made. Rather, error implies that there will always be some inaccuracy in measurements.
Task: to find the magnitude of such errors and to develop ways to minimize them.

Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
CONCEPTUALIZATION OF ERROR

In psychology, the measurement task is more difficult:
First, researchers are rarely interested in measuring simple qualities such as width. Instead, they usually pursue complex traits such as intelligence or aggressiveness, which one can neither see nor touch.

Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
CONCEPTUALIZATION OF ERROR

In psychology, the measurement task is more difficult:
Further, with no rigid yardsticks available to measure such characteristics, testers must use “rubber yardsticks”; these may stretch to overestimate some measurements and shrink to underestimate others (Kline, 2015b; Thomas, 2012).

Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
CONCEPTUALIZATION OF ERROR

Psychologists must assess their measuring instruments to determine how much “rubber” is in them.
A psychologist who is attempting to understand human behavior on the basis of unreliable tests is like a carpenter trying to build a house with a rubber measuring tape that never records the same length for the same piece of board.

Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
THE CONCEPT OF RELIABILITY: ERRORS
IN MEASUREMENT

Error - the component of the observed test score that does not have to do with
the test taker's ability
X = T + E (where X is the observed score, T is the true score, and E is the error)
Variance - used to describe sources of test score variability; the standard
deviation squared
True variance - variance from true differences; assumed to be stable across
time and repeated measures
Error variance - variance from irrelevant, random sources
Total variance = True variance + Error variance
THE CONCEPT OF RELIABILITY

Reliability refers to the proportion of the total variance attributed to true variance.
The greater the proportion of the total variance attributed to true variance, the more reliable the test.
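
As a hedged illustration of this ratio (not from the source; the score distributions and the 0.90 target below are invented), a short simulation shows how true-score and error components combine into total variance:

```python
# A minimal sketch of the classical model X = T + E (all numbers invented):
# reliability is the share of total score variance that is true-score variance.
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(loc=100, scale=15, size=10_000)  # hypothetical true scores (SD = 15)
error = rng.normal(loc=0, scale=5, size=10_000)           # random measurement error (SD = 5)
observed = true_scores + error                            # X = T + E

true_var = true_scores.var()
total_var = observed.var()

reliability = true_var / total_var
print(f"True variance ≈ {true_var:.0f}, total variance ≈ {total_var:.0f}")
print(f"Reliability ≈ {reliability:.2f}")  # expected ≈ 15**2 / (15**2 + 5**2) = 0.90
```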
HISTORY: SPEARMAN'S EARLY STUDIES
Psychology owes the advanced development of
reliability assessment to the early work of the
British psychologist Charles Spearman.
Pioneer of factor analysis.
He introduced the concept of general intelligence, more commonly known as the g factor.
His most famous statistical invention, Spearman’s rank correlation coefficient, is used to measure the statistical dependence between two variables.

Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
HISTORY: SPEARMAN'S EARLY STUDIES

In 1733, Abraham De Moivre introduced the basic notion of sampling error (Stanley, 1971); and
in 1896, Karl Pearson developed the product moment correlation (Pearson, 1901).
Reliability theory puts these two concepts together in the context of measurement.

Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
HISTORY: SPEARMAN'S EARLY STUDIES

In 1904, Spearman worked out most of the basics of contemporary reliability theory and published his work in an article entitled “The Proof and Measurement of Association between Two Things.”
Spearman published his work in the American Journal of Psychology, and it quickly became known in the United States.

Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
HISTORY: SPEARMAN'S EARLY STUDIES
The article came to the attention of measurement pioneer
Edward L. Thorndike, who was then writing the first edition
of An Introduction to the Theory of Mental and Social
Measurements (1904).
During the 1920s, Thorndike developed a test of intelligence consisting of completion, arithmetic, vocabulary, and directions tests, known as the CAVD. This test is saturated with g. The instrument was intended to measure intellectual level on an absolute scale. The logic underlying the test predicted elements of test design that eventually became the foundation of modern intelligence tests.

Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
HISTORY: SPEARMAN'S EARLY STUDIES

In 1937, Kuder and Richardson published an article in which several new reliability coefficients were introduced.
Later, Cronbach and his colleagues (Cronbach, 1989, 1995) made a major advance by developing methods for evaluating many sources of error in behavioral research.

Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
HISTORY: SPEARMAN'S EARLY STUDIES

Reliability theory continues to evolve. More recently, item response theory (IRT) has taken advantage of computer technology to advance psychological measurement significantly (Drasgow & Olson-Buchanan, 1999; McDonald, 1999; Michell, 1999).


Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
THE CONCEPT OF RELIABILITY:
MEASUREMENT ERRORS
Measurement Error - all of the factors associated with the process of measuring some variable, other than the variable being measured; examples include the test administration procedures, the test taker's culture, etc.
Categories:
Random Error - caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process; sometimes referred to as "noise"
Systematic Error - a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured (example: a miscalibrated ruler that always reads long by the same amount)
SOURCES OF ERROR
VARIANCE
Test Construction - item sampling or content
sampling; terms that refer to variation among items
within a test as well as to variation among items
between tests
Item sampling - the way the items are worded, to the point that test takers may have different interpretations of an item, making the test unreliable
Content sampling - the extent to which a test
taker's score is affected by the content sampled on
a test and by the way the content is sampled
SOURCES OF ERROR
VARIANCE
Test Administration - may influence the test taker's
attention or motivation; test taker's reactions to
those influences are the source of one kind of error
variance
Test environment
Test taker
Examiner
Test Scoring and Interpretation - scoring criteria
vary as a function of who is doing the testing and
scoring; if subjectivity is involved in scoring, then
the scorer (or rater) can be a source of error
variance
SOURCES OF ERROR
VARIANCE
Other sources of error
Sampling error - a statistical error that occurs
when an analyst/researcher does not select a
sample that represents the entire population
of data.
Methodological error - e.g. interviewers may
not have been trained properly, the wording
in the questionnaire may have been
ambiguous, or the items may have somehow
been biased
HOW TO MEASURE RELIABILITY?
TEST-RETEST RELIABILITY ESTIMATE
Test-retest reliability estimate - an estimate
of reliability obtained by correlating pairs of
scores from the same people on two
different administrations of the same test.
Coefficient of stability - an index of reliability
determined via a test–retest method, in
which the same test is administered to the
same respondents at two different points in
time. (Source: APA Dictionary of Psychology:
https://dictionary.apa.org)
Only applicable to tests that measure
something that is stable over time
Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles,
Applications, and Issues, 9th Ed. Cengage Learning.
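
As a small, hypothetical illustration (the scores below are invented, and scipy is assumed to be available), the coefficient of stability is simply the Pearson correlation between the two administrations:

```python
# Test-retest reliability sketch: correlate scores from two administrations
# of the same test to the same people (invented data).
import numpy as np
from scipy.stats import pearsonr

time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])   # first administration
time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])  # same people, second administration

coefficient_of_stability, _ = pearsonr(time1, time2)
print(f"Test-retest reliability (coefficient of stability): {coefficient_of_stability:.2f}")
```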
HOW TO MEASURE RELIABILITY?
TEST-RETEST RELIABILITY ESTIMATE
Test-retest reliability is relatively easy to evaluate. However, one needs to consider many other details besides the methods for calculating the test-retest reliability coefficients.
Memory and practice effects can influence the accuracy of the test-retest method (carryover effect):
Test takers sometimes remember their answers from the first time they took the test.
Test takers score better because they have sharpened their skills by having taken the test the first time.
Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
HOW TO MEASURE RELIABILITY?
PARALLEL-FORMS & ALTERNATE-FORMS RELIABILITY ESTIMATE

Parallel forms reliability measures the correlation between two equivalent versions of a test. You use it when you have two different assessment tools or sets of questions designed to measure the same thing.
Coefficient of equivalence - used to
compute the reliability of alternate forms of
a test, based either upon two
administrations or on a single
administration with odd and even items
constituting separate forms.
(Source:
https://www.csus.edu/indiv/d/deaner/glossary.htm#c)
HOW TO MEASURE RELIABILITY?
PARALLEL-FORMS & ALTERNATE-FORMS RELIABILITY ESTIMATE

Parallel forms of a test exist when, for each form of the test, the means and the variances of the observed test scores are equal.
Parallel forms reliability (Equivalent forms) - an
estimate of the extent to which item sampling
and other errors have affected test scores on
versions of the same test when, for each form of
the test, the means and variances of observed
test scores are equal.
Reference: Henchy, Alexandra Marie, “REVIEW AND
EVALUATION OF RELIABILITY GENERALIZATION RESEARCH”
(2013). Theses and Dissertations–Educational, School, and
Counseling Psychology. Paper 5.
http://uknowledge.uky.edu/edp_etds/5
HOW TO MEASURE RELIABILITY?
PARALLEL-FORMS & ALTERNATE-FORMS RELIABILITY ESTIMATE

Parallel forms reliability (also called equivalent forms reliability) uses one set of questions divided into two equivalent sets (“forms”), where both sets contain questions that measure the same construct, knowledge, or skill. The two sets of questions are given to the same sample of people within a short period of time, and an estimate of reliability is calculated from the two sets.
HOW TO MEASURE RELIABILITY?
PARALLEL-FORMS & ALTERNATE-FORMS RELIABILITY ESTIMATE

Step 1: Give test A to a group of 50 students on a Monday.
Step 2: Give test B to the same group of students that Friday.
Step 3: Correlate the scores from test A and test B (see the sketch below).
Reference: Henchy, Alexandra Marie, “REVIEW AND EVALUATION
OF RELIABILITY GENERALIZATION RESEARCH” (2013). Theses and
Dissertations–Educational, School, and Counseling Psychology.
Paper 5. http://uknowledge.uky.edu/edp_etds/5
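
A hedged sketch of Steps 1-3 (the student scores are invented, and scipy is assumed); it also compares the means and variances that would justify calling the forms “parallel”:

```python
# Parallel/alternate forms sketch: correlate Form A (Monday) with Form B (Friday)
# for the same students, and compare the forms' means and variances (invented data).
import numpy as np
from scipy.stats import pearsonr

form_a = np.array([24, 31, 18, 27, 35, 22, 29, 26])  # test A scores
form_b = np.array([25, 30, 20, 26, 34, 23, 28, 27])  # test B scores, same students

coefficient_of_equivalence, _ = pearsonr(form_a, form_b)
print(f"Coefficient of equivalence: {coefficient_of_equivalence:.2f}")
print(f"Form A mean = {form_a.mean():.1f}, variance = {form_a.var(ddof=1):.1f}")
print(f"Form B mean = {form_b.mean():.1f}, variance = {form_b.var(ddof=1):.1f}")
```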
HOW TO MEASURE RELIABILITY?
PARALLEL-FORMS & ALTERNATE-FORMS RELIABILITY ESTIMATE

Alternate forms - different versions of a test that have been constructed so as to be parallel; typically designed to be equivalent with respect to variables such as content and level of difficulty.
Alternate forms - a set of test items that are
developed to be similar to another set of test
items, so that the two sets represent different
versions of the same test. (APA Dictionary of
Psychology)
HOW TO MEASURE RELIABILITY?
PARALLEL-FORMS & ALTERNATE-FORMS RELIABILITY ESTIMATE

Alternate forms reliability - an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.
Similarities between these two estimates (Parallel Forms and Alternate Forms):
Two test administrations with the same group
are required
Test scores may be affected by factors such as
motivation, fatigue, or intervening events such
as practice, learning, or therapy (although not
as much as when the same test is administered
twice).
HOW TO MEASURE RELIABILITY?
PARALLEL-FORMS & ALTERNATE-FORMS RELIABILITY ESTIMATE

Question: What is the difference between alternate forms and parallel forms of a test?
Answer: In order to call the forms “parallel,” the observed scores must have the same means and variances. If the tests are merely different versions (without the “sameness” of observed scores), they are called alternate forms.
HOW TO MEASURE RELIABILITY?
PARALLEL-FORMS & ALTERNATE-FORMS RELIABILITY ESTIMATE

Parallel forms of a test
Disadvantages: (1) One has to create a large number of questions that measure the same construct. (2) Proving that the two test versions are equivalent (parallel) can be a challenge.
Advantage: Parallel forms reliability can avoid some problems inherent in test-retesting.

Alternate forms of a test
Disadvantage: Time consuming and expensive.
Advantage: To the test user, it minimizes the effect of memory for the content of a previously administered form of the test.
HOW TO MEASURE RELIABILITY?
SPLIT-HALF RELIABILITY ESTIMATE

Split-half reliability estimate - obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
A useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense).
HOW TO MEASURE RELIABILITY?
SPLIT-HALF RELIABILITY ESTIMATE

Steps:
Step 1: Divide the test into equivalent halves.
Step 2: Calculate a Pearson r between scores
on the two halves of the test.
Step 3: Adjust the half-test reliability using
the Spearman-Brown formula.
Acceptable ways to split a test:
Random assignment of test items
Odd-even reliability
Divide the test by content
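
A minimal sketch of the three steps, using an odd-even split (the half-test scores are invented, and scipy is assumed to be available):

```python
# Split-half sketch: Step 1 split (here, totals on odd vs. even items),
# Step 2 Pearson r between halves, Step 3 Spearman-Brown adjustment (invented data).
import numpy as np
from scipy.stats import pearsonr

odd_half = np.array([8, 6, 9, 5, 7, 10, 4, 8])   # each person's total on odd-numbered items
even_half = np.array([7, 6, 9, 6, 8, 9, 5, 7])   # each person's total on even-numbered items

r_half, _ = pearsonr(odd_half, even_half)        # Step 2: correlation between the halves
r_sb = (2 * r_half) / (1 + r_half)               # Step 3: Spearman-Brown adjustment (n = 2)
print(f"Half-test r = {r_half:.2f}, adjusted full-test reliability = {r_sb:.2f}")
```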
HOW TO MEASURE RELIABILITY?
SPLIT-HALF RELIABILITY ESTIMATE

Primary objective of splitting a test in half: to
create "mini-parallel-forms" with each half equal
to the other in format, stylistic, statistical, and
related aspects.
HOW TO MEASURE RELIABILITY?
SPLIT-HALF RELIABILITY ESTIMATE

The Spearman-Brown formula - allows a test
developer or user to estimate internal
consistency reliability from a correlation of two
halves of a test.

After the initial three (3) steps:
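
The formula itself appears only as an image in the original slides; in standard notation (a reconstruction consistent with the variable definitions below), it is:

r_{SB} = \frac{n \, r}{1 + (n - 1) \, r}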


Where:
rSB = reliability adjusted by Spearman-Brown
formula
r = Pearson r in the original-length test
n = factor by which the length of the test is changed;
number of items in the revised version divided by
the number of items in the original version
HOW TO MEASURE RELIABILITY?
SPLIT-HALF RELIABILITY ESTIMATE

Spearman-Brown formula for the adjustment of split-half reliability:
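
For the split-half case the test length is doubled (n = 2), so the general formula reduces to the familiar form below (again a standard reconstruction, since the slide shows the formula only as an image), where r_hh is the Pearson r between the two half-tests:

r_{SB} = \frac{2 \, r_{hh}}{1 + r_{hh}}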

HOW TO MEASURE RELIABILITY?
SPLIT-HALF RELIABILITY ESTIMATE

Take note:
Usually, but not always, reliability increases as test length increases.
Estimates of reliability based on consideration of the entire test tend to be higher than those based on half of a test.
If test developers/users wish to shorten a test, the Spearman-Brown formula may be used to estimate the effect of the shortening on the test's reliability.
The Spearman-Brown formula can also be used to determine the number of items needed to attain a desired level of reliability.
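
As a hedged worked example of that last point (the reliability values are invented), the Spearman-Brown formula can be rearranged to estimate the factor n by which a test must be lengthened to move from a current reliability r to a desired reliability r_d:

n = \frac{r_d (1 - r)}{r (1 - r_d)}

For instance, going from r = .70 to r_d = .90 gives n = (.90)(.30) / ((.70)(.10)) ≈ 3.9, i.e., the test would need roughly four times its current number of items.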
HOW TO MEASURE RELIABILITY?
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

Internal (inter-item) consistency - the degree of correlation among all the items in a scale.
refers to the extent of consistency
between multiple items measuring the
same construct.
Test of Homogeneity - tests are said to be
homogenous if they contain items that
measure a single trait.
Test of Heterogeneity - the degree to which
a test measures different factors.
The more homogenous a test is, the more
inter-item consistency it can be expected to
have.
HOW TO MEASURE RELIABILITY?
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

The more homogeneous a test is, the more inter-item consistency it can be expected to have.
Test homogeneity is desirable because it allows relatively straightforward test-score interpretation.
Though desirable, a homogeneous test is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality.
To circumvent this potential source of difficulty: measure each component of a heterogeneous variable with a test battery.
HOW TO MEASURE RELIABILITY?
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

Kuder-Richardson formula 20 (KR-20) - the statistic of choice for determining the inter-item consistency of dichotomous items, primarily items that can be scored right or wrong (e.g., multiple-choice items).
KR-21 formula
Developed by Kuder and Richardson
Used to obtain an approximation of KR-20
May be used if there is reason to assume that all the test items have approximately the same degree of difficulty (an assumption that is seldom justified)
Has become outdated in an era of calculators and computers
One variant of KR-20: coefficient α-20; this expression incorporates both the Greek letter alpha (α) and the number 20.
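
A minimal KR-20 sketch, assuming a small invented matrix of right/wrong (1/0) item scores with rows as test takers and columns as items:

```python
# KR-20 sketch for dichotomous items (invented data):
# KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores)
import numpy as np

scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
])                                           # 5 test takers x 5 items

k = scores.shape[1]                          # number of items
p = scores.mean(axis=0)                      # proportion answering each item correctly
q = 1 - p                                    # proportion answering each item incorrectly
total_var = scores.sum(axis=1).var(ddof=1)   # variance of total test scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR-20 = {kr20:.2f}")
```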
HOW TO MEASURE RELIABILITY?
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

Coefficient alpha / Cronbach's alpha - appropriate for use on tests containing non-dichotomous items (e.g., items rated on a 1-5 scale)
The preferred statistic for obtaining an estimate of internal consistency reliability.
Yields values from 0.00 to 1.00 (unlike Pearson's r, which ranges from -1.00 to +1.00).
Bigger is not always better: a value of alpha above .90 may be "too high" and indicate redundancy in the items.
Reference: https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/cronbachs-alpha-spss/
HOW TO MEASURE RELIABILITY?
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

Note: Coefficient alpha / Cronbach's alpha is not a measure of homogeneity and not a measure of unidimensionality.
Coefficient alpha / Cronbach's alpha can be computed in SPSS, Jamovi, or Excel.
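
A minimal coefficient alpha sketch in Python (the Likert-type ratings are invented); the same matrix could be fed to SPSS or Jamovi to check the result:

```python
# Coefficient (Cronbach's) alpha sketch for non-dichotomous items (invented 1-5 ratings):
# alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)
import numpy as np

items = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
    [3, 2, 3, 3],
])                                            # 6 respondents x 4 items

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)         # variance of each item
total_var = items.sum(axis=1).var(ddof=1)     # variance of respondents' total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```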
HOW TO MEASURE RELIABILITY?
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

Another variation: [formula shown only as an image in the original slides]
HOW TO MEASURE RELIABILITY?
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

Coefficient alpha assumes that each item has equal weight/strength/sensitivity to the construct being measured.
McDonald's omega is a reliability coefficient similar to Cronbach's alpha. The main advantage of omega, compared to Cronbach's alpha, is that it takes into account the strength of association between items.
HOW TO MEASURE RELIABILITY?
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

McDonald's omega coefficient - reflects true population estimates of reliability through the removal of certain scale items.
Omega (ω) is computed as the ratio of the variance due to the common attribute (i.e., the factor) to the total variance.
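
In factor-analytic notation (a standard single-factor formulation, not reproduced from the slide), with item loadings λ_i on the common factor and unique variances θ_i, this ratio can be written as:

\omega = \frac{\left(\sum_i \lambda_i\right)^2}{\left(\sum_i \lambda_i\right)^2 + \sum_i \theta_i}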
HOW TO MEASURE RELIABILITY?
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

Average Proportional Distance (APD) - a relatively new measure for evaluating the internal consistency of a test.
A measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.
Steps:
Step 1: Calculate the absolute difference
between scores for all the items.
Step 2: Average the difference between scores.
Step 3: Obtain the APD by dividing the average
difference between scores by the number of
response options on the test, minus one.
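
A rough sketch of the three steps (the ratings are invented, and Step 1 is interpreted here as the absolute difference between every pair of item scores for each respondent, which is an assumption since the slide does not spell out the pairing):

```python
# Average Proportional Distance (APD) sketch for a 1-5 response scale (invented data).
import numpy as np
from itertools import combinations

items = np.array([
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 4],
    [2, 3, 2],
])                      # respondents x items
n_options = 5           # items use a 1-5 response scale

# Steps 1-2: absolute difference for each pair of items, averaged over respondents and pairs
pair_diffs = [np.abs(items[:, i] - items[:, j]).mean()
              for i, j in combinations(range(items.shape[1]), 2)]

apd = np.mean(pair_diffs) / (n_options - 1)   # Step 3: divide by (response options - 1)
print(f"APD = {apd:.2f}")  # values of about .2 or lower suggest good internal consistency
```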
HOW TO MEASURE RELIABILITY?
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

Average Proportional Distance (APD)
The general "rule of thumb" for interpreting the APD is that an obtained value of:
.2 or lower is indicative of excellent internal consistency;
.2 to .25 is in the acceptable range; and
a value above .25 is suggestive of problems with the internal consistency of the test.
Potential advantage: the APD index is not connected to the number of items on a measure. Cronbach's alpha will be higher when a measure has more than 25 items.
The best course of action when evaluating internal consistency: integrate Cronbach's alpha, mean inter-item correlations, and the APD.
HOW TO MEASURE RELIABILITY?
MEASURES OF INTER-SCORER RELIABILITY

Inter-Scorer Reliability - the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
Often used when coding nonverbal behavior.
Coefficient of inter-scorer reliability - determines the degree of consistency among scorers in the scoring of a test.
General Steps
1. Look for at least two raters.
2. Teach the scoring system.
3. Administer / facilitate the test or performance.
4. Let the two raters conduct their evaluations.
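
A hedged illustration of the general steps (the ratings are invented, and scipy is assumed): with two raters scoring the same performances, a simple summary is the correlation between their ratings, alongside the proportion of exact agreements:

```python
# Inter-scorer reliability sketch: two raters score the same eight performances (invented data).
import numpy as np
from scipy.stats import pearsonr

rater_1 = np.array([3, 4, 2, 5, 4, 3, 1, 4])
rater_2 = np.array([3, 4, 3, 5, 4, 2, 1, 4])

r, _ = pearsonr(rater_1, rater_2)              # consistency between the two raters
exact_agreement = (rater_1 == rater_2).mean()  # proportion of identical ratings
print(f"Inter-scorer correlation: {r:.2f}, exact agreement: {exact_agreement:.0%}")
```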
REPORTING RELIABILITY

In test reports and manuals, reliability is expressed as a correlation, which is referred to as a reliability coefficient.
1.00
The closer a reliability coefficient is to 1.00, the more reliable the scores generated by the instrument.
Any reliability coefficient less than 1.00 indicates the presence of error in test scores.
.70 & .80
Reliability estimates in the range of .70 and .80 are good enough for most purposes of basic research.
Typical range: 0.80 to 0.95; however, an acceptable reliability coefficient depends on the purpose of the test.

Reference: Kaplan, R. & Saccuzzo, D. (2018). Psychological Testing: Principles, Applications, and Issues, 9th Ed. Cengage Learning.
REPORTING RELIABILITY
In clinical settings, high reliability is extremely important.
When tests are used to make important decisions about someone's future, evaluators must be certain to minimize any error in classification.
Thus, a test with a reliability of .90 might not be good enough; r should be greater than .96.
INTERPRETATION OF RELIABILITY INFORMATION
FROM TEST MANUALS AND REVIEWS

Test manuals and independent reviews of tests provide information on test reliability.
The reliability of a test is indicated by the reliability
coefficient. It is denoted by the letter "r," and is
expressed as a number ranging between 0 and
1.00, with r = 0 indicating no reliability, and r = 1.00
indicating perfect reliability. *Do not expect to find
a test with perfect reliability.
INTERPRETATION OF RELIABILITY
INFORMATION FROM TEST MANUALS AND
REVIEWS
Generally, you will see the reliability of a test as a
decimal, for example, r = .80 or r = .93. The larger
the reliability coefficient, the more repeatable or
reliable the test scores.
Table 1 serves as a general guideline for
interpreting test reliability. However, do not
select or reject a test solely based on the size of
its reliability coefficient. To evaluate a test's
reliability, you should consider the type of test,
the type of reliability estimate reported, and the
context in which the test will be used.
THE PURPOSE OF THE RELIABILITY COEFFICIENT AND THE NATURE OF THE TEST

Considerations concerning the purpose and use of a reliability coefficient include whether:
1. the test items are homogeneous or heterogeneous in nature;
2. the characteristic, ability, or trait being measured is presumed to be dynamic or static;
3. the range of test scores is or is not restricted;
4. the test is a speed or a power test; and
5. the test is or is not criterion-referenced.
