
RELIABILITY

History of reliability

• Psychology owes the advanced development of reliability assessment to the early work of the British psychologist Charles Spearman.
• In 1733, Abraham De Moivre introduced the basic notion of sampling error (Stanley, 1971); and in 1896, Karl Pearson developed the product moment correlation (see Chapter 3 and Pearson, 1901).
• Reliability theory puts these two concepts together in the context of measurement.
• A contemporary of Pearson, Spearman actually worked out most of the basics of contemporary reliability theory and published his work in a 1904 article entitled “The Proof and Measurement of Association between Two Things.”
Test score theory

• CLASSICAL TEST SCORE THEORY assumes that each person has a true score that would be obtained if there were no errors in measurement. However, because measuring instruments are imperfect, the score observed for each person almost always differs from the person’s true ability or characteristic.
• Classical test theory uses the standard deviation of errors as the basic measure of error. Usually this is called the standard error of measurement.
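
In the usual notation (the symbols here are the conventional ones, not taken from this handout), the model and its error index can be written as:

X = T + E
SEM = \sigma_X \sqrt{1 - r_{xx}}

where X is the observed score, T the true score, E the random measurement error, \sigma_X the standard deviation of the observed scores, and r_{xx} the reliability of the test.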
DOMAIN SAMPLING MODEL

is another central concept in classical test theory. This model considers the problems created by using a limited number of items to represent a larger and more complicated construct.

For example, suppose we want to evaluate your spelling ability. The best technique would be to go systematically through a dictionary, have you spell each word, and then determine the percentage you spelled correctly. Since time is limited, however, we use only a sample of words.
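
As a rough sketch of the domain sampling idea (the word list and checker function are hypothetical; assumes only the Python standard library):

```python
import random

def estimate_spelling_ability(dictionary_words, spelled_correctly, sample_size=50):
    """Estimate the share of the whole domain (the dictionary) a person can spell
    by testing only a random sample of words, as in the domain sampling model."""
    sample = random.sample(dictionary_words, sample_size)
    correct = sum(1 for word in sample if spelled_correctly(word))
    return correct / sample_size  # larger samples track the "true" domain score more closely
```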
Test score theory

• ITEM RESPONSE THEORY

CTT assumes that all items in an assessment instrument make an equal contribution to the performance of students. IRT, in contrast, takes into account the fact that some items are more difficult than others.
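
One common way to express this is the one-parameter logistic (Rasch) model; the formula below is a standard statement of that model, not something given in this handout:

P(X_{ij} = 1 \mid \theta_i) = \frac{e^{\theta_i - b_j}}{1 + e^{\theta_i - b_j}}

where \theta_i is person i's ability and b_j is the difficulty of item j, so harder items (larger b_j) are less likely to be answered correctly at any given ability level.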
3 WAYS TO ESTIMATE RELIABILITY

• TEST-RETEST
• PARALLEL FORMS
• INTERNAL CONSISTENCY
Time Sampling: Test-Retest Method

• a measure of reliability obtained by administering the same test twice over a period
of time to a group of individuals. The scores from Time 1 and Time 2 can then be
correlated in order to evaluate the test for stability over time.
• A target interval of 2 weeks is the most frequently recommended.
• Should be used only for static (stable) traits.
• One thing you should always consider is the possibility of a carryover effect. This
effect occurs when the first testing session influences scores from the second
session. For example, test takers sometimes remember their answers from the first
time they took the test.
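
A minimal sketch of the computation, assuming scipy is installed and using made-up scores:

```python
from scipy.stats import pearsonr

# Hypothetical scores for the same five people at Time 1 and Time 2
time1 = [12, 15, 9, 20, 17]
time2 = [13, 14, 10, 19, 18]

r, p_value = pearsonr(time1, time2)  # r estimates the test-retest (stability) reliability
print(f"test-retest reliability r = {r:.2f}")
```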
Item Sampling: Parallel Forms Method

• Compares two equivalent forms of a test that measure the same attribute. The two forms use different items; however, the rules used to select items of a particular difficulty level are the same.
• Also known as Equivalent Forms Reliability.
• Sometimes the two forms are administered to the same group of people on the same day.
• Scores on the two forms are correlated using Pearson’s r.
Split-half Method

• A test is given and divided into halves that are scored separately. The results of one half of the test are then compared with the results of the other half. The two halves of the test can be created in a variety of ways.
• Commonly analyzed with the Spearman-Brown formula and KR20.
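
Because each half is only half as long as the full test, the half-test correlation is usually stepped up with the Spearman-Brown formula (standard form, not quoted from this handout):

r_{full} = \frac{2 r_{half}}{1 + r_{half}}

For example, a correlation of .70 between the halves gives an estimated full-test reliability of 2(.70)/(1 + .70) ≈ .82.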
KR20 FORMULA

• The formula for calculating the reliability of a test in which the items are dichotomous, scored 0 or 1 (usually for right or wrong), is known as the Kuder-Richardson 20, KR20, or KR-20. The formula came to be labeled this way because it was the 20th formula presented in the famous article by Kuder and Richardson.
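
In its standard form (conventional symbols, not reproduced from the handout), the formula is:

KR_{20} = \frac{k}{k - 1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma_X^2}\right)

where k is the number of items, p_j is the proportion of examinees answering item j correctly, q_j = 1 - p_j, and \sigma_X^2 is the variance of the total test scores.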
CRONBACH ALPHA

• Cronbach’s alpha is a measure of internal consistency, that is, how closely related a set of items are as a group. It assesses reliability by comparing the amount of shared variance, or covariance, among the items making up an instrument to the amount of overall variance.
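
A minimal sketch of that comparison (hypothetical item scores; assumes numpy):

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array-like, rows = respondents, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 4-item scale answered by 5 people
scores = [[3, 4, 3, 4],
          [2, 2, 3, 2],
          [4, 5, 4, 5],
          [3, 3, 3, 4],
          [1, 2, 2, 1]]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```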
COVARIANCE

• Measures the direction of the relationship between two variables. A positive covariance means that both variables tend to be high or low at the same time. A negative covariance means that when one variable is high, the other tends to be low.
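
A quick numerical illustration with made-up data (numpy's sample covariance):

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]          # tends to rise as x rises

cov_xy = np.cov(x, y)[0, 1]  # off-diagonal entry of the 2x2 covariance matrix
print(cov_xy)                # positive: the two variables move together
```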
INTER-RATER RELIABILITY

• Inter-rater reliability is the degree of agreement among independent observers who rate, code, or assess the same phenomenon. High inter-rater reliability values indicate a high degree of agreement between two examiners; low inter-rater reliability values indicate a low degree of agreement.

• Can be computed using kappa statistics.


KAPPA STATISTICS
• COHEN'S KAPPA - assumes the same two raters have rated a set of items.
• FLEISS' KAPPA - a metric used to measure agreement when there are more than two raters.
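
A minimal sketch of Cohen's kappa for two raters, computed directly from its standard definition κ = (p_o − p_e) / (1 − p_e), with made-up ratings:

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n            # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(c1) | set(c2))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pass/fail codes from two examiners rating the same six candidates
r1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
r2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(f"Cohen's kappa = {cohen_kappa(r1, r2):.2f}")
```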
