
Measurement

Definition of Measurement
Assigning numbers to objects or entities
according to some logical or systematic rule.
Measurement processes are fundamental to all
sciences.

Examples:
Self-report: Depression, personality
characteristics, intelligence.
Physiological indices: Heart rate, blood
pressure, cortisol levels, IL-6, A1C levels
Observational assessments: Parent-Child &
Couple warmth and hostility

Qualitative Research
Some would argue that qualitative
research based on observations of the
researcher (e.g., participant observation)
does not involve measurement.
Note that implicit measurement processes
are occurring, based on how the
investigator characterizes the entity being
assessed (e.g., categorical judgments).
Issues of reliability (e.g., repeatability
across observers or coders) and validity
(e.g., bias) of those characterizations of
the entity arise.

Psychometric Theory
Assumption is that we are attempting to assess a
concept or construct that is not directly
observable, but that we can only indirectly assess
via measurement procedures (i.e., latent variable).
Example: Each of you is asked to rate the
extraversion of a job candidate after watching the
videotape of the individual interacting with others.
Assume that each of you would not come up with the
same score.
The ratings should form a normally distributed set of scores
around the average value.
Issue: What would be the most accurate estimate of his
or her extraversion?

Distribution of Scores
[Figure: histogram of the extraversion ratings; x-axis "Extraversion," running from 17.60 to 18.30]

One Participant's Responses to 20 Loneliness Items

Classic Measurement Equation

x = t + e
x = measured variable
t = true score
e = random error

[Path diagram: a latent Loneliness factor loading on Item 1, Item 2, and Item 3, with each item also influenced by its own error term (Error 1, Error 2, Error 3)]
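A minimal sketch (not from the course materials) of the classic measurement equation: hypothetical true scores and random errors are simulated, and the reliability is recovered as the share of observed variance that is due to t.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                  # hypothetical sample size
t = rng.normal(loc=50, scale=10, size=n)    # true scores
e = rng.normal(loc=0, scale=5, size=n)      # random error, uncorrelated with t
x = t + e                                   # observed scores

reliability = t.var() / x.var()             # var(t) / [var(t) + var(e)]
print(round(reliability, 2))                # roughly 10**2 / (10**2 + 5**2) = 0.80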

Example: Social Support

A score on a social support measure may not accurately reflect the respondent's actual level of support.
Random error (e) may serve to bias scores up or down relative to the person's true level of support.
Scores may also be systematically biased up or down:
Positivity bias: Scores on the social support measure are higher than they should be.
Negativity bias: Scores on the social support measure are lower than they should be.

Adequacy of Measures
Reliability: Reflects the accuracy of a
measure
Represented by the variance in a
measure due to t, or the underlying
construct that is being measured
Formulas reflect the variance due to t
relative to the total variance in scores
on a measure
Validity: Does the measure assess the
construct that it was designed to measure.
Issue: What is t? Is it the construct
you had in mind?

Relationship Between Reliability and Validity

Can a measure be valid without being reliable?
The fact that a measure is reliable means that it is measuring something (i.e., there is some variance due to t).
Reliability is a necessary but not sufficient condition for validity.
First try to establish that a measure is reliable before discussing validity.

Reliability
Refers to the repeatability or consistency of a measure.
Issue first arose in astronomical measurement, where observers were found to differ from one another in their measurements of stars.
At a conceptual level, refers to the extent to which scores on your measure reflect t from the measurement equation. Error in measurement is assumed to be random (e).

Random Measurement Error

Assume that errors in measurement are randomly distributed across the assessments.
Implications:
Error is uncorrelated with t, or the level of the true score.
Errors are normally distributed around the object's true standing on t.

Systematic Error

Affects the validity of the measure, or the definition of t.
Response sets: General ways of responding to questions, no matter what the wording of the item.
Example: Acquiescence, or an inclination to agree irrespective of item wording.
What impact would this have on scores on a measure?
Need for balanced measures.

Reliability Example
Measure has a reliability of .80.
Implication: 80% of the variation in
scores on that measure is said to
reflect true differences between the
entities that are being assessed (e.g.,
individual differences in loneliness)
Remaining 20% of variance
represents random measurement
error.

Test-Retest Reliability

Reflects the stability of scores on a measure over time.
Assumption: True scores on the construct do not change over time. Examples: Personality traits, mood (Spielberger scale: state vs. trait anxiety). Therefore, any differences in scores are attributed to measurement error.
Loneliness example: A correlation of .60 was found between loneliness scores at the beginning and end of the first school year at UCLA.
Criticism: This was interpreted as indicating that scores on the loneliness scale were unreliable, but does loneliness (the underlying construct) actually change over that period of time? 75% were lonely at the beginning of the school year; this declined to 25% by the end of the school year.
Assessment: Simply correlate scores from multiple administrations of the measure to a group of subjects, separated by time; measure the association between scores on the test (see the sketch below).
Issue: Length of time and respondent memory for prior answers.
Variant: Administration of alternate forms of the measure; assume that the scales are assessing the same construct.
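A minimal sketch of the test-retest computation, using made-up scores for eight hypothetical respondents measured twice:

import numpy as np

time1 = np.array([34, 42, 50, 28, 61, 45, 39, 55])  # hypothetical scores, first administration
time2 = np.array([36, 40, 48, 30, 58, 47, 41, 52])  # same respondents, second administration

r_test_retest = np.corrcoef(time1, time2)[0, 1]     # test-retest reliability estimate
print(round(r_test_retest, 2))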

Internal Consistency:
This is a form of reliability that can be
evaluated when your measure employs
multiple items in assessing the construct.
Involves the consistency of the person's
responses across items.
Domain Sampling model: Assumption is
that the items you have developed for a
measure represent a random sample of
the content domain of the construct.
Example: Social support; select items that
reflect different types of support, such as
emotional support or tangible assistance.
Variation in responding to the items therefore
reflects errors in measurement.

Standardized Coefficient Alpha (α)
α = (k × r̄) / [1 + (k − 1) × r̄]
k = # of items
r̄ = average inter-item correlation

Raw Score Coefficient Alpha (α)
α = [k / (k − 1)] × [(σ²y − Σσ²xi) / σ²y]
k = # of items
σ²xi = variance of item i
σ²y = variance of the total score
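A minimal Python sketch (not the course's SPSS procedure) of both alpha formulas; the 4-item response matrix and the 10-item example values are made up for illustration.

import numpy as np

def alpha_standardized(k, mean_r):
    # Standardized alpha from the number of items and the average inter-item correlation.
    return (k * mean_r) / (1 + (k - 1) * mean_r)

def alpha_raw(items):
    # Raw-score alpha from an (n respondents x k items) data matrix.
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()   # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of the total score
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Made-up responses of 6 people to 4 items:
data = [[1, 2, 2, 1],
        [3, 3, 4, 3],
        [2, 2, 3, 2],
        [4, 4, 4, 3],
        [1, 1, 2, 2],
        [3, 4, 3, 4]]
print(round(alpha_raw(data), 2))
print(round(alpha_standardized(k=10, mean_r=0.30), 2))   # 10 items averaging r = .30 give about .81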

Derivations
Can increase reliability of a measure by:
Increasing the number of items, assuming that
the same level of correlation (or covariation)
among responses to the individual items is
maintained.
Increasing the correlation or covariation among
the items

Split-half reliability represents a special case of alpha.
Standardized and raw-score alpha are very similar for most scales; why?

UCLA Loneliness Scale Reliability
Scale consists of 20 items.
Items were originally derived from the statements used by lonely individuals to describe the experience.
No items use the word "lonely"; why?
Revision of the scale in 1980 added 10 non-lonely or positive items; why?
Revision of the scale in 1996 simplified the response format & item wording.
Scale: UCLA Loneliness Scale Items.pdf
Paper: ..\..\..\Measures\UCLA Loneliness Scale Version 3 Paper.pdf

SPSS Reliability Analysis:


lonely reversed.sav
Lonely 2014.sav

Loneliness Scale Example

Computation of Coefficient Alpha
Average r = .365
α = (20 × .365) / [1 + (19 × .365)] = 7.3 / 7.935 = .92

Reliability of a 5-item scale:
α = (5 × .365) / [1 + (4 × .365)] = 1.825 / 2.46 = .74
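A quick arithmetic check of the two alpha values above, using the standardized formula:

k20, k5, r = 20, 5, 0.365
print(round((k20 * r) / (1 + (k20 - 1) * r), 2))   # 0.92 for the full 20-item scale
print(round((k5 * r) / (1 + (k5 - 1) * r), 2))     # 0.74 for a 5-item version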

Example: Selection of the best 5 items
Choose based on corrected item-total correlations.
What items would you select?
Compute the reliability for both the original and hold-out groups.
Why are the values different?

Three-Item HRS Version

Items:
How often do you feel that you lack companionship? (2)
How often do you feel left out? (11)
How often do you feel isolated from others? (14)

Response format:
1. Hardly ever
2. Some of the time
3. Often

HRS Loneliness data: hrslone3.sav

Inter-Rater Agreement
Data are sometimes collected by raters or
coders, who evaluate the objects that are
being assessed and assign a number or
numbers.
Examples: Assessments of clinical depression,
coding of behavior during interactions.
Design issue: Rater drift.

Two types of reliability estimates can be computed depending on the scale of measurement (i.e., continuous or categorical) that is involved.

Creation of Random Samples

Employ the RV functions when computing a variable in SPSS.
Can use a variety of distributions to create random variables.
Employed the Bernoulli distribution here: a dichotomous distribution that takes on a value of 1.0 with probability p and a value of 0 with probability 1 − p.

SPSS syntax:
COMPUTE coder3 = RV.BERNOULLI(.2) .
EXECUTE .
Example: Sample Agreement 1.sav
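A rough Python equivalent of the SPSS step above, shown only for illustration; the seed, sample size, and variable name are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=1)
coder3 = rng.binomial(n=1, p=0.2, size=100)   # 100 hypothetical cases coded 1 with probability .2
print(coder3.mean())                          # should hover around 0.2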

Kappa Coefficient
Example: kappa.sav

κ = (po − pc) / (1 − pc)
po = observed % agreement
pc = chance % agreement

Example
% Agreement (po): 8/10 = .80
% Chance Agreement (pc):
pc = (.7 × .7) + (.3 × .3) = .58
Kappa (κ) = (po − pc) / (1 − pc)
= (.80 − .58) / (1 − .58) = .22 / .42
= .524
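The same kappa arithmetic as a short sketch:

p_o = 0.80                       # observed proportion of agreement (8 of 10 cases)
p_c = (0.7 * 0.7) + (0.3 * 0.3)  # chance agreement from the coders' marginal rates
kappa = (p_o - p_c) / (1 - p_c)
print(round(p_c, 2), round(kappa, 3))   # 0.58 and roughly 0.524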

Continuous Measures
The Intra-class correlation is
computed when you have data from
observers on continuous measures
Example: ISBR coding of Warmth &
Hostility in videotaped family
interactions
Ratings are made on several scales,
which are summed together
Issue: Raters may not agree although
their scores may be correlated

Example Rating Data
File = Rater1.sav
[Table: ratings of 10 targets by Rater 1 and Rater 2]

FACHS Rating Data
Example: icc 2015.sav
[Table: ratings of 10 targets by Rater 1 and Rater 2]

S & F (1979) Case I

Appropriate design when all raters have evaluated all cases.
Based on degree of consistency in ratings.

ICC(1) = (MSR − MSW) / [MSR + (k − 1) × MSW]
MSR = Between People mean square
MSW = Within People mean square
k = # of Judges

One-Way ANOVA Design

Appropriate design when you have raters randomly paired with reliability coders.
Focus is on absolute agreement between ratings (inter-changeability).

ICC(C,1) = (MSR − MSE) / [MSR + (k − 1) × MSE]
MSR = Between People mean square
MSE = Residual mean square
k = # of Judges

Two-Way ANOVA Design

Design would be used when a proportion of the coding is confirmed by a single reliability coder.
Focus is on absolute agreement.

ICC(A,1) = (MSR − MSE) / [MSR + (k − 1) × MSE + (k/n) × (MSC − MSE)]
MSR = Between People mean square
MSC = Between Measures (Judges) mean square
MSE = Residual mean square
k = # of Judges
n = # of Cases

Consistency Definition
ANOVA table:
Between People: Differences between the individuals being evaluated (MSB = Mean Square Between).
MSE: Residual variance; consists of Between Items (Judges or Raters) and Residual.
ICC = (4.683 − .683) / (4.683 + .683) = .745
Equivalent to the inter-item correlation.
Note that you are ignoring the variance due to differences between the judges or raters.

Absolute Agreement Definition
This design would be used when you have a single expert coder who checks a % of the interactions that have been coded.
Note that this expert is always the same individual for all coders.
ICC = (4.683 − .683) / {4.683 + .683 + [(2/6) × (80.083 − .683)]} = .126
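A minimal sketch that plugs the mean squares above into the two single-rater ICC formulas; k = 2 judges and n = 6 cases are taken from the worked example.

def icc_consistency(msr, mse, k):
    # Consistency definition: ignores variance due to differences between judges.
    return (msr - mse) / (msr + (k - 1) * mse)

def icc_agreement(msr, msc, mse, k, n):
    # Absolute-agreement definition: penalizes mean differences between judges.
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))

print(round(icc_consistency(4.683, 0.683, k=2), 3))             # about .745
print(round(icc_agreement(4.683, 80.083, 0.683, k=2, n=6), 3))  # about .126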

How do the two formulas differ? Why?

Other Reliability Issues

Dis-Attenuated Correlation

r′ = rxy / √(ρ1 × ρ2)
r′ = dis-attenuated correlation
rxy = observed correlation
ρ1, ρ2 = reliabilities of the two measures
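A minimal sketch of the correction for attenuation; the observed correlation and the two reliabilities are hypothetical values chosen for illustration.

import math

r_observed = 0.40          # hypothetical observed correlation between two measures
rel_1, rel_2 = 0.70, 0.80  # hypothetical reliabilities of those measures
r_disattenuated = r_observed / math.sqrt(rel_1 * rel_2)
print(round(r_disattenuated, 2))   # about 0.53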

Standard Error of an Individual Score

SEx = σx × √(1 − ρxx)
σx = standard deviation of the measure
ρxx = reliability of the measure

Reliability of a Linear Composite

Example: SPS.sav
Scale: ..\..\..\Measures\Social Provisions Scale chapter.pdf

Rely = 1 − {[Σσ²i − Σ(ρi × σ²i)] / σ²y}
σ²i = variance of measure i
ρi = reliability of measure i
σ²y = variance of the total score

Computation of Reliability
Total and reliable variance of the composite's component variables:
Sum of component variances = 27.13
Reliable variance = 19.04
Error variance = 8.09

Total score variance = 97.06
Error variance / total score variance = .08
1 − .08 = .92
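A short check of the composite-reliability arithmetic above:

sum_item_variance = 27.13      # sum of the component measures' variances
reliable_variance = 19.04      # sum of reliability_i * variance_i across the components
total_score_variance = 97.06   # variance of the total (composite) score

error_variance = sum_item_variance - reliable_variance      # 8.09
reliability_y = 1 - error_variance / total_score_variance
print(round(error_variance, 2), round(reliability_y, 2))    # 8.09 and about 0.92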

Reliability of a Difference Score

Compute the difference between two scores.
Linn County study: Change in depression among the elderly over a 6-month period.
The two measures of depression are positively correlated: r = .48.
Question: Implication for the reliability of the difference score?

Linn County Depression Data (LinnDep.sav)

Reliability & Variances


Reliability

Variance

Reliable
Variance

Time 1

.72

25.15

18.11

7.04

Time 2

.76

28.83

21.91

6.92

Measure

Difference
Score

27.75

Sum Score

79.53

Error
Variance

Computation
Time 1 & Time 2 Measures Error
Variance:
Error = 7.04 + 6.92 = 13.96

% Error Variance:
Difference Score = 13.96/27.75 = .50
Sum Score = 13.96/79.53 = .18

Reliability:
Difference Score = 1 - .50 = .50
Sum Score = 1 - .18 = .82
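The same computation as a short sketch, using the variances from the table above:

error_t1, error_t2 = 7.04, 6.92          # error variances of the two depression measures
var_difference, var_sum = 27.75, 79.53   # variances of the difference and sum scores

error_total = error_t1 + error_t2        # 13.96; the two error terms are assumed independent
rel_difference = 1 - error_total / var_difference
rel_sum = 1 - error_total / var_sum
print(round(rel_difference, 2), round(rel_sum, 2))   # about .50 and .82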

Validity
Definition: Does the measure assess
the construct it was designed to
assess.
Issue: What is t?
What other construct or constructs
affect scores on the measure?

Systematic Error: Non-random effects of other variables on scores on the measure.
Also termed confounding.

Social Desirability
Desire on the part of some
individuals to appear in a positive
light.
Most constructs we want to assess have
a positive and a negative endpoint
Some individuals may "fake good" on the measure.

Logic is that an individual who agrees a lot with such statements is probably not being truthful on other measures.
Marlowe-Crowne SD Scale
Measures individual differences in the
tendency to appear in a positive light.
Example items:
"I am always ready to admit it when I make a mistake."
"I always try to practice what I preach."

Impact on measures:
Examine correlation between scores on
SD measure and other measures
Example: Taylor Manifest Anxiety Scale

Other Types of Systematic Error
Dispositional Negativity: Concept
developed by Watson & Clark (1984),
involves a general tendency to evaluate
things negatively. Indicators: Neuroticism,
reports of negative emotions.
Watson & Pennebaker (1987): Describe
how negativity affects responses to
measures of stress, other psychosocial
variables. Found that if you controlled for
dispositional negativity, the correlation
between stressful life events and
depression became non-significant.
Implication?

Impact of Negativity on Spouse Ratings

Predictor                     Father Hostility   Father Warmth   Mother Hostility   Mother Warmth
Observed Behavior             .30*               .25*            .31*               .26*
Spouse Negative Affectivity   .26*               -.25*           .22*               -.10*
R2                            .16*               .14*            .15*               .09*

Halo Effects
Bias due to overall positive or
negative feelings about the
individual; may distort ratings of
performance or other characteristics.
BARS measures: Behaviorally
Anchored Rating Scales; simply
identify which behaviors occur.
Appears to overcome halo effects.

Loneliness Scale Reliability Example

Short Version of the UCLA Loneliness Scale:
..\..\..\Measures\UCLA_Loneliness_10Items.pdf

Iowa Family Survey:
Telephone survey of adults in Iowa conducted in 2005.
Oversampled adults over 65 years of age.
Included the 10-item loneliness scale.

Data:
..\..\..\Loneliness\Iowa Family Survey 2005\Iowa Family 2005 Survey Data April 07.sa

Types of Validity
Content Validity
Does the measure adequately represent
the meaning of the construct?

Criterion Validity
Do scores on the measure predict
criteria that reflect the construct?

Construct Validity
Are the results based on the measure
consistent w/ theoretical predictions?

Content Validity I
Issue: Does the measure adequately represent the content domain of the construct that is being assessed?
Examples:
Patient satisfaction: Items addressing competence of medical staff, adequacy of facilities, interpersonal skills of staff.
Job satisfaction: Evaluation of various aspects of the job (e.g., work environment, pay, supervision).
Shyness: Behavioral reactions to others, physiological responses (sweating), affect (anxiety).

Theoretical conceptualization of the construct:
The definition of the construct dictates the content domain of the test.
Social Provisions Scale example.
Unidimensional vs. multidimensional measures of job satisfaction, social support, loneliness.

Content Validity II
Validation: Based on expert judgment regarding
content of a measure or test.
Power vs. Additive approach: Narrow vs. broad
conceptualization of the construct; has
implications for item content, test (factor)
structure.
Disguised tests: Content validity is irrelevant.
Examples: Rorschach, TAT, MMPI (Item: "I attend church
regularly"; an indicator of schizophrenia).
Psychoanalytic conceptualization: Impact of defense
mechanisms on responding to structured tests;
"objective" vs. projective assessments. Example: TAT
assessment.

Criterion Validity

Definition: Demonstrating that scores on your measure are associated with other methods of assessing the same construct.
Loneliness: Scores on the UCLA scale correlated .71 with scores on a measure based on ratings on 7 loneliness scales.
[Diagram: UCLA Loneliness Score related to Loneliness Rating]

Paranormal Belief Scale (PBS)

Paranormal experiences:
73% reported such experiences.
Those reporting experiences received higher scores on the PBS.

Performance on "psychokinesis" task:
Instructed to move the clip with their mind.
String with a paper clip attached was held over a set of concentric circles; assessed how far the paper clip moved.
Correlated .40 with PBS scores.

Horoscope and personality:
Provided a personality description based on their sign.
All received the same randomly constructed personality description.
Rating of accuracy correlated .41 with PBS scores.

Known Groups Analysis

Identifying groups in the population that should differ on the construct being assessed, and demonstrating that they receive different scores on your measure.
Loneliness example: Members of loneliness groups received much higher scores on the loneliness scale (e.g., mean of 60 vs. 40; a 2 standard deviation difference).
Note the problem of discriminant validity (i.e., the groups could differ on other constructs as well, such as self-esteem or depression).

Functional Ability Scale Analysis

Impact of Method Variance

Imagine that you were validating a measure of shyness that you had developed. You demonstrate that the measure is associated with scores on another self-report measure of shyness and with a behavioral indicator of shyness based on observations of interpersonal behavior.
Question: Which form of validity is more convincing? Why?

Reflects a belief in the effect of common method variance: that by assessing a variable using the same method of assessment, the correlation between the measures should be enhanced.

Concurrent vs. Predictive Criterion Validity
Example: GRE scores and graduate GPA following admission.
Problem: Restriction of range due to the selection process lowers the correlation.

Loneliness:
Found to predict subsequent nursing home admission, mortality among the elderly, and post-partum depression.

Discriminant Validation
Issue: Demonstrating that your measure
assesses a construct that is different or distinct
from measures of related constructs.
Loneliness scale: Addressed this issue by
conducting a regression analysis wherein we used
measures of other constructs to predict scores on
the loneliness scale.
Scores correlated .71 w/ Loneliness Index.
Loneliness scores were strongly related to measures of
depression (.51), extraversion (-.46), self-esteem (-.49). In
combination, these other variables explained 43% of the
variance in loneliness scores.
After controlling for these other variables, scores on a measure
termed the Loneliness Index accounted for an additional 18%
of the variance in loneliness scores.
Scores on the loneliness scale remained related to time alone
(partial r = .27), number of times eat dinner alone on a Friday
or Saturday night (partial r = .31) and number of friends
(partial r = -.27).

Discriminant Validation via Confirmatory Factor Analysis
Lack of discriminant validity is reflected by the factor structure.
One factor vs. multiple factors.
Compare the fit of alternative CFA models.

Relationships w/ other variables:
Examine the pattern of relationships.
Identical constructs = identical relationships.
Test the equality of relationships.

Correlations Among the Factors

                            Job Satisfaction   Job Involvement   Organizational Commitment
Job Satisfaction            1.00
Job Involvement             .59                1.00
Organizational Commitment   .55                .55               1.00

Comparison of Model Fit

Three-Factor Model: χ2 (24) = 58.25
One-Factor Model: χ2 (27) = 1050.01
Difference: χ2 (3) = 991.76
Results indicate that the three-factor model fits the data much better than the one-factor model.
Indicates that the three constructs are distinct from one another.
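A minimal sketch (assuming SciPy is available) of the nested-model chi-square difference test behind this comparison; the p-value makes explicit how strongly the three-factor model is preferred.

from scipy.stats import chi2

chi_diff = 1050.01 - 58.25             # 991.76
df_diff = 27 - 24                      # 3
p_value = chi2.sf(chi_diff, df_diff)   # upper-tail probability of the difference
print(round(chi_diff, 2), df_diff, p_value)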

Comparison of Model Fit

Different Correlations: χ2 (256) = 594.29
Identical Correlations: χ2 (270) = 824.61
Difference: χ2 (14) = 230.32
Results indicate that the three factors are related differently to the other job-related variables.
Further indicates that the three constructs are distinct from one another.

Multimethod-Multitrait Analysis

Developed by Campbell and Fiske. Goal is to demonstrate that your measure is more highly related to alternative methods of assessing the same construct (termed convergent validity; this is identical to criterion validity as I have defined it) than to measures of other constructs that employ the same method of assessment (termed divergent validity).
Example: Loneliness vs. shyness, as assessed using self-report, roommate report, and behavioral measures.
Heteromethod-monotrait correlations reflect construct variance in the measure, whereas monomethod-heterotrait correlations reflect method variance.
Problem: Assumes that traits are truly uncorrelated.

Question: What does the monomethod-heterotrait correlation represent?
Issue: What is the meaning of "different methods of assessment"?

Example Results

       LSR     LRR     LBM     SSR     SRR     SBM
LSR    1.00
LRR    (.70)   1.00
LBM    (.70)   (.70)   1.00
SSR    [.50]   {.20}   {.20}   1.00
SRR    {.20}   [.50]   {.20}   (.70)   1.00
SBM    {.20}   {.20}   [.50]   (.70)   (.70)   1.00

L = loneliness, S = shyness; SR = self-report, RR = roommate report, BM = behavioral measure.
( ) = heteromethod-monotrait; [ ] = monomethod-heterotrait; { } = heteromethod-heterotrait.

Theory Verification
Issue: Are results based on your measure
consistent with theoretical models involving the
construct? Considered the highest form of
validity, wherein you demonstrate that scores on
your measure relate to measures of other
constructs as you would expect, given theoretical
models.
Problem: What if your empirical evidence is
negative? Is the problem with the theory or the
measure? Have to rely on well-developed and
accepted theory.
Contrast with criterion validity: Here you are relating the measure to measures of other constructs, rather than to alternative measures of the same construct.

Paranormal Belief Example

Divided into Believers & Skeptics based on a median split of scores.
Read the abstract of a fictitious journal article reviewing a number of studies dealing with the existence of ESP.
ESP Proven or ESP Disproven conditions.

Predicted that participants would report emotional arousal & selective recall of the information based on their beliefs.

Results
Emotional arousal: Interaction between treatment condition & PBS scores
ESP Proven: r = -.31
ESP Disproven: r = .37

Recall:
Gave them a surprise recall test after they completed the emotional arousal measure
ESP Disproven: r = -.38
ESP Proven: r = .07

[Figure: % Correct Recall]

Conclusions
Validity of a measure is never "proven."
Development of a body of literature supporting the measure's validity.
Continuing evolution and improvement of measures (loneliness example: revisions over the years).
Makes you very popular: recently put information on my Web site, which has increased the requests for the measure.
Issue: Putting the scale on the Internet; problem with copyright.

Impact on the quality of research: Can greatly improve your ability to demonstrate relationships among variables.
Loneliness example: UCLA scale vs. self-labeling measures, replication of findings. Harry Reiss: Correlations were consistently .20 higher.
