
Measurement

Definition of Measurement
Assigning numbers to objects or entities
according to some logical or systematic rule.
Measurement processes are fundamental to all
sciences.

Examples:
Self-report: Depression, personality
characteristics, intelligence.
Physiological indices: Heart rate, blood
pressure, cortisol levels, IL-6, A1C levels
Observational assessments: Parent-Child &
Couple warmth and hostility

Qualitative Research
Some would argue that qualitative
research based on observations of the
researcher (e.g., participant observation)
does not involve measurement.
Note that implicit measurement processes
are occurring, based on how the
investigator characterizes the entity being
assessed (e.g., categorical judgments).
Issues of reliability (e.g., repeatability
across observers or coders) and validity
(e.g., bias) of those characterizations of
the entity arise.

Psychometric Theory
Assumption is that we are attempting to assess a
concept or construct that is not directly
observable, but that we can only indirectly assess
via measurement procedures (i.e., latent variable).
Example: Each of you is asked to rate the
extraversion of a job candidate after watching the
videotape of the individual interacting with others.
Assume that each of you would not come up with the
same score.
The ratings should form a normally distributed set of scores
around the average value.
Issue: What would be the most accurate estimate of his
or her extraversion?

Distribution of Scores
[Figure: histogram of the extraversion ratings; x-axis "Extraversion," running from 17.60 to 18.30]

One Participant's Responses to 20 Loneliness Items

Classic Measurement Equation

x = t + e
x = measured variable
t = true score
e = random error

[Path diagram: a latent Loneliness factor loading on Item 1, Item 2, and Item 3, with each item also influenced by its own error term (Error 1, Error 2, Error 3)]
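A minimal sketch (not from the course materials) of the classic measurement equation: hypothetical true scores and random errors are simulated, and the reliability is recovered as the share of observed variance that is due to t.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                  # hypothetical sample size
t = rng.normal(loc=50, scale=10, size=n)    # true scores
e = rng.normal(loc=0, scale=5, size=n)      # random error, uncorrelated with t
x = t + e                                   # observed scores

reliability = t.var() / x.var()             # var(t) / [var(t) + var(e)]
print(round(reliability, 2))                # roughly 10**2 / (10**2 + 5**2) = 0.80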

Example: Social Support

A score on a social support measure may not accurately reflect the respondent's actual level of support.
Random error (e) may serve to bias scores up or down relative to the person's true level of support.
Scores may also be systematically biased up or down:
Positivity bias: Scores on the social support measure are higher than they should be.
Negativity bias: Scores on the social support measure are lower than they should be.

Adequacy of Measures
Reliability: Reflects the accuracy of a
measure
Represented by the variance in a
measure due to t, or the underlying
construct that is being measured
Formulas reflect the variance due to t
relative to the total variance in scores
on a measure
Validity: Does the measure assess the
construct that it was designed to measure.
Issue: What is t? Is it the construct
you had in mind?

Relationship Between Reliability and Validity

Can a measure be valid without being reliable?
The fact that a measure is reliable means that it is measuring something (i.e., there is some variance due to t).
Reliability is a necessary but not sufficient condition for validity.
First try to establish that a measure is reliable before discussing validity.

Reliability
Refers to the repeatability or consistency of a measure.
Issue first arose in astronomical measurement, where observers were found to differ from one another in their measurements of stars.
At a conceptual level, refers to the extent to which scores on your measure reflect t from the measurement equation. Error in measurement is assumed to be random (e).

Random Measurement Error

Assume that errors in measurement are randomly distributed across the assessments.
Implications:
Error is uncorrelated with t, or the level of the true score.
Errors are normally distributed around the object's true standing on t.

Systematic Error

Affects the validity of the measure, or the definition of t.
Response sets: General ways of responding to questions, no matter what the wording of the item.
Example: Acquiescence, or an inclination to agree irrespective of item wording.
What impact would this have on scores on a measure?
Need for balanced measures.

Reliability Example
Measure has a reliability of .80.
Implication: 80% of the variation in
scores on that measure is said to
reflect true differences between the
entities that are being assessed (e.g.,
individual differences in loneliness)
Remaining 20% of variance
represents random measurement
error.

Test-Retest Reliability

Reflects the stability of scores on a measure over time.
Assumption: True scores on the construct do not change over time. Examples: Personality traits, mood (Spielberger scale: state vs. trait anxiety). Therefore, any differences in scores are attributed to measurement error.
Loneliness example: A correlation of .60 was found between loneliness scores at the beginning and end of the first school year at UCLA.
Criticism: This was interpreted as indicating that scores on the loneliness scale were unreliable, but does loneliness (the underlying construct) actually change over that period of time? 75% were lonely at the beginning of the school year; this declined to 25% by the end of the school year.
Assessment: Simply correlate scores from multiple administrations of the measure to a group of subjects, separated by time; measure the association between scores on the test (see the sketch below).
Issue: Length of time and respondent memory for prior answers.
Variant: Administration of alternate forms of the measure; assume that the scales are assessing the same construct.
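A minimal sketch of the test-retest computation, using made-up scores for eight hypothetical respondents measured twice:

import numpy as np

time1 = np.array([34, 42, 50, 28, 61, 45, 39, 55])  # hypothetical scores, first administration
time2 = np.array([36, 40, 48, 30, 58, 47, 41, 52])  # same respondents, second administration

r_test_retest = np.corrcoef(time1, time2)[0, 1]     # test-retest reliability estimate
print(round(r_test_retest, 2))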

Internal Consistency:
This is a form of reliability that can be
evaluated when your measure employs
multiple items in assessing the construct.
Involves the consistency of the person's
responses across items.
Domain Sampling model: Assumption is
that the items you have developed for a
measure represent a random sample of
the content domain of the construct.
Example: Social support; select items that
reflect different types of support, such as
emotional support or tangible assistance.
Variation in responding to the items therefore
reflects errors in measurement.

Standardized Coefficient Alpha (α)
α = (k × r̄) / [1 + (k − 1) × r̄]
k = # of items
r̄ = average inter-item correlation

Raw Score Coefficient Alpha (α)
α = [k / (k − 1)] × [(σ²y − Σσ²xi) / σ²y]
k = # of items
σ²xi = variance of item i
σ²y = variance of the total score
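A minimal Python sketch (not the course's SPSS procedure) of both alpha formulas; the 4-item response matrix and the 10-item example values are made up for illustration.

import numpy as np

def alpha_standardized(k, mean_r):
    # Standardized alpha from the number of items and the average inter-item correlation.
    return (k * mean_r) / (1 + (k - 1) * mean_r)

def alpha_raw(items):
    # Raw-score alpha from an (n respondents x k items) data matrix.
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()   # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of the total score
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Made-up responses of 6 people to 4 items:
data = [[1, 2, 2, 1],
        [3, 3, 4, 3],
        [2, 2, 3, 2],
        [4, 4, 4, 3],
        [1, 1, 2, 2],
        [3, 4, 3, 4]]
print(round(alpha_raw(data), 2))
print(round(alpha_standardized(k=10, mean_r=0.30), 2))   # 10 items averaging r = .30 give about .81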

Derivations
Can increase reliability of a measure by:
Increasing the number of items, assuming that
the same level of correlation (or covariation)
among responses to the individual items is
maintained.
Increasing the correlation or covariation among
the items

Split-half reliability represents a special case of alpha.
Standardized and raw-score alpha are very similar for most scales; why?

UCLA Loneliness Scale Reliability
Scale consists of 20 items.
Items were originally derived from the statements used by lonely individuals to describe the experience.
No items use the word "lonely"; why?
Revision of the scale in 1980 added 10 non-lonely or positive items; why?
Revision of the scale in 1996 simplified the response format & item wording.
Scale: UCLA Loneliness Scale Items.pdf
Paper: ..\..\..\Measures\UCLA Loneliness Scale Version 3 Paper.pdf

SPSS Reliability Analysis:


lonely reversed.sav
Lonely 2014.sav

Loneliness Scale Example

Computation of Coefficient Alpha
Average r = .365
α = (20 × .365) / [1 + (19 × .365)] = 7.3 / 7.935 = .92

Reliability of a 5-item scale:
α = (5 × .365) / [1 + (4 × .365)] = 1.825 / 2.46 = .74
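A quick arithmetic check of the two alpha values above, using the standardized formula:

k20, k5, r = 20, 5, 0.365
print(round((k20 * r) / (1 + (k20 - 1) * r), 2))   # 0.92 for the full 20-item scale
print(round((k5 * r) / (1 + (k5 - 1) * r), 2))     # 0.74 for a 5-item version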

Example: Selection of the best 5 items
Choose based on corrected item-total correlations.
What items would you select?
Compute the reliability for both the original and hold-out groups.
Why are the values different?

Three-Item HRS Version

Items:
How often do you feel that you lack companionship? (2)
How often do you feel left out? (11)
How often do you feel isolated from others? (14)

Response format:
1. Hardly ever
2. Some of the time
3. Often

HRS Loneliness data: hrslone3.sav

Inter-Rater Agreement
Data are sometimes collected by raters or
coders, who evaluate the objects that are
being assessed and assign a number or
numbers.
Examples: Assessments of clinical depression,
coding of behavior during interactions.
Design issue: Rater drift.

Two types of reliability estimates can be computed depending on the scale of measurement (i.e., continuous or categorical) that is involved.

Creation of Random Samples

Employ the RV functions when computing a variable in SPSS.
Can use a variety of distributions to create random variables.
Employed the Bernoulli distribution here: a dichotomous distribution that takes on a value of 1.0 with probability p and a value of 0 with probability 1 − p.

SPSS syntax:
COMPUTE coder3 = RV.BERNOULLI(.2) .
EXECUTE .
Example: Sample Agreement 1.sav
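A rough Python equivalent of the SPSS step above, shown only for illustration; the seed, sample size, and variable name are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=1)
coder3 = rng.binomial(n=1, p=0.2, size=100)   # 100 hypothetical cases coded 1 with probability .2
print(coder3.mean())                          # should hover around 0.2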

Kappa Coefficient
Example: kappa.sav

κ = (po − pc) / (1 − pc)
po = observed % agreement
pc = chance % agreement

Example
% Agreement (po): 8/10 = .80
% Chance Agreement (pc):
pc = (.7 × .7) + (.3 × .3) = .58
Kappa (κ) = (po − pc) / (1 − pc)
= (.80 − .58) / (1 − .58) = .22 / .42
= .524
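The same kappa arithmetic as a short sketch:

p_o = 0.80                       # observed proportion of agreement (8 of 10 cases)
p_c = (0.7 * 0.7) + (0.3 * 0.3)  # chance agreement from the coders' marginal rates
kappa = (p_o - p_c) / (1 - p_c)
print(round(p_c, 2), round(kappa, 3))   # 0.58 and roughly 0.524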

Continuous Measures
The Intra-class correlation is
computed when you have data from
observers on continuous measures
Example: ISBR coding of Warmth &
Hostility in videotaped family
interactions
Ratings are made on several scales,
which are summed together
Issue: Raters may not agree although
their scores may be correlated

Example Rating Data
File = Rater1.sav
[Table: ratings of 10 targets by Rater 1 and Rater 2]

FACHS Rating Data
Example: icc 2015.sav
[Table: ratings of 10 targets by Rater 1 and Rater 2]

S & F (1979) Case I

Appropriate design when all raters have evaluated all cases.
Based on degree of consistency in ratings.

ICC(1) = (MSR − MSW) / [MSR + (k − 1) × MSW]
MSR = Between People mean square
MSW = Within People mean square
k = # of Judges

One-Way ANOVA Design

Appropriate design when you have raters randomly paired with reliability coders.
Focus is on absolute agreement between ratings (inter-changeability).

ICC(C,1) = (MSR − MSE) / [MSR + (k − 1) × MSE]
MSR = Between People mean square
MSE = Residual mean square
k = # of Judges

Two-Way ANOVA Design

Design would be used when a proportion of the coding is confirmed by a single reliability coder.
Focus is on absolute agreement.

ICC(A,1) = (MSR − MSE) / [MSR + (k − 1) × MSE + (k/n) × (MSC − MSE)]
MSR = Between People mean square
MSC = Between Measures (Judges) mean square
MSE = Residual mean square
k = # of Judges
n = # of Cases

Consistency Definition
ANOVA table:
Between People: Differences between the individuals being evaluated (MSB = Mean Square Between).
MSE: Residual variance; consists of Between Items (Judges or Raters) and Residual.
ICC = (4.683 − .683) / (4.683 + .683) = .745
Equivalent to the inter-item correlation.
Note that you are ignoring the variance due to differences between the judges or raters.

Absolute Agreement Definition
This design would be used when you have a single expert coder who checks a % of the interactions that have been coded.
Note that this expert is always the same individual for all coders.
ICC = (4.683 − .683) / {4.683 + .683 + [(2/6) × (80.083 − .683)]} = .126
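A minimal sketch that plugs the mean squares above into the two single-rater ICC formulas; k = 2 judges and n = 6 cases are taken from the worked example.

def icc_consistency(msr, mse, k):
    # Consistency definition: ignores variance due to differences between judges.
    return (msr - mse) / (msr + (k - 1) * mse)

def icc_agreement(msr, msc, mse, k, n):
    # Absolute-agreement definition: penalizes mean differences between judges.
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))

print(round(icc_consistency(4.683, 0.683, k=2), 3))             # about .745
print(round(icc_agreement(4.683, 80.083, 0.683, k=2, n=6), 3))  # about .126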

How do the two formulas differ? Why?

Other Reliability Issues

Dis-Attenuated Correlation

r′ = rxy / √(ρ1 × ρ2)
r′ = dis-attenuated correlation
rxy = observed correlation
ρ1, ρ2 = reliabilities of the two measures
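A minimal sketch of the correction for attenuation; the observed correlation and the two reliabilities are hypothetical values chosen for illustration.

import math

r_observed = 0.40          # hypothetical observed correlation between two measures
rel_1, rel_2 = 0.70, 0.80  # hypothetical reliabilities of those measures
r_disattenuated = r_observed / math.sqrt(rel_1 * rel_2)
print(round(r_disattenuated, 2))   # about 0.53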

Standard Error of an Individual Score

SEx = σx × √(1 − ρxx)
σx = standard deviation of the measure
ρxx = reliability of the measure

Reliability of a Linear Composite

Example: SPS.sav
Scale: ..\..\..\Measures\Social Provisions Scale chapter.pdf

Rely = 1 − {[Σσ²i − Σ(ρi × σ²i)] / σ²y}
σ²i = variance of measure i
ρi = reliability of measure i
σ²y = variance of the total score

Computation of Reliability
Total and reliable variance of the composite's component variables:
Sum of component variances = 27.13
Reliable variance = 19.04
Error variance = 8.09

Total score variance = 97.06
Error variance / total score variance = .08
1 − .08 = .92
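A short check of the composite-reliability arithmetic above:

sum_item_variance = 27.13      # sum of the component measures' variances
reliable_variance = 19.04      # sum of reliability_i * variance_i across the components
total_score_variance = 97.06   # variance of the total (composite) score

error_variance = sum_item_variance - reliable_variance      # 8.09
reliability_y = 1 - error_variance / total_score_variance
print(round(error_variance, 2), round(reliability_y, 2))    # 8.09 and about 0.92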

Reliability of a Difference Score

Compute the difference between two scores.
Linn County study: Change in depression among the elderly over a 6-month period.
The two measures of depression are positively correlated: r = .48.
Question: Implication for the reliability of the difference score?

Linn County Depression Data (LinnDep.sav)

Reliability & Variances


Reliability

Variance

Reliable
Variance

Time 1

.72

25.15

18.11

7.04

Time 2

.76

28.83

21.91

6.92

Measure

Difference
Score

27.75

Sum Score

79.53

Error
Variance

Computation
Time 1 & Time 2 Measures Error
Variance:
Error = 7.04 + 6.92 = 13.96

% Error Variance:
Difference Score = 13.96/27.75 = .50
Sum Score = 13.96/79.53 = .18

Reliability:
Difference Score = 1 - .50 = .50
Sum Score = 1 - .18 = .82
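The same computation as a short sketch, using the variances from the table above:

error_t1, error_t2 = 7.04, 6.92          # error variances of the two depression measures
var_difference, var_sum = 27.75, 79.53   # variances of the difference and sum scores

error_total = error_t1 + error_t2        # 13.96; the two error terms are assumed independent
rel_difference = 1 - error_total / var_difference
rel_sum = 1 - error_total / var_sum
print(round(rel_difference, 2), round(rel_sum, 2))   # about .50 and .82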

Validity
Definition: Does the measure assess
the construct it was designed to
assess.
Issue: What is t?
What other construct or constructs
affect scores on the measure?

Systematic Error: Non-random effects of other variables on scores on the measure.
Also termed confounding.

Social Desirability
Desire on the part of some
individuals to appear in a positive
light.
Most constructs we want to assess have
a positive and a negative endpoint
Some individuals may "fake good" on the measure.

Logic is that an individual who agrees a lot with such statements is probably not being truthful on other measures.
Marlowe-Crowne SD Scale
Measures individual differences in the
tendency to appear in a positive light.
Example items:
"I am always ready to admit it when I make a mistake."
"I always try to practice what I preach."

Impact on measures:
Examine correlation between scores on
SD measure and other measures
Example: Taylor Manifest Anxiety Scale

Other Types of Systematic Error
Dispositional Negativity: Concept
developed by Watson & Clark (1984),
involves a general tendency to evaluate
things negatively. Indicators: Neuroticism,
reports of negative emotions.
Watson & Pennebaker (1987): Describe
how negativity affects responses to
measures of stress, other psychosocial
variables. Found that if you controlled for
dispositional negativity, the correlation
between stressful life events and
depression became non-significant.
Implication?

Impact of Negativity on Spouse Ratings

Predictor                     Father Hostility   Father Warmth   Mother Hostility   Mother Warmth
Observed Behavior             .30*               .25*            .31*               .26*
Spouse Negative Affectivity   .26*               -.25*           .22*               -.10*
R2                            .16*               .14*            .15*               .09*

Halo Effects
Bias due to overall positive or
negative feelings about the
individual; may distort ratings of
performance or other characteristics.
BARS measures: Behaviorally
Anchored Rating Scales; simply
identify which behaviors occur.
Appears to overcome halo effects.

Loneliness Scale Reliability Example

Short Version of the UCLA Loneliness Scale:
..\..\..\Measures\UCLA_Loneliness_10Items.pdf

Iowa Family Survey:
Telephone survey of adults in Iowa conducted in 2005.
Oversampled adults over 65 years of age.
Included the 10-item loneliness scale.

Data:
..\..\..\Loneliness\Iowa Family Survey 2005\Iowa Family 2005 Survey Data April 07.sa

Types of Validity
Content Validity
Does the measure adequately represent
the meaning of the construct?

Criterion Validity
Do scores on the measure predict
criteria that reflect the construct?

Construct Validity
Are the results based on the measure
consistent w/ theoretical predictions?

Content Validity I
Issue: Does the measure adequately represent the content domain of the construct that is being assessed?
Examples:
Patient satisfaction: Items addressing competence of medical staff, adequacy of facilities, interpersonal skills of staff.
Job satisfaction: Evaluation of various aspects of the job (e.g., work environment, pay, supervision).
Shyness: Behavioral reactions to others, physiological responses (sweating), affect (anxiety).

Theoretical conceptualization of the construct:
The definition of the construct dictates the content domain of the test.
Social Provisions Scale example.
Unidimensional vs. multidimensional measures of job satisfaction, social support, loneliness.

Content Validity II
Validation: Based on expert judgment regarding
content of a measure or test.
Power vs. Additive approach: Narrow vs. broad
conceptualization of the construct; has
implications for item content, test (factor)
structure.
Disguised tests: Content validity is irrelevant.
Examples: Rorschach, TAT, MMPI (Item: "I attend church
regularly"; an indicator of schizophrenia).
Psychoanalytic conceptualization: Impact of defense
mechanisms on responding to structured tests;
"objective" vs. projective assessments. Example: TAT
assessment.

Criterion Validity

Definition: Demonstrating that scores on your measure are associated with other methods of assessing the same construct.
Loneliness: Scores on the UCLA scale correlated .71 with scores on a measure based on ratings on 7 loneliness scales.
[Diagram: UCLA Loneliness Score related to Loneliness Rating]

Paranormal Belief Scale (PBS)

Paranormal experiences:
73% reported such experiences.
Those reporting experiences received higher scores on the PBS.

Performance on "psychokinesis" task:
Instructed to move the clip with their mind.
String with a paper clip attached was held over a set of concentric circles; assessed how far the paper clip moved.
Correlated .40 with PBS scores.

Horoscope and personality:
Provided a personality description based on their sign.
All received the same randomly constructed personality description.
Rating of accuracy correlated .41 with PBS scores.

Known Groups Analysis

Identifying groups in the population that should differ on the construct being assessed, and demonstrating that they receive different scores on your measure.
Loneliness example: Members of loneliness groups received much higher scores on the loneliness scale (e.g., mean of 60 vs. 40; a 2 standard deviation difference).
Note the problem of discriminant validity (i.e., the groups could differ on other constructs as well, such as self-esteem or depression).

Functional Ability Scale Analysis

Impact of Method Variance

Imagine that you were validating a measure of shyness that you had developed. You demonstrate that the measure is associated with scores on another self-report measure of shyness and with a behavioral indicator of shyness based on observations of interpersonal behavior.
Question: Which form of validity is more convincing? Why?

Reflects a belief in the effect of common method variance: that by assessing a variable using the same method of assessment, the correlation between the measures should be enhanced.

Concurrent vs. Predictive Criterion Validity
Example: GRE scores and graduate GPA following admission.
Problem: Restriction of range due to the selection process lowers the correlation.

Loneliness:
Found to predict subsequent nursing home admission, mortality among the elderly, and post-partum depression.

Discriminant Validation
Issue: Demonstrating that your measure
assesses a construct that is different or distinct
from measures of related constructs.
Loneliness scale: Addressed this issue by
conducting a regression analysis wherein we used
measures of other constructs to predict scores on
the loneliness scale.
Scores correlated .71 w/ Loneliness Index.
Loneliness scores were strongly related to measures of
depression (.51), extraversion (-.46), self-esteem (-.49). In
combination, these other variables explained 43% of the
variance in loneliness scores.
After controlling for these other variables, scores on a measure
termed the Loneliness Index accounted for an additional 18%
of the variance in loneliness scores.
Scores on the loneliness scale remained related to time alone
(partial r = .27), number of times eat dinner alone on a Friday
or Saturday night (partial r = .31) and number of friends
(partial r = -.27).

Discriminant Validation via Confirmatory Factor Analysis
Lack of discriminant validity is reflected by the factor structure.
One factor vs. multiple factors.
Compare the fit of alternative CFA models.

Relationships w/ other variables:
Examine the pattern of relationships.
Identical constructs = identical relationships.
Test the equality of relationships.

Correlations Among the Factors

                            Job Satisfaction   Job Involvement   Organizational Commitment
Job Satisfaction            1.00
Job Involvement             .59                1.00
Organizational Commitment   .55                .55               1.00

Comparison of Model Fit

Three-Factor Model: χ2 (24) = 58.25
One-Factor Model: χ2 (27) = 1050.01
Difference: χ2 (3) = 991.76
Results indicate that the three-factor model fits the data much better than the one-factor model.
Indicates that the three constructs are distinct from one another.
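A minimal sketch (assuming SciPy is available) of the nested-model chi-square difference test behind this comparison; the p-value makes explicit how strongly the three-factor model is preferred.

from scipy.stats import chi2

chi_diff = 1050.01 - 58.25             # 991.76
df_diff = 27 - 24                      # 3
p_value = chi2.sf(chi_diff, df_diff)   # upper-tail probability of the difference
print(round(chi_diff, 2), df_diff, p_value)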

Comparison of Model Fit

Different Correlations: χ2 (256) = 594.29
Identical Correlations: χ2 (270) = 824.61
Difference: χ2 (14) = 230.32
Results indicate that the three factors are related differently to the other job-related variables.
Further indicates that the three constructs are distinct from one another.

Multimethod-Multitrait Analysis

Developed by Campbell and Fiske. Goal is to demonstrate that your measure is more highly related to alternative methods of assessing the same construct (termed convergent validity; this is identical to criterion validity as I have defined it) than to measures of other constructs that employ the same method of assessment (termed divergent validity).
Example: Loneliness vs. shyness, as assessed using self-report, roommate report, and behavioral measures.
Heteromethod-monotrait correlations reflect construct variance in the measure, whereas monomethod-heterotrait correlations reflect method variance.
Problem: Assumes that traits are truly uncorrelated.

Question: What does the monomethod-heterotrait correlation represent?
Issue: What is the meaning of "different methods of assessment"?

Example Results

       LSR     LRR     LBM     SSR     SRR     SBM
LSR    1.00
LRR    (.70)   1.00
LBM    (.70)   (.70)   1.00
SSR    [.50]   {.20}   {.20}   1.00
SRR    {.20}   [.50]   {.20}   (.70)   1.00
SBM    {.20}   {.20}   [.50]   (.70)   (.70)   1.00

L = loneliness, S = shyness; SR = self-report, RR = roommate report, BM = behavioral measure.
( ) = heteromethod-monotrait; [ ] = monomethod-heterotrait; { } = heteromethod-heterotrait.

Theory Verification
Issue: Are results based on your measure
consistent with theoretical models involving the
construct? Considered the highest form of
validity, wherein you demonstrate that scores on
your measure relate to measures of other
constructs as you would expect, given theoretical
models.
Problem: What if your empirical evidence is
negative? Is the problem with the theory or the
measure? Have to rely on well-developed and
accepted theory.
Contrast with criterion validity: Here you are relating the measure to measures of other constructs, rather than to alternative measures of the same construct.

Paranormal Belief Example

Divided into Believers & Skeptics based on a median split of scores.
Read the abstract of a fictitious journal article reviewing a number of studies dealing with the existence of ESP.
ESP Proven or ESP Disproven conditions.

Predicted that participants would report emotional arousal & selective recall of the information based on their beliefs.

Results
Emotional arousal: Interaction between treatment condition & PBS scores
ESP Proven: r = -.31
ESP Disproven: r = .37

Recall:
Gave them a surprise recall test after they completed the emotional arousal measure
ESP Disproven: r = -.38
ESP Proven: r = .07

[Figure: % Correct Recall]

Conclusions
Validity of a measure is never "proven."
Development of a body of literature supporting the measure's validity.
Continuing evolution and improvement of measures (loneliness example: revisions over the years).
Makes you very popular: recently put information on my Web site, which has increased the requests for the measure.
Issue: Putting the scale on the Internet; problem with copyright.

Impact on the quality of research: Can greatly improve your ability to demonstrate relationships among variables.
Loneliness example: UCLA scale vs. self-labeling measures, replication of findings. Harry Reiss: Correlations were consistently .20 higher.
