Week 5.1 - Conceptual Basis of Validity - Part 1
Aims
Introduce the concept of validity
Different types of validity (Part 1)
Content validity
Face validity
Factorial validity
Response processes
Validity: Defined
Validity is the degree to which a test measures what it is supposed to
measure.
It is the most important issue in psychological measurement
More formally, validity is “the degree to which evidence and theory
support the interpretations of test scores entailed by the proposed
uses” of a test.
Thus, a test itself is neither valid nor invalid
Validity concerns the interpretations and uses of a measure’s scores
Validity: Described
For example, the scores from a PSYC3302 final exam may be
interpreted as representative of a person’s level of knowledge
and understanding of psychometrics.
By contrast, it would be much less defensible to interpret
PSYC3302 final exam scores as representative of a person’s
level of intelligence.
It would be totally indefensible to interpret PSYC3302 final
exam scores as representative of a person’s level of
extraversion.
Validity: Described
Validity is related to the proposed uses of the scores.
Whether it is appropriate to use/interpret scores from a test
is, in part, a matter of opinion.
For example, the PSYC3302 final exam is a psychometric
test.
How can one interpret a final exam score of 75% on the
PSYC3302 final exam?
Validity: Described
Final exam scores are used principally to order students
from most knowledgeable/competent to least
knowledgeable/competent.
Thus, final exam scores are on an ordinal scale.
However, scores are also interpreted as reflecting a degree of
knowledge/competence.
This suggests an interval scale.
However, someone who scored 70% on the final exam
can’t be said to know/understand twice as much material
as someone who scored 35% on the final exam.
Validity: Described
Someone who scored 90% on the final exam would be
expected to have knowledge/competence over the vast
majority of the material presented in the unit.
By contrast, someone who scored 20% on the final exam
would be expected to have knowledge/competence of less
than 50% of the material.
“Test is Valid”?
You should note that people (including professionals and
researchers) do tend to refer to a test as valid.
However, strictly speaking, this is inappropriate.
People tend to say that a “test is valid” for two reasons:
1) They don’t know any better
2) They get lazy
Validity is a Matter of Degree
Validity is not an all or none issue
The validity of test score interpretations should be
conceived in terms of strong versus weak, rather than
valid versus invalid.
When you choose a psychological test, you should choose
the test that will support the interpretations that you want
to make from the test scores.
Typically, there are several tests on the market from which
to choose.
Validity should be one of the primary considerations.
Validity is a Matter of Degree
Validity is based on theory and empirical evidence.
It is not good enough to hear someone say that a test is
valid in their experience.
You must take responsibility.
There are many popular tests out there that have little to
no validity.
For example, handwriting analysis (graphology) as an indicator
of someone’s personality (no evidence).
The Color Quiz as indicator of someone’s personality (no
evidence).
How is Validity Determined Empirically?
Unlike internal consistency reliability, there is no single
analysis that can be used to represent the degree to which
the interpretations of test scores are valid.
Instead, several different types of analyses need to be
conducted.
Some validity analyses are quantitative and do involve
statistical analyses.
However, some validity analyses are purely qualitative in
nature.
How is Validity Determined Empirically?
In contemporary psychometrics, the pursuit of
establishing the validity of the interpretation of test scores
revolves around the concept of ‘construct validity’.
Construct validity refers to the degree to which test scores
can be interpreted as reflecting a particular psychological
construct.
There are many types of evidence relevant to establishing
the validity of test score interpretations.
Test Content
Represents the match between the actual content of the
test and the content that should be included in the test.
If test scores are to be interpreted as indicators of a
particular construct of interest, then the items included in
the test should reflect the important facets of the construct.
The description of the nature of the construct should help
define the appropriate content of the test.
There are two types of validity relevant to test content:
1: Content validity
2: Face validity
1: Content Validity
A test may be suggested to be associated with good
content validity when the items cover the entire breadth of
the construct.
However, the items can not exceed the boundaries of the
construct.
That is, cannot include construct-irrelevant content.
1: Content Validity
Consider a PSYC3302 mid-semester test.
There should be items in the test from all lectures given
before the mid-semester test.
By contrast, there should not be items from the
PSYC3302 lectures given after the mid-semester test.
This would exceed the boundary of the construct of
interest:
Knowledge and understanding of introductory psychometrics
up to the middle of the semester.
1: Content Validity
It is easy to achieve good content validity for a mid-
semester test or a final exam.
It is tougher for some of the higher-order (abstract)
constructs in psychology.
For example, personality has been reported to encompass
individual differences in “emotional, interpersonal,
experiential, attitudinal, and motivational styles” (McCrae
& John, 1992, p.175).
This essentially covers all facets of individual
differences, except intelligence.
1: Content Validity
Consider the items,
“Rarely feel depressed.”
“Am sad most of the time.”
Are these personality items?
Or are they items designed to measure depression?
Turns out, they are personality items within the IPIP.
Constructs such as personality suffer from a lack of clear
boundaries.
Content validity assessment is therefore very difficult in this area.
1: Content Validity – In Practice
A university unit may consist of 24 lectures and 7 labs.
Perhaps 1,500 pieces of information are
mentioned/discussed in the lectures and labs.
To be entirely comprehensive, a final exam would have to
include 1,500 questions.
At 60 seconds per question, it would take 25 hours to
complete such an exam.
Constraints on time, respondent fatigue, respondent
attention, etc., will place constraints on the precise amount
of content included in a measure.
1: Content Validity – In Practice
A final exam may consist of 4 questions per lecture and 2
questions per lab.
That would yield a 110-question final exam.
There would necessarily be some elements of the material
covered in the lectures/labs that will not be included in
such a final exam.
However, if the 110 questions represent a random
assortment of material covered across the entire unit, the
correlation between performance on a 110-question final exam
and a 1,500-question final exam would likely be very high.
This is ‘domain sampling theory’.
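The domain sampling idea can be illustrated with a small simulation. This is a sketch under assumed values (300 students, logistic response probabilities, a random 110-item subset from a 1,500-item pool); none of the numbers come from an actual exam:

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items, n_subset = 300, 1500, 110

# Assumed latent ability per student and difficulty per item
ability = rng.normal(0.0, 1.0, n_students)
difficulty = rng.normal(0.0, 1.0, n_items)

# Probability of answering correctly rises with ability, falls with difficulty
p_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = rng.random((n_students, n_items)) < p_correct

# Full-domain score versus score on a random 110-item subset
full_scores = responses.sum(axis=1)
subset = rng.choice(n_items, size=n_subset, replace=False)
short_scores = responses[:, subset].sum(axis=1)

r = np.corrcoef(short_scores, full_scores)[0, 1]
print(round(r, 3))  # very high, well above .9
```

Even though the short test covers only about 7% of the item pool, random sampling of items keeps the rank ordering of students almost intact.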
Standard Error of Estimate
If we did have scores from participants who completed
both the 110-item exam and the 1,500-item final exam, we
could calculate the standard error of estimate.
Not to be confused with the ‘standard error of measurement’.
The standard error of estimate reflects how much actual
scores deviate from the scores predicted by the regression
line,
The typical error (difference) between predicted values and
observed values.
That is, by how much do scores on the 110-item exam fail to
predict performance on the 1,500-item exam?
Standard Error of Estimate
SEE = S_X2 √(1 − r²_X1X2)
S_X2 = standard deviation of the criterion (in this case, the long exam)
Standard Error of Measurement/Estimate
The formula for standard error of measurement and standard error of estimate are
very similar.
The term ‘standard error of measurement’ should only be used in the context of
reliability.
‘Standard error of estimate’ should be used in the context of validity.
SEM = S_O √(1 − R_XX)    SEE = S_X2 √(1 − r²_X1X2)
S_O = standard deviation of the test; R_XX = reliability estimate.
S_X2 = standard deviation of the criterion (could be a test); r_X1X2 = correlation between variable 1 (X1) and variable 2 (X2).
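The two formulas can be written as small helper functions. This is a minimal sketch; the function names are mine, not from any psychometrics library, and the demonstration values are invented:

```python
import math

def standard_error_of_measurement(sd_test, reliability):
    # SEM = S_O * sqrt(1 - R_XX): typical gap between observed and true scores
    return sd_test * math.sqrt(1.0 - reliability)

def standard_error_of_estimate(sd_criterion, r):
    # SEE = S_X2 * sqrt(1 - r**2): typical gap between observed and
    # regression-predicted criterion scores
    return sd_criterion * math.sqrt(1.0 - r ** 2)

# Illustrative values (assumed, not from the lecture)
print(round(standard_error_of_measurement(15.0, 0.90), 2))  # 4.74
```

Note the structural difference: SEM uses the reliability itself (1 − R_XX), whereas SEE uses the squared correlation (1 − r²).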
Standard Error of Estimate: Example
In 2015, the PSYC3302 exam (N = 298 students) had 118 questions:
M = 75.4, SD = 13.2, Cronbach’s alpha = .932
A random selection of 59 questions from the 118 exam (50%) had the corresponding
descriptive statistics:
M = 73.8, SD = 14.0, Cronbach’s alpha = .871
The Pearson correlation between the 59-question version of the final exam and the
full 118-question version was r = .973
Applying the formula: SEE = 13.2 × √(1 − .973²) ≈ 3.05.
Thus, roughly 68% of the prediction errors associated with the short-form exam would be between
plus and minus 3.05 marks.
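Plugging the example's numbers (criterion SD = 13.2, r = .973) into the standard error of estimate formula:

```python
import math

sd_criterion = 13.2  # SD of the full 118-question exam (the criterion)
r = 0.973            # correlation between the 59-question form and the full exam

see = sd_criterion * math.sqrt(1.0 - r ** 2)
print(round(see, 2))  # about 3.05 marks
```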
2: Face Validity
Face validity represents the degree to which the items
associated with a measure appear to be related to the
construct of interest.
This “appearance” is in the judgement of non-experts.
Face validity is not crucial from a fundamental
psychometric perspective.
It is more of a practical consideration.
As a general statement, respondents need to be made to
feel that they are responding to items that are relevant to
the task at hand.
Face Validity
Suppose you were applying for a job as an air traffic
controller and research has shown that introverts perform
better in this job than extraverts.
In the recruitment process, the candidates are asked to
respond to the following item: ‘I am the life of the party’.
This item is an indicator of extraversion.
People might get upset that they are being asked such a
question.
Another item could be: ‘I tend to chew chocolate rather
than let it dissolve in my mouth slowly.’
Content vs. Face Validity
Content validity can only realistically be assessed by
experts in the field.
They need to understand the nature of the construct, its
history, its relation to other constructs, etc.
By contrast, face validity must be assessable by non-
experts.
It is the respondents who are likely to take the test who
must be kept in mind when assessing face validity.
From this perspective, face validity is not considered an
especially important form of construct validity evidence.
Face Validity - Disadvantage
One problem with people being able to discern the
appropriateness of an item is that people can also respond
in a way that they think is most advantageous for them.
If you were applying for a job, how would you respond to:
“I express my emotions at the appropriate time.”
By contrast, how would you respond to:
“I tend to chew chocolate rather than let it dissolve in my
mouth slowly.”
3: Factorial Validity (Internal Structure of the
Test)
Not all sources (e.g., textbooks) use the term ‘factorial
validity’, but I (and others) use it.
Factorial validity refers to…
1: The number of dimensions measured by a test.
2: Whether the items of a test are related to the dimensions of
interest.
3: Whether the dimensions of interest are related to each other.
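As a sketch of the first question — how many dimensions a test measures — the Kaiser (eigenvalue > 1) rule can be applied to the item correlation matrix. The data below are simulated with two built-in dimensions; nothing here comes from a real test:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two uncorrelated latent dimensions; items 1-3 load on the first,
# items 4-6 on the second (all values simulated)
f1, f2 = rng.normal(size=(2, n))
noise = 0.6 * rng.normal(size=(6, n))
items = np.vstack([f1, f1, f1, f2, f2, f2]) + noise

# Kaiser criterion: count eigenvalues of the item correlation matrix above 1
corr = np.corrcoef(items)
eigvals = np.linalg.eigvalsh(corr)[::-1]
n_dims = int((eigvals > 1.0).sum())
print(n_dims)  # 2
```

In practice this would be followed by a proper factor analysis to check which items load on which dimension, and how the dimensions correlate.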
3: Factorial Validity (Internal Structure of the
Test)
When a test is designed, it is typically done so in such a
way that the number of dimensions and facets are
specifically considered/specified.
Additionally, the items are hypothesized to be indicators
of one specific dimension/facet.
NEO-PI R: Five-Factor Model
(Figure: the five dimensions — Neuroticism, Extraversion, Openness, Agreeableness, Conscientiousness)
The blue ellipses are dimensions: they represent the highest-order categories.
The green squares are facets. Each dimension is made up of six facets. Facets are
sometimes referred to as subscales (subtests). In the case of the Neuroticism
dimension, the 6 facets are named: Anxiety, Hostility, Depression, Self-Consciousness,
Impulsiveness, Vulnerability to Stress.
The blue squares are individual items. Each facet comprises 8 items.
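The hierarchy described above can be sketched as a simple nested data structure. The item texts are placeholders, not actual NEO-PI-R content; only the counts (6 facets × 8 items) follow the slide:

```python
neuroticism_facets = ["Anxiety", "Hostility", "Depression",
                      "Self-Consciousness", "Impulsiveness",
                      "Vulnerability to Stress"]

# One dimension mapped to its facets, each facet to placeholder items
neuroticism = {
    facet: [f"{facet} item {i}" for i in range(1, 9)]
    for facet in neuroticism_facets
}

n_items = sum(len(items) for items in neuroticism.values())
print(len(neuroticism), n_items)  # 6 facets, 48 items
```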
4: Response Processes
There should be a close match between the psychological
processes that respondents actually use when completing a
measure and the processes that they are expected to use.
You can’t just assume that people are going to do what
you expect them to do.
Response Processes
In some cases, it is relatively obvious that there is a
discordance between the expected responses and the
actual responses.
For example, when people respond ‘strongly agree’ to a highly
face-valid item in a recruitment context just because they want
the job, not because they actually possess the attribute of
interest.
Another slightly more subtle case is when people respond semi-
randomly to items, because the questionnaire is too long (i.e.,
respondents are getting bored and/or restless).
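One simple screen for this kind of careless responding is ‘long-string’ analysis: flagging respondents with an implausibly long run of identical consecutive answers. A minimal sketch with invented 12-item Likert response strings (the threshold of 6 is an assumption, not a standard value):

```python
def longest_run(responses):
    # Length of the longest streak of identical consecutive answers
    best = run = 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# Toy response strings (invented for illustration)
careful = [3, 4, 2, 5, 1, 3, 4, 2, 3, 5, 2, 4]
careless = [3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 2, 4]

THRESHOLD = 6  # assumed cut-off; chosen per questionnaire length in practice
flagged = [longest_run(r) >= THRESHOLD for r in (careful, careless)]
print(flagged)  # [False, True]
```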
Response Processes
Cheating is another example of behaviour that seriously
compromises a researcher’s capacity to interpret the scores as
valid indicators of performance.
Some people cheat on tests, even in “low-stakes” settings.
In my lab, some first-year psychology students look up the
definitions of words on the computer while doing a
vocabulary test!