
Validity – Conceptual Basis

Dr. Gilles Gignac

1
Aims
 Introduce the concept of validity
 Different types of validity (Part 1)
 Content validity
 Face validity
 Factorial validity
 Response processes

2
Validity: Defined
 Validity is the degree to which a test measures what it is supposed to
measure.
 It is the most important issue in psychological measurement
 More formally, validity is “the degree to which evidence and theory
support the interpretations of test scores entailed by the proposed
uses” of a test.
 Thus, a test itself is neither valid nor invalid
 Validity concerns the interpretations and uses of a measure’s scores

3
Validity: Described
 For example, the scores from a PSYC3302 final exam may be
interpreted as representative of a person’s level of knowledge
and understanding of psychometrics.
 By contrast, it would be much less defensible to interpret
PSYC3302 final exam scores as representative of a person’s
level of intelligence.
 It would be totally indefensible to interpret PSYC3302 final
exam scores as representative of a person’s level of
extraversion.

4
Validity: Described
 Validity is related to the proposed uses of the scores.
 Whether it is appropriate to use/interpret scores from a test
is, in part, a matter of opinion.
 For example, the PSYC3302 final exam is a psychometric
test.
 How can one interpret a score of 75% on the PSYC3302
final exam?

5
Validity: Described
 Final exam scores are used principally to order students
from most knowledgeable/competent to least
knowledgeable/competent.
 Thus, final exam scores are on an ordinal scale.
 However, there is also some level of interpretation of the
degree of knowledge/competence.
 This suggests an interval scale.
 However, someone who scored 70% on the final exam
can’t be said to know/understand twice as much material
as someone who scored 35% on the final exam, so the
scores do not reach a ratio scale.

6
Validity: Described
 Someone who scored 90% on the final exam would be
expected to have knowledge/competence over the vast
majority of the material presented in the unit.
 By contrast, someone who scored 20% on the final exam
would be expected to have knowledge/competence of less
than 50% of the material.

7
“Test is Valid”?
 You should note that people (including professionals and
researchers) do tend to refer to a test as valid.
 However, strictly speaking, this is inappropriate.
 People tend to say that a “test is valid” for two reasons:
 1) They don’t know any better
 2) They get lazy

8
Validity is a Matter of Degree
 Validity is not an all-or-none issue.
 The validity of test score interpretations should be
conceived in terms of strong versus weak, rather than
valid versus invalid.
 When you choose a psychological test, you should choose
the test that will support the interpretations that you want
to make from the test scores.
 Typically, there are several tests on the market from which
to choose.
 Validity should be one of the primary considerations.

9
Validity is a Matter of Degree
 Validity is based on theory and empirical evidence.
 It’s not good enough to hear someone say that a test is
valid “in their experience”.
 You must take responsibility.
 There are many popular tests out there that have little to
no validity.
 For example, handwriting analysis (graphology) as an indicator
of someone’s personality (no evidence).
 The Color Quiz as indicator of someone’s personality (no
evidence).

10
How is Validity Determined Empirically?
 Unlike internal consistency reliability, there is no single
analysis that can be used to represent the degree to which
the interpretations of test scores are valid.
 Instead, several different types of analyses need to be
conducted.
 Some validity analyses are quantitative and do involve
statistical analyses.
 However, some validity analyses are purely qualitative in
nature.

11
How is Validity Determined Empirically?
 In contemporary psychometrics, the pursuit of
establishing the validity of the interpretation of test scores
revolves around the concept of ‘construct validity’.
 Construct validity refers to the degree to which test scores
can be interpreted as reflecting a particular psychological
construct.
 There are many types of evidence relevant to establishing
the validity of test score interpretations.

12
Test Content
 Represents the match between the actual content of the
test and the content that should be included in the test.
 If test scores are to be interpreted as indicators of a
particular construct of interest, then the items included in
the test should reflect the important facets of the construct.
 The description of the nature of the construct should help
define the appropriate content of the test.
 There are two types of validity relevant to test content:
 1: Content validity
 2: Face validity

13
1: Content Validity
 A test may be said to have good content validity when the
items cover the entire breadth of the construct.
 However, the items cannot exceed the boundaries of the
construct.
 That is, the test cannot include construct-irrelevant content.

14
1: Content Validity
 Consider a PSYC3302 mid-semester test.
 There should be items in the test from all lectures given
before the mid-semester test.
 By contrast, there should not be items from the
PSYC3302 lectures given after the mid-semester test.
 This would exceed the boundary of the construct of
interest:
 Knowledge and understanding of introductory psychometrics
up to the middle of the semester.

15
1: Content Validity
 It is easy to achieve good content validity for a mid-
semester test or a final exam.
 It is tougher for some of the higher-order (more abstract)
constructs in psychology.
 For example, personality has been reported to encompass
individual differences in “emotional, interpersonal,
experiential, attitudinal, and motivational styles” (McCrae
& John, 1992, p.175).
 This essentially covers all facets of individual
differences, except intelligence.

16
1: Content Validity
 Consider the items,
 “Rarely feel depressed.”
 “Am sad most of the time.”
 Are these personality items?
 Or are they items designed to measure depression?
 Turns out, they are personality items within the IPIP.
 Constructs such as personality suffer from a lack of clear
boundaries.
 Consequently, content validity assessment is very difficult in this area.

17
1: Content Validity – In Practice
 A university unit may consist of 24 lectures and 7 labs.
 Perhaps 1,500 pieces of information are
mentioned/discussed in the lectures and labs.
 To be entirely comprehensive, a final exam would have to
include 1,500 questions.
 At 60 seconds per question, it would take 25 hours to
complete such an exam.
 Time limits, respondent fatigue, respondent attention,
etc., will place constraints on the precise amount of
content included in a measure.

18
1: Content Validity – In Practice
 A final exam may consist of 4 questions per lecture and 2
questions per lab.
 That would yield a 110-question final exam.
 There would necessarily be some elements of the material
covered in the lectures/labs that will not be included in
such a final exam.
 However, if the 110 questions represent a random
assortment of material covered across the entire unit, the
correlation between performance on a 110-question final
exam and a 1,500-question final exam would likely be very high.
 This is ‘domain sampling theory’.
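As an illustration of domain sampling theory, here is a minimal simulation sketch (the response model and all values are hypothetical, not actual PSYC3302 data): students’ item responses are generated from an underlying competence level, and scores on a random 110-item subset of a 1,500-item exam are correlated with scores on the full exam.

import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 300, 1500        # hypothetical cohort and item domain
ability = rng.normal(size=n_students)  # each student's underlying competence
difficulty = rng.normal(size=n_items)  # each item's difficulty
# Probability of a correct response rises with ability, falls with difficulty
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
responses = (rng.random((n_students, n_items)) < p_correct).astype(int)

full_score = responses.sum(axis=1)                     # 1,500-item exam score
subset = rng.choice(n_items, size=110, replace=False)  # random 110 items
short_score = responses[:, subset].sum(axis=1)         # 110-item exam score

r = np.corrcoef(short_score, full_score)[0, 1]
print(f"r(110-item, 1,500-item) = {r:.3f}")            # typically .95 or higher

The correlation is high because both totals sample the same underlying competence; the short form simply samples it with a little more error.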

19
Standard Error of Estimate
 If we did have scores from participants who completed
both the 110-item exam and the 1,500-item exam, we
could calculate the standard error of estimate.
 Not to be confused with the ‘standard error of measurement’.
 The standard error of estimate reflects how much actual
scores deviate from the scores predicted by the regression
line:
 The typical error (difference) between predicted values and
observed values.
 That is, by how much do the 110-item exam scores fail to
predict performance on the 1,500-item exam?

20
Standard Error of Estimate

$s_{ee} = s_{x_2}\sqrt{1 - r_{x_1 x_2}}$

$s_{x_2}$ = standard deviation of the criterion (in this case, the long exam)

$r_{x_1 x_2}$ = correlation between the short exam ($x_1$) and the long exam ($x_2$)

You do not have to memorise this formula for the purposes of examination.

21
Standard Error of Measurement/Estimate
 The formula for standard error of measurement and standard error of estimate are
very similar.
 The term ‘standard error of measurement’ should only be used in the context of
reliability.
 ‘Standard error of estimate’ should be used in the context of validity.

Standard error of measurement: $s_{em} = s_o\sqrt{1 - R_{xx}}$
  $s_o$ = standard deviation of the test; $R_{xx}$ = reliability estimate

Standard error of estimate: $s_{ee} = s_{x_2}\sqrt{1 - r_{x_1 x_2}}$
  $s_{x_2}$ = standard deviation of the criterion (could be a test); $r_{x_1 x_2}$ = correlation between variable 1 ($x_1$) and variable 2 ($x_2$)
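As a side-by-side sketch of the two quantities as defined on this slide (the function names are illustrative):

def standard_error_of_measurement(sd_test, reliability):
    # Reliability context: s_em = s_o * sqrt(1 - R_xx)
    return sd_test * (1 - reliability) ** 0.5

def standard_error_of_estimate(sd_criterion, r_x1_x2):
    # Validity context, per this slide: s_ee = s_x2 * sqrt(1 - r_x1x2)
    return sd_criterion * (1 - r_x1_x2) ** 0.5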

22
Standard Error of Estimate: Example
 In 2015, the PSYC3302 exam (N = 298 students) had 118 questions:
 M = 75.4, SD = 13.2, Cronbach’s alpha = .932
 A random selection of 59 questions from the 118-question exam (50%) had the corresponding
descriptive statistics:
 M = 73.8, SD = 14.0, Cronbach’s alpha = .871
 The Pearson correlation between the 59-question version of the final exam and the
118-question version was r = .973

$s_{x_2}$ = standard deviation of the criterion (long form) = 13.2

$r_{x_1 x_2}$ = correlation between variable 1 (short form) and variable 2 (long form) = .973

$s_{ee} = 13.2 \times \sqrt{1 - .973} = 13.2 \times .164 = 2.16$

68% of the prediction errors associated with the short-form exam would fall within
plus or minus 2.16 marks.
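A quick check of the slide’s arithmetic, with the values taken from above (the small discrepancy with 2.16 comes from rounding √.027 to .164 before multiplying):

sd_long = 13.2        # SD of the 118-question exam (the criterion)
r_short_long = 0.973  # correlation between the 59- and 118-question forms
see = sd_long * (1 - r_short_long) ** 0.5
print(round(see, 2))  # 2.17 unrounded; the slide's 13.2 * .164 gives 2.16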

23
2: Face Validity
 Face validity represents the degree to which the items
associated with a measure appear to be related to the
construct of interest.
 This “appearance” is in the judgement of non-experts.
 Face validity is not crucial from a fundamental
psychometric perspective.
 It is more of a practical consideration.
 As a general statement, respondents need to be made to
feel that they are responding to items that are relevant to
the task at hand.

24
Face Validity
 Suppose you were applying for a job as an air traffic
controller and research has shown that introverts perform
better in this job than extraverts.
 In the recruitment process, the candidates are asked to
respond to the following item: ‘I am the life of the party’.
 This item is an indicator of extraversion.
 People might get upset that they are being asked such a
question.
 Another, much less face-valid, item could be: ‘I tend to chew
chocolate rather than let it dissolve in my mouth slowly.’

25
Content vs. Face Validity
 Content validity can only realistically be assessed by
experts in the field.
 They need to understand the nature of the construct, its
history, its relation to other constructs, etc.
 By contrast, face validity must be assessable by non-
experts.
 It is the respondents who are likely to take the test who
must be kept in mind when assessing face validity.
 From this perspective, face validity is not considered an
especially important form of construct validity evidence.

26
Face Validity - Disadvantage
 One problem with people being able to discern what an
item measures is that they can then respond in whatever
way they think is most advantageous for them.
 If you were applying for a job, how would you respond to:
 “I express my emotions at the appropriate time.”
 By contrast, how would you respond to:
 “I tend to chew chocolate rather than let it dissolve in my
mouth slowly.”

27
3: Factorial Validity (Internal Structure of the Test)
 Not all sources (e.g., textbooks) use the term ‘factorial
validity’.
 But I (and others) use it.
 Factorial validity refers to…
 1: The number of dimensions measured by a test.
 2: Whether the items of a test are related to the dimensions of
interest.
 3: Whether the dimensions of interest are related to each other.

28
3: Factorial Validity (Internal Structure of the Test)
 When a test is designed, the number of dimensions and
facets is typically considered/specified in advance.
 Additionally, each item is hypothesized to be an indicator
of one specific dimension/facet.

29
NEO-PI R: Five-Factor Model
[Figure: hierarchical diagram of the five NEO-PI R dimensions: Neuroticism, Extraversion, Openness, Agreeableness, Conscientiousness]

 The blue ellipses are dimensions: they represent the highest-order categories.

 The green squares are facets. Each dimension is made up of six facets. Facets are
sometimes referred to as subscales (subtests). In the case of the Neuroticism
dimension, the 6 facets are named: Anxiety, Hostility, Depression, Self-
Consciousness, Impulsiveness, Vulnerability to Stress.

 The blue squares are individual items. Each facet comprises 8 items.

 The whole inventory consists of 5*6*8 = 240 items.

 According to the structure of the NEO-PI R, it is not appropriate to calculate a total
personality score (i.e., you can’t add the scores on the 5 dimensions together).
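A minimal sketch of how this hierarchical structure constrains scoring (only the 5 × 6 × 8 layout comes from the slide; the summing shown is illustrative, not the actual NEO-PI R scoring procedure):

# The NEO-PI R hierarchy: 5 dimensions x 6 facets x 8 items
DIMENSIONS = ["Neuroticism", "Extraversion", "Openness",
              "Agreeableness", "Conscientiousness"]
FACETS_PER_DIMENSION = 6
ITEMS_PER_FACET = 8

assert len(DIMENSIONS) * FACETS_PER_DIMENSION * ITEMS_PER_FACET == 240

def facet_score(item_scores):
    # A facet score aggregates its 8 item responses
    assert len(item_scores) == ITEMS_PER_FACET
    return sum(item_scores)

def dimension_score(facet_scores):
    # A dimension score aggregates its 6 facet scores
    assert len(facet_scores) == FACETS_PER_DIMENSION
    return sum(facet_scores)

# Deliberately no total_personality_score(): the model does not
# license summing across the five dimensions.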
30
Factorial Validity
 The factorial validity (hypothesized structure) of a test can
be tested using various quantitative techniques.
 Typically, researchers use a technique known as factor
analysis to evaluate the factorial validity of the scores
derived from a test.
 There are two types of factor analysis:
 Unrestricted factor analysis, and
 Restricted factor analysis.
 Next week’s lecture is devoted to a detailed examination
of factor analysis.

31
4: Response Processes
 There should be a close match between the psychological
processes that respondents actually use when completing
a measure and the processes that they should use.
 You can’t just assume that people are going to do what
you expect them to do.

32
Response Processes
 In some cases, it is relatively obvious that there is a
discordance between the expected responses and the
actual responses.
 For example, when people respond ‘strongly agree’ to a highly
face-valid item in a recruitment context just because they want
the job, not because they actually possess the attribute of
interest.
 Another slightly more subtle case is when people respond semi-
randomly to items, because the questionnaire is too long (i.e.,
respondents are getting bored and/or restless).

33
Response Processes
 Cheating is another example of behaviour that seriously
compromises a researcher’s capacity to interpret the scores as
valid indicators of performance.
 Some people cheat on tests, even in “low-stakes” settings.
 In my lab, some first-year psychology students look up the
definitions of words on the computer while doing a
vocabulary test!

34
