Reliability and Validity: Dr James Betts

Reliability and Validity
Introduction to Study Skills & Research Methods (HL10040)
Dr James Betts
Lecture Outline:
•Definition of Terms
•Types of Validity
•Threats to Validity
•Types of Reliability
•Threats to Reliability
•Introduction to Measurement Error.
Commonly used terms…
“She has a valid point”
“My car is unreliable”
…in science…
“The conclusion of the study was not valid”
“The findings of the study were not reliable”.

Some definitions…
• Validity
“The soundness or appropriateness of a test or instrument in measuring what it is designed to measure” (Vincent 1999)
“Degree to which a test or instrument measures what it purports to measure” (Thomas & Nelson 1996)
• Reliability
“…the degree to which a test or measure produces the same scores when applied in the same circumstances…” (Nelson 1997)
• Objectivity
“…the degree to which different observers agree on measurements…” (Atkinson & Nevill 1998)

Types of Experimental Validity
• Internal
– Is the experimenter measuring the effect of the independent variable on the dependent variable?
• External
– Can the results be generalised to the wider population?
Validity
• Logical: Face, Content
• Statistical (AKA Criterion): Concurrent, Predictive
• Construct (Logical/Statistical)
Reliability
• Consistency
• Objectivity

Logical Validity
• Face Validity
– Implies that a test is valid by definition
– It is clear that the test measures what it is supposed to
e.g.
If you want to assess reaction time, measuring how long it takes an individual to react to a given stimulus would have face validity (Externally valid?)
i.e.
Would assessing 15 m sprint time be a valid means of assessing reaction time?
Assessing face validity is therefore a subjective process.

Logical Validity
• Content Validity
– Implies that the test measures all aspects contributing to the variable of interest
e.g.
Who is the most physically fit?
VO2 max test?
Wingate test?
1 RM?
…also a subjective process.
Overall:
A logically valid test simply appears to measure the right variable in its entirety?
Statistical Validity
• Concurrent Validity
– Implies that the test produces similar results to a previously validated test
e.g.
VO2 max: Incremental Treadmill Protocol with expired gas analysis vs. Multi-Stage Fitness (Beep) Test
Statistical Validity
• Predictive Validity
– Implies that the test provides a valid reflection of future performance using a similar test
e.g.
Can performance during test A be used to predict future performance in test B?
http://www.youtube.com/watch?v=vdPQ3QxDZ1s
Overall:
A statistically valid test produces results that agree with other similar tests?
Logical/Statistical Validity
• Construct Validity
– Implies not only that the test is measuring what it is supposed to, but also that it is capable of detecting what should exist, theoretically
– Therefore relates to hypothetical or intangible constructs
e.g.
Team Rivalry
Sportsmanship.
– This makes assessment difficult,
i.e. if what should exist cannot be detected, this could mean:
a) Test Invalid? b) Theory Incorrect? c) Sensitivity/Specificity Issues?

Interesting Example: Breast Cancer
• Incidence: ~1 % (0.8 %)
(i.e. approximately 1 in every 100 women tested actually has breast cancer)
• Sensitivity: ~90 % (87 %)
(the mammogram is sensitive enough that approximately 90 in every 100 breast cancer patients will receive a positive result)
• Specificity: ~90 % (93 %)
(the mammogram is specific enough that approximately 90 in every 100 healthy patients will receive a negative result).
Data from Kerlikowske et al. (1996)

Quick Test
• What is the probability that a patient receiving a positive result actually has breast cancer?
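One way to work through the Quick Test is Bayes' theorem: the probability of cancer given a positive result is the share of true positives among all positives. The sketch below is illustrative only. The function name is hypothetical, and the inputs are the incidence, sensitivity and specificity figures quoted on the previous slide.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = prevalence * sensitivity               # diseased AND positive
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy AND positive
    return true_pos / (true_pos + false_pos)


# Rounded figures from the slide: 1 % incidence, 90 % sensitivity, 90 % specificity
rounded = positive_predictive_value(0.01, 0.90, 0.90)

# Reported figures in brackets on the slide: 0.8 %, 87 %, 93 %
reported = positive_predictive_value(0.008, 0.87, 0.93)

print(f"Rounded figures:  {rounded:.1%}")
print(f"Reported figures: {reported:.1%}")
```

With the rounded figures the answer is roughly 8 %, and roughly 9 % with the reported figures, i.e. most positive results come from the much larger group of healthy women.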
Threats to Validity
(and possible solutions?)
Threats to Internal Validity
• Maturation
– Changes in the DV over time irrespective of the IV
e.g. One Group Pre-test Post-test
O1 T O2
• Maturation (possible solution)
Time series
O1 O2 O3 T O4 O5 O6
Pre-test Post-test Randomised Group Comparison (n.b. RCT)
R O1 T O2
R O3 P O4
Repeated measures designs can occasionally be an inappropriate solution, even when randomised and counterbalanced
e.g.
Muscle Damage (repeated bout effect)
Vitamin Supplementation (wash-out period)
In which case independent measures designs could be used.

• History
– Unplanned events between measurements
e.g. exercise?
O1 T O2
Therefore, solution = control extraneous variables!

Threats to Internal/External Validity
• Pre-testing
– Interactive effects due to the pre-test (e.g. learning,
sensitisation, etc.)
– Also influences External Validity
e.g.
R O1 T O2
R O3 P O4
Assessing muscle mass here (at the pre-test) could make them train harder in both trials…
…but then respond better to the T than the P…
…so it is actually T+O1 that is better than P, not T alone.
• Pre-testing (possible solution)
Solomon Four-Group Design
R O1 T O2
R O3 P O4
R    T O5
R    P O6
• Statistical Regression
– AKA regression to the mean
– An initial extreme score is likely to be followed by less extreme subsequent scores
e.g. Sophomore Slump & SI ‘Cover Jinx’
e.g.
Training has the greatest effect on untrained individuals.
Therefore, solution = effective sampling.
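Regression to the mean can be seen in a small simulation. The sketch below is illustrative only: every number in it is hypothetical (a stable "true" score of 50 ± 5 plus random test-day noise of ± 5), and no treatment is applied between the two tests. The top 10 % on test 1 still score lower, on average, on test 2.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable
N = 10_000

# Hypothetical population: stable true score plus independent noise on each test
true_scores = [random.gauss(50, 5) for _ in range(N)]
test1 = [t + random.gauss(0, 5) for t in true_scores]
test2 = [t + random.gauss(0, 5) for t in true_scores]

# Select the extreme scorers on test 1 (top 10 %)
cutoff = sorted(test1)[int(0.9 * N)]
top = [i for i in range(N) if test1[i] >= cutoff]

top_mean_1 = statistics.mean(test1[i] for i in top)
top_mean_2 = statistics.mean(test2[i] for i in top)
overall_mean = statistics.mean(test1)

print(f"Overall mean on test 1:   {overall_mean:.1f}")
print(f"Top group mean on test 1: {top_mean_1:.1f}")
print(f"Top group mean on test 2: {top_mean_2:.1f}")
```

The top group remains above average on test 2 but drifts back towards the overall mean, which could be mistaken for a treatment effect if the group had been selected on its extreme pre-test score.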

• Instrumentation
– A difference in the way 2 comparable variables
were measured
e.g.
Uncalibrated equipment
Therefore, solution = calibrate!

• Selection Bias
– The groups for comparison are not equivalent
e.g. Groups not randomly assigned
Static Group Comparison
T O1
P O2
i.e. Group T were resistance trained to start with
• Selection Bias (possible solution)
T O1
P O2
Either:
- Randomise group assignment,
- Pre-test and post-test difference,
- Repeated Measures Design.
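Randomised group assignment, the first option above, can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure. The function name, the participant IDs and the seed are all hypothetical.

```python
import random


def randomise_groups(participants, seed=None):
    """Shuffle participants and split them into treatment (T) and placebo (P)."""
    rng = random.Random(seed)      # local generator so the split is reproducible
    shuffled = list(participants)  # copy, so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"T": shuffled[:half], "P": shuffled[half:]}


# Hypothetical participant IDs S01..S20
participants = [f"S{i:02d}" for i in range(1, 21)]
groups = randomise_groups(participants, seed=42)

print("T:", groups["T"])
print("P:", groups["P"])
```

Because allocation depends only on the shuffle, characteristics such as prior resistance training should be spread across both groups by chance rather than by choice.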
• Experimental Mortality
– Missing Data due to subject drop-out
– Reduced n = reduced statistical Power
– Not only challenges quality of data gathered (Internal Validity) but also our ability to generalise (External Validity).
Therefore, solution = recruit sufficient (young?) participants
Threats to External Validity
• Inadequate description
– 5th characteristic of research… …should be replicable
If nobody can replicate the methods of a given study, then it is irrefutable and therefore lacks external validity.
Therefore, solution = comprehensive methodology

• Biased sampling
– Linked to statistical regression
– Sample does not reflect target population
– n ≠ N
e.g. Results generalised across gender
Therefore, solution = random sample (of target population).

• Hawthorne Effect
– DV is influenced by the fact that it is being recorded
e.g.
Fastest sprint when professor enters lab
Therefore, solution = control the lab environment.
• Demand Characteristics
– Participants detect the purpose of the study and behave accordingly
e.g.
Sports Science students already know that the carbohydrate drink is supposedly superior (CHO vs. H2O)
Therefore, solution = double or single blinding.
• Operationalisation
– AKA Ecological Validity
– The DV must have some relevance in the ‘real world’
e.g.
TTE has no Olympic equivalent
Therefore, solution = choose your DV carefully.

Reliability
• Reliability is a pre-requisite of validity
e.g. Direct versus Indirect measures of VO2 max
Direct: Gold Standard (i.e. valid and reliable), Expensive, Complex
Indirect: Predictive, Cheap, Easy
Reliability
Subject 1 60 ml·kg⁻¹·min⁻¹ 60 ml·kg⁻¹·min⁻¹ 60 ml·kg⁻¹·min⁻¹
Valid and Reliable
Reliability
Not Valid but Reliable (5 ml·kg⁻¹·min⁻¹ correction?)
Reliability
Not Valid and not Reliable
i.e. a test can never be valid without being reliable?
Types of Reliability
• Relative
• Absolute
• Rater reliability (Objectivity)
– Intrarater reliability
– Interrater reliability.
Relative Reliability
Relatively Reliable
i.e. Individuals maintain position in the group
Absolute Reliability
Not Absolutely Reliable
i.e. Test-Retest within individuals
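The difference between relative and absolute reliability can be checked on the three-subject VO2 max values that accompany these slides (60/63/57, 55/56/48 and 70/65/66 ml·kg⁻¹·min⁻¹). The sketch below is illustrative only and assumes the three values per subject are repeated tests, since the column headings are not given. Comparing rank order across tests and using the range within each subject are simple stand-ins, not formal reliability statistics.

```python
# Values from the slides; ASSUMPTION: each list holds three repeated tests
scores = {
    "Subject 1": [60, 63, 57],
    "Subject 2": [55, 56, 48],
    "Subject 3": [70, 65, 66],
}


def rank_order(column):
    """Subjects ordered from highest to lowest score on one test."""
    return sorted(scores, key=lambda s: scores[s][column], reverse=True)


# Relative reliability: does everyone keep the same position in the group?
orders = [rank_order(c) for c in range(3)]
relatively_reliable = all(o == orders[0] for o in orders)

# Absolute reliability: how much does each individual vary from test to test?
ranges = {s: max(v) - min(v) for s, v in scores.items()}

print("Rank order on each test:", orders)
print("Rank order preserved:", relatively_reliable)
print("Within-subject range:", ranges)
```

The rank order is identical on all three tests, yet each subject's own scores vary by 5 to 8 ml·kg⁻¹·min⁻¹, which is why the same data can be relatively reliable without being absolutely reliable.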
Rater Reliability
• Intrarater reliability
– The consistency of a given observer or
measurement tool on more than one occasion
Rater Reliability
• Interrater reliability
– The consistency of a given measurement from
more than one observer or measurement tool
e.g.
Score for the American Gymnast
British Judge = 9.9
French Judge = 4.4
Japanese Judge = 7.0
Threats to Reliability
• Fatigue
8 am 9 am 10 am
Therefore, solution = increase time between tests.

• Habituation
Therefore, solution = familiarise prior to test.

• Standardisation of Procedures
– Control of extraneous variables
• Precision of Measurements
– i.e. if we are happy to measure VO2 max to the nearest 10 ml·kg⁻¹·min⁻¹, then it could probably be reliably predicted from your training volume and age.
Measurement Errors
• Ultimately, reliability is dependent on the
degree of measurement error in a given study
• The overall error in any measurement is

comprised of both systematic and random error
• We will address measurement error further next

week…
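As a rough preview, systematic and random error can be separated by looking at the differences between a measured value and a reference value: the mean difference reflects systematic error (bias), and the spread of the differences reflects random error. The sketch below is illustrative only. It uses VO2 max values from the reliability examples in these slides and assumes that the first value for each subject is the criterion and the second is a later test. The function name is hypothetical.

```python
import statistics


def error_components(criterion, measured):
    """Return (mean difference, SD of differences) between two sets of scores."""
    diffs = [m - c for c, m in zip(criterion, measured)]
    return statistics.mean(diffs), statistics.stdev(diffs)


# ASSUMPTION: first column = criterion value, second column = test value
criterion = [60, 55, 70]      # ml·kg⁻¹·min⁻¹
offset_test = [65, 60, 75]    # every subject reads 5 higher
erratic_test = [72, 61, 40]   # differences vary in size and direction

bias_a, spread_a = error_components(criterion, offset_test)
bias_b, spread_b = error_components(criterion, erratic_test)

print(f"Offset test:  bias = {bias_a}, SD of differences = {spread_a:.1f}")
print(f"Erratic test: bias = {bias_b}, SD of differences = {spread_b:.1f}")
```

A constant offset with no spread is purely systematic and could be corrected, whereas a large spread of differences cannot be corrected for any individual.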
Literature Search Assignment
• The handout lists 8 questions which can be
answered through retrieving the corresponding
source articles
• Answer as many as possible and bring them to
next week’s lecture
• DO NOT contact author or order articles.
Selected Reading
• Atkinson, G. and Nevill, A. M. (1998) Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine. 26: 217-238.
• Holmes, T. H. Ten categories of statistical errors: a guide for research in endocrinology and metabolism. American Journal of Physiology. 286: E495-501.
• Thomas, J. R. and Nelson, J. K. (2001) Research Methods in Physical Activity, 4th edition. Champaign, Illinois: Human Kinetics.
J.Betts@bath.ac.uk

Example VO2 max data for the Reliability slides (ml·kg⁻¹·min⁻¹)
Valid and Reliable
Subject 1   60   60   60
Subject 2   55   55   55
Subject 3   70   70   70
Not Valid but Reliable
Subject 1   60   65   65
Subject 2   55   60   60
Subject 3   70   75   75
Not Valid and not Reliable
Subject 1   60   72   57
Subject 2   55   61   52
Subject 3   70   40   84
Relatively Reliable / Not Absolutely Reliable
Subject 1   60   63   57
Subject 2   55   56   48
Subject 3   70   65   66
Fatigue (8 am, 9 am, 10 am)
Subject 1   60   55   50
Habituation
Subject 1   60   65   70
