Professional Documents
Culture Documents
A221 SGDE4013 8 Validity&Validation VNB
2
INTENDED LEARNING OUTCOMES
1. Describe the notions related to validity concepts and different types of validity (C2)
2. Make connections between the concepts of reliability and validity (C3)
3. Identify different sources of validity evidence based on analysis of different testing situations
(C4)
4. Develop arguments and justifications on the validity issues in educational assessments (C6, A5)
5. Elaborate on the reasons why certain factors may affect validity (C6)
3
http://www.aera.net/Home/tabid/10041/Default.aspx
http://www.apa.org/index.aspx
http://www.ncme.org/ncme/NCME/
4
Standards (2014)
PART I: FOUNDATIONS
1. Validity
2. Reliability/Precision and Errors of Measurement
3. Fairness
PART II: OPERATIONS
4. Test Design & Development
5. Scales, Norms, Score Linking, and Cut Scores
6. Test Administration, Scoring, Reporting, & Interpretation
7. Supporting Documentation for Tests
8. Test Takers’ Rights and Responsibilities
9. Test Users’ Rights and Responsibilities
PART III: TESTING APPLICATIONS
10. Psychological Testing and Assessment
11. Workplace Testing and Credentialing
12. Educational Testing and Assessment
13. Uses of Tests for Program Evaluation, Policy Studies, and Accountability
5
OUTLINE
Brief History
Validity
Construct (convergent & discriminant)
Content
Criterion (predictive & concurrent)
Validation
5 Primary Sources of Validity Evidence from the Standards (AERA, APA, & NCME, 2014, 1999)
Test Content
Response Processes
Internal Structure
Relations with External Variables/Relationship to a Criterion
Consequences of Testing
Creating a Validity Argument
Factors Affecting Validity
Relationship between Validity & Reliability
6
VALIDITY: THEN & NOW
It was then recognized that a test might be put to multiple uses and that a
given [score from a test] might be valid for some uses but not for others.
That is, validity came to be understood as a characteristic of the
interpretation and use of test scores, and not of the test itself, because the
very same test (e.g., reading test) could be used to predict academic
performance, estimate the level of an individual’s proficiency, and diagnose
problems.
7
The concept of validity
has historically gone through a variety of iterations
that involved
“packing” different aspects into the concept
and subsequently “unpacking” some of them.
8
Brief History
9
VALIDITY THEORY: SOME HISTORY
APA (1954) in Technical Recommendations for Psychological Tests and
Diagnostic Techniques listed four types of validity
1. Construct validity
2. Content validity
3. Predictive validity
4. Concurrent validity
10
Nomological Network
(Cronbach & Meehl, 1955)
11
12
For example…
[Figure: nomological network examples. The construct “English Language Skills” with indicators Reading, Grammar, Listening, Writing, Speaking; a Malay example with Menulis (writing), Menganggar (estimating), Nombor (numbers).]
13
VALIDITY THEORY: SOME HISTORY (cont…)
APA (1966), based on Cronbach & Meehl (1955), collapsed predictive and concurrent validity into criterion-related validity in Standards for Educational and Psychological Tests and Manuals
1. Construct validity
2. Content validity
3. Criterion-related validity
predictive validity
concurrent validity
14
VALIDITY THEORY: SOME HISTORY (cont…)
[Figure: a test measures a construct; construct validity concerns how well it does so.]
15
VALIDITY THEORY: SOME HISTORY (cont…)
Messick (1965, 1980) asked two questions and provided suggestions on how to answer them:
Two Questions
1. Is the test any good as a measure of the characteristic it is interpreted to assess?
2. Should the test be used for the proposed purpose in the proposed way (an ethics issue)?
How to Answer
1. By assessing psychometric evidence, especially construct validity. This assessment provides an evidential basis for test interpretation.
2. By assessing the potential social consequences of the testing. This assessment provides a consequential basis for test use.
The gist is based on these 2 questions & 2 answers (Messick, 1965, 1980)
Figure adapted from Messick (1989, p. 17)
16
VALIDITY THEORY: SOME HISTORY (cont…)
APA, AERA, & NCME (1985) in Standards for Educational and Psychological Tests
redefined validity as the appropriateness, meaningfulness, and usefulness of the
specific inferences made from test scores. The unintended social consequences of the
test use (e.g., bias, adverse impact) were also included.
[Diagram: test score use, intended or unintended; inferences judged as appropriate, meaningful, and useful.]
17
VALIDITY THEORY: SOME HISTORY (cont…)
Messick (1989) stressed that all three types of validity relate to the valid interpretation and use of scores. It is not the types of validity but the relation between the evidence and the inferences drawn that should determine the validation focus. The varieties of evidence are not alternatives but rather supplements to one another. This is the main reason that validity is now recognized as a unitary concept.
18
Unitary Concept of Validity
(Messick, 1975, 1980, 1989)
The Analogy…
[Figure: under the unitary view, content validity and criterion-related validity are subsumed within construct validity.]
19
VALIDITY THEORY: SOME HISTORY (cont…)
20
VALIDITY THEORY: SOME HISTORY (cont…)
APA, AERA, & NCME (1999) stated that “[v]alidity refers to the degree to
which evidence & theory support the interpretations of test scores entailed by
proposed uses of tests” (p.9).
22
VALIDITY THEORY: SOME HISTORY (cont…)
Kane (2011, p. 7): “The unified model of construct validity (Messick, 1989) was conceptually elegant, but not very practical.”
Kane (2013) introduced the interpretation/use argument (IUA) as a refinement of the interpretive argument (IA): “In the past, I have talked about IAs but this expression may give too much weight to interpretations and not enough to uses” (p. 2). “Test scores can have multiple possible interpretations/uses, and it is the proposed interpretation/use that is validated, not the test itself or the test scores” (Kane, 2013, p. 21).
[Diagram: validation comprises an interpretive argument (IA), later the interpretation/use argument (IUA), plus a validity argument (VA).]
23
VALIDITY THEORY: SOME HISTORY (cont…)
APA, AERA, & NCME (2014) stated that “[v]alidity refers to the degree to which evidence
& theory support the interpretations of test scores for proposed uses of tests” (p.11).
24
VALIDITY IN ESSENCE
“… the most fundamental consideration in developing and evaluating tests” (AERA, APA, & NCME,
1999, p.9; 2014, p. 11).
“Validity is a theoretical notion that defines the scope and the nature of validation work..” (Xi, 2008, p.
177).
Validity refers to the degree to which each interpretation or use of a test score is supported by the
accumulated evidence. It constitutes the central notion underlying the development, administration,
and scoring of a test and the uses and interpretations of test scores (AERA, APA, & NCME, 2014, 1999;
Sireci, 2016 in SBAC Technical Report).
25
Validity
26
3 TYPES OF VALIDITY
1. Construct validity
Convergent validity/evidence
Discriminant validity/evidence
2. Content validity
3. Criterion-related validity
Predictive validity
Concurrent validity
27
1. CONSTRUCT VALIDITY
This is evaluated by examining the degree to which certain explanatory concepts (constructs) derived from theory account for performance on a measure.
It is a type of validity that depicts how a particular measure relates to other measures, consistent with theoretically derived hypotheses concerning the concepts or constructs being measured.
The measure is logically related to another variable, or is measuring the underlying construct as you had conceptualized it.
Although not explicitly mentioned by Cronbach & Meehl (1955), construct validity is often categorized into two types:
Convergent validity/evidence
Discriminant validity/evidence
28
1. CONSTRUCT VALIDITY (cont…)
Convergent validity/evidence
This is based on the idea that two instruments that are valid measures of the same concept should correlate rather highly with one another, or yield similar results, even though they are different instruments:
Different tests measuring the same construct
A test should correlate highly with other similar tests
Discriminant validity/evidence
This is based on the idea that two instruments should not correlate highly if they measure different concepts:
A test should correlate poorly with tests that are very dissimilar.
This approach thus involves the simultaneous assessment of numerous instruments (multi-method) and numerous concepts (multi-trait) through the computation of inter-correlations.
29
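The logic above can be sketched numerically. A minimal, hypothetical illustration (all scores invented for this sketch, not from the slides): two algebra tests should correlate highly (convergent evidence), while an algebra test and a reading test should not (discriminant evidence).

```python
# Hypothetical illustration of convergent vs. discriminant evidence:
# correlate scores from two algebra tests (same construct) and an
# unrelated reading test. All data below are invented.
import statistics as st

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = st.mean(xs), st.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

algebra_a = [55, 60, 72, 80, 85, 90]
algebra_b = [50, 63, 70, 78, 88, 93]   # similar test, similar ranking
reading   = [70, 52, 88, 60, 75, 58]   # different construct

convergent   = pearson(algebra_a, algebra_b)  # expected: high
discriminant = pearson(algebra_a, reading)    # expected: near zero
```

With these invented scores, the algebra-algebra correlation comes out high and the algebra-reading correlation close to zero, which is the pattern a multitrait-multimethod study looks for at scale.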
1. CONSTRUCT VALIDITY (cont…)
Convergent Validity
[Figure: a correlation matrix among several tests of the same construct; the inter-test correlations are high (0.83–0.91), illustrating convergent evidence.]
30
1. CONSTRUCT VALIDITY (cont…)
Discriminant Validity
31
1. CONSTRUCT VALIDITY (cont…)
Example: A test of basic algebra should primarily measure algebra-related constructs and not
reading constructs. In order to determine the construct validity of a particular algebra test, one
would need to demonstrate that the correlations of scores on that test with scores on other
algebra tests are higher than the correlations of scores on reading tests.
32
2. CONTENT VALIDITY
This is the extent to which a measuring instrument (the test) reflects a specific domain of content, content standards, skills, or knowledge that needs to be mastered.
It can also be viewed as the sampling adequacy of the content of the phenomenon being measured.
Do the items fairly represent all possible questions that could be asked?
Is there any overlap?
Is each domain covered properly?
A table of specifications/blueprint/JSU/JPU helps you answer these questions.
This type of validity is often used in the assessment of various educational and psychological tests.
The extent to which the test items are a sample of the universe in which the investigator is interested (Cronbach & Meehl, 1955).
33
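The sampling-adequacy questions above can be checked mechanically against a table of specifications. A hedged sketch with invented blueprint proportions and item assignments (the domain names and the 5-point tolerance are assumptions of this example):

```python
# Hypothetical check of content coverage against a table of specifications
# (JSU/blueprint): compare the planned share of items per content domain
# with the share actually present on the test form.
blueprint = {"algebra": 0.40, "geometry": 0.35, "statistics": 0.25}
item_domains = ["algebra"] * 8 + ["geometry"] * 7 + ["statistics"] * 5  # 20 items

n = len(item_domains)
actual = {d: item_domains.count(d) / n for d in blueprint}

# Flag domains whose coverage drifts more than 5 percentage points
# from the blueprint proportions.
flagged = {d for d in blueprint if abs(actual[d] - blueprint[d]) > 0.05}
```

Here the 20-item form matches the blueprint exactly, so nothing is flagged; in practice this check is one small piece of content-based evidence, alongside expert item ratings.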
3. CRITERION-RELATED VALIDITY
This is at issue when the purpose is to use an instrument to estimate some important form of behavior that is external to the measuring instrument itself, i.e., the criterion.
The validity of an instrument is established when its scores are compared with another criterion already known to be a measure of the same trait or skill:
IQ & achievement test
IQ & job performance
Personality & job performance
34
3. CRITERION-RELATED VALIDITY (cont…)
Concurrent validity
The degree to which a test score correlates with an external criterion score that is measured at the same time.
The strength of the relationship between test scores and criterion scores/information obtained at about the same time.
Refers to the ability of a measure to accurately estimate the current situation/status of an individual.
The score from the instrument being assessed is compared to some already existing criterion, such as the scores of another measuring device.
For example:
a language proficiency test score is related to a reading score obtained at the same time
scores on quizzes (formative assessments) are related to the overall course grade for that semester
a student who scores high on a Biology topic on cell structure & cell organization should also be able to use the microscope accurately when assessed during a lab session
35
3. CRITERION-RELATED VALIDITY (cont…)
Predictive validity
The degree to which a test predicts (correlates with) an external criterion that is measured some time in the future.
The strength of the relationship between test scores and criterion scores/information obtained at a later time.
This is where a score from an instrument is used to predict some future state of affairs (SPM predicts GPA; intention to drop out predicts actual dropout).
More examples here are the various educational tests used for selection purposes in different occupations and schools: the SPM, MUET, MEdSI, etc.
If people who score high on the MEdSI turn out to be better educators, then the MEdSI score is presumably a valid measure for selection into the teaching profession.
The prison system uses this to assess which offenders are less likely to recidivate (repeat crimes), using factors such as age, type of crime, family background, etc.
36
LET’S DO SOME EXERCISES
Go to https://quizlet.com/live
37
Validation
38
VALIDATION
“… validation is the process of developing and evaluating evidence for a proposed score interpretation
and use” (Xi, 2008, p. 177).
Validation is the process of accumulating evidence to support each proposed score interpretation or
use. This validation process does not rely on a single study or gathering one type of evidence. Rather,
validation involves multiple investigations and different kinds of supporting evidence (AERA, APA, &
NCME, 1999; 2014; ETS, 2002; Kane, 2006).
It begins with test design and is connected throughout the entire assessment process, which includes
item development and field-testing, analyses of items, test scaling and linking, scoring, and reporting.
test design → item development → field testing → item analyses → test scaling & linking → scoring → reporting → score validation
39
MORE ON VALIDATION based on
THE STANDARDS (AERA, APA, & NCME, 2014, 1999)
A process that involves accumulating relevant evidence to provide a sound scientific basis for the
proposed score interpretations.
It logically begins with an explicit statement of the proposed interpretations of test scores, along with
a rationale for the relevance of the interpretation to the proposed use.
A process of constructing and evaluating arguments for and against the intended interpretation of test scores and their relevance to the proposed use.
Decisions about what types of evidence are important for the validation argument in each instance
can be clarified by developing a set of propositions or claims that support the proposed
interpretation for the particular purpose of testing.
The validation process evolves as these propositions are articulated and evidence is gathered to
evaluate their soundness.
40
WHY VALIDATION?
Validation is the joint responsibility (tanggungjawab bersama) of the test developer and the test user
(AERA, APA, & NCME, 2014, 1999).
The test developer is responsible for furnishing relevant evidence and a rationale in support of the
intended use.
The test user is ultimately responsible for evaluating the evidence in the particular setting in which
the test is to be used (Standards, AERA, APA, & NCME, 1999, p. 11).
41
5 SOURCES OF VALIDITY EVIDENCE
42
SOURCES OF VALIDITY EVIDENCE #1: TEST CONTENT
This source of evidence refers to traditional forms of content validity evidence, such as:
The rating of test specifications and test items (Crocker, Miller, & Franks, 1989; Sireci, 1998),
“Alignment” methods for educational tests that evaluate the interactions between curriculum frameworks, testing, and
instruction.
The alignment studies are conducted to ensure the items adequately represent the domains
delineated in the test specifications. For example, we assume that the knowledge, skills, and abilities
measured in SPM are consistent with the ones specified in the KSSM.
Administration and scoring can be considered as aspects of content-based evidence. With computer
adaptive testing, an extra dimension of test content is to ensure that the tests administered to
students conform to the test blueprint.
43
SOURCES OF VALIDITY EVIDENCE #1: TEST CONTENT (cont…)
44
#1: TEST CONTENT EVIDENCE IN MUET
Source of Evidence: Test Content
Examples of Evidence, by Type:
• Test Construction Practices: MUET Regulations, Test Specifications, Test Format and Sample Questions (MEC, 2015)
• Test Administration Procedure: Part 10 in MUET Regulations, Test Specifications, Test Format and Sample Questions (MEC, 2015, p. 8)
• Interpretation of Scores: MUET Band Description (MEC, January 2016); http://portal.mpm.edu.my/documents/10156/f0ecb3db-fc56-4ec3-a817-bb836cc4294b
• Test Accommodations: Part 9 in MUET Regulations, Test Specifications, Test Format and Sample Questions (MEC, 2015, p. 7)
• Revision(s) of the Test: Major revision of the test in 2006
45
SOURCES OF VALIDITY EVIDENCE #2: RESPONSE PROCESSES
This source of evidence refers to “evidence concerning the fit between the construct and the detailed nature of performance or response actually engaged in by examinees” (AERA et al., 1999, p. 12).
student interviews concerning their responses to test items (i.e., think-alouds)
systematic observations of test response behaviour
students’ calculations/demonstrations of how they arrived at the answer
evaluation of the criteria used by judges when scoring performance tasks
analysis of student item-response-time data and distractor analysis
keystroke logs in computer-based testing
eye movements in computer-based testing
features scored by automated algorithms
evaluation of the reasoning processes students employ when solving test items (Embretson, 1983; Messick, 1989; Mislevy, 2009)
sensitivity, bias, and test fairness reviews
46
#2: RESPONSE PROCESSES EVIDENCE IN MUET
Item Response Time Part 17 (duration for each test component) in MUET Regulations,
Test Specifications, Test Format And Sample Questions (MEC, 2015,
p.9)
47
SOURCES OF VALIDITY EVIDENCE #3: INTERNAL STRUCTURE
This source of evidence refers to statistical analyses of item and score subdomains to investigate the
primary and secondary (if any) dimensions measured by an assessment.
With a vertical scale, a consistent primary dimension or construct shift across the levels of the test
should be maintained.
Internal structure evidence also evaluates the “strength” or “salience” of the major dimensions
underlying an assessment using indices of measurement precision such as reliability, decision
accuracy and consistency, generalizability coefficients, conditional and unconditional standard errors
of measurement, and test information functions (based on item difficulty & item discrimination
indices).
In addition, analyses of item functioning using Item Response Theory (IRT) and differential item functioning (DIF) studies fall under the internal structure category.
48
SOURCES OF VALIDITY EVIDENCE #3: INTERNAL STRUCTURE
(cont…)
item analyses
scale reliability (internal consistency)
standard error of measurement SEM for CTT
IRT item fit
conditional SEM
IRT test information
generalizability studies
cut-score decision consistency and accuracy
inter-item correlations
DIF study
factor/dimensionality analysis
49
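Several of the internal-structure analyses listed above (item difficulty, item discrimination, internal consistency, SEM) can be sketched from first principles. A minimal classical-test-theory example on an invented 0/1 response matrix (all data are made up for this sketch):

```python
# Classical test theory item analysis on a small hypothetical response
# matrix (rows = examinees, columns = items; 1 = correct, 0 = incorrect).
import statistics as st

responses = [
    [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1],
    [0, 0, 0, 1], [1, 1, 0, 0], [0, 1, 1, 1],
]
n_items = len(responses[0])
totals = [sum(row) for row in responses]

# Item difficulty (p-value): proportion of examinees answering correctly.
difficulty = [st.mean(row[j] for row in responses) for j in range(n_items)]

def pearson(xs, ys):
    mx, my = st.mean(xs), st.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

# Item discrimination: item-total correlation (uncorrected for overlap).
discrimination = [pearson([row[j] for row in responses], totals)
                  for j in range(n_items)]

# Cronbach's alpha (internal consistency) and SEM = SD * sqrt(1 - alpha).
item_vars = [st.pvariance([row[j] for row in responses]) for j in range(n_items)]
total_var = st.pvariance(totals)
alpha = (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)
sem = st.pstdev(totals) * (1 - alpha) ** 0.5
```

These are the same quantities reported as internal-structure evidence for MUET (e.g., a score reliability with its SEM), just computed here on toy data.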
#3: INTERNAL STRUCTURE EVIDENCE IN MUET
Source of Evidence: Internal Structure
Examples of Evidence, by Type:
• Pilot test: No evidence so far
• Reliability of Scores/Item Analysis (difficulty & discrimination): 2009 Reading Test score reliability is .78 with an SEM of 2.99 (Yusup, unpublished 2012 dissertation); the writing, reading, and speaking test scores are reliable measures of test-takers’ ability (Benchmarking Report, MEC, 2005)
• Scaling/Linking/Equating Analyses: No evidence
• Factor analysis/Dimensionality: No evidence
• DIF Study: No evidence
50
SOURCES OF VALIDITY EVIDENCE #4: RELATIONS TO EXTERNAL
VARIABLES
This source of evidence refers to traditional forms of criterion-related validity evidence such as
concurrent and predictive validity.
It also refers to comprehensive investigations of the relationships among test scores and other
variables (construct validity: convergent & discriminant evidence) such as multitrait-multimethod
studies (Campbell & Fiske, 1959).
51
SOURCES OF VALIDITY EVIDENCE #4: RELATIONS TO EXTERNAL
VARIABLES (cont)
sensitivity to instruction
criterion-related validation of “on track” indicators
criterion-related studies of change in achievement/growth
criterion-related validation of readiness
predictive validity
group differences (test fairness)
classroom artifacts
52
#4: EXTERNAL RELATIONSHIP EVIDENCE IN MUET
Source of Evidence: Relations to External Variables
Examples of Evidence, by Type:
• Relationships with Conceptually Related Constructs (convergent evidence): There was a high positive correlation between MUET and IELTS overall bands (Benchmarking Report, MEC, 2005). The correlation between MUET and a local public university EPT was too low for the two to be used interchangeably (Abu Kassim, Zubairi, & Mat Daud, 2007).
• Relationships with Criteria (i.e., Criterion-Related Validation Efforts): Predictive validity of MUET for college grades (academic achievement in higher education): the results of the studies varied such that they did not offer conclusive evidence about the predictive validity of MUET (Abd Samad, Syed Abd Rahman, & Yahya, 2008; Othman & Nordin, 2013; Rethinasamy & Chuah, 2011).
• Other Convergent/Discriminant Validity: Correlation between MUET and academic performance of electrical engineering students (Norlida Buniyamin, 2015).
53
SOURCES OF VALIDITY EVIDENCE #5: CONSEQUENCES OF TESTING
This evidence refers to the evaluation of the intended and unintended consequences associated with
a testing program.
55
#5: EVIDENCE BASED ON CONSEQUENCES OF TESTING IN MUET
(cont…)
Source of Evidence: Consequences of Testing
Examples of Evidence, by Type:
• Teacher Morale/Perception of Test Utility: Interviews and surveys of 9 high school teachers (Nambiar & Ransirini, 2012): teachers were frustrated because they felt forced to teach to the test; teachers’ creativity and innovation have been seriously hindered by MUET.
• Students’ Perspectives: Interviews and surveys of 108 high school students (Nambiar & Ransirini, 2012): the majority of students think MUET has contributed towards an improvement in their language proficiency; only half of the students perceived that their performance at university would be enhanced because of MUET; MUET has a positive impact on speaking and reading, the two skills most useful at university.
56
#5: EVIDENCE BASED ON CONSEQUENCES OF TESTING IN MUET
(cont…)
Source of Evidence: Consequences of Testing
Examples of Evidence, by Type:
• Teachers’ Perception of Changes in Student Learning: Interviews and surveys of 9 high school teachers (Nambiar & Ransirini, 2012): teachers viewed reading as the most important component of MUET and the most useful skill for surviving at the tertiary level; speaking is also important for success at the tertiary level.
• Score Report Utility & Clarity: MUET Band Description (MEC, January 2016); MUET Regulations, Test Specifications, Test Format and Sample Questions (MEC, 2015)
57
MORE ON MUET
http://portal.mpm.edu.my/web/guest/home
http://portal.mpm.edu.my/web/guest/perkhidmatan-online
http://portal.mpm.edu.my/documents/10156/f0ecb3db-fc56-4ec3-a817-bb836cc4294b
http://apps.mpm.edu.my/mod/public/register
MUET_Test_Specification_2015VersiPortal.pdf
58
https://assess.com/
59
5 SOURCES OF VALIDITY EVIDENCE: RECAP
60
5 SOURCES OF VALIDITY EVIDENCE: EXERCISE #1
For each of the following situations, specify which validity evidence the researcher is gathering.
• content
• consequences
• response-process
• criterion-related
• internal structure
61
5 SOURCES OF VALIDITY EVIDENCE: EXERCISE #1 (cont…)
For each of the following situations, specify which validity evidence the researcher is gathering.
• content
• consequences
• response-process
• criterion-related
• internal structure
1. The researcher estimates the difficulty and discrimination of each item. internal structure
2. The researcher asks a panel of experts to rate each item according to how well its content matches what the item is intended to measure. content
62
5 SOURCES OF VALIDITY EVIDENCE: EXERCISE #1 (cont…)
For each of the following situations, specify which validity evidence the researcher is gathering.
• content
• consequences
• response-process
• criterion-related
• internal structure
3. The researcher asks a group of five respondents to write down how they arrived at each response (e.g., what
processes they used, what they were thinking about, etc.). response-process
4. The researcher conducted a factor analysis and inter-item correlations to examine the pattern of
interrelationships between the items of the self-efficacy scale. internal
5. The researcher correlates the scores on the self-efficacy scale obtained for a sample of individuals with the scores obtained from a long, comprehensive self-efficacy scale. criterion-related
63
5 SOURCES OF VALIDITY EVIDENCE: EXERCISE #2 (cont…)
For each of the following situation, specify which of the five sources of validity evidence it addresses.
• content
• consequences
• response-process
• criterion-related
• internal structure
1. The scores of a scale measuring job performance potential are not correlated with actual job
productivity. criterion-related
2. Patricia completes a psychological inventory measuring extroversion. The possible scores that any individual can receive range from 0 (low extroversion) to 50 (high extroversion). On three different occasions, Patricia receives scores of 35, 42, and 27 on the extroversion scale. internal
3. A researcher using a scale measuring fidelity notices that, for a particular item, individuals scoring high on the item tended to score low on the other items. internal
64
Creating a Validity Argument
65
CREATING A VALIDITY ARGUMENT
66
VALIDITY ARGUMENT (cont…): SOME QUESTIONS TO BE ASKED…
Validity (Kesahan) concerns whether the obtained scores lead to the correct and intended interpretations and decisions about people.
Are the scores used in the intended fashion, and do they lead to the intended outcomes and decisions?
Do we use the scores correctly?
Do we make the right decisions for people based on the test scores?
Are Biology test scores used to make assumptions/predictions about students’ ability to enter medicine and graduate successfully?
67
VALIDITY ARGUMENT (cont…): Isu Lulus Bahasa Malaysia (the Bahasa Malaysia pass requirement issue)
http://www.astroawani.com/berita-malaysia/isu-lulus-bm-doktor-inggeris-dahulu-pun-cakap-melayu-rais-148149 68
https://www.themalaysianinsight.com/bahasa/s/6734/
VALIDITY ARGUMENT (cont…): Isu Lulus Bahasa Malaysia (cont…)
How do SPM Bahasa Malaysia scores reflect doctors’ ability to use Bahasa Malaysia fluently when discussing medical terms with patients who prefer to speak Bahasa Malaysia?
http://www.astroawani.com/berita-malaysia/dakwaan-kertas-spm-bahasa-melayu-matematik-baha
sa-cina-bocor-tidak-benar-191368
https://www.bharian.com.my/berita/nasional/2018/11/498975/kertas-peperiksaan-spm-bocor-tid
ak-benar
https://www.bharian.com.my/berita/nasional/2018/11/498933/kementerian-pendidikan-nafi-soal
an-matematik-spm-bocor
70
HYPOTHETICAL SITUATION:
LET’S DO A ROLE PLAY! THE NEWS IS TRUE!!!!
We are the…
1. Minister and Deputy Minister
2. Test takers from the involved school
3. Test takers from other schools
4. Parents
5. Math Teachers
6. Malaysian Examination Council Officials
7. Public & Netizen
VALIDITY IN ESSENCE: RECAP
“… the most fundamental consideration in developing and evaluating tests” (AERA, APA, & NCME, 1999, p.9; 2014, p. 11).
“Validity is a theoretical notion that defines the scope and the nature of validation work..” (Xi, 2008, p.
177).
Validity refers to the degree to which each interpretation or use of a test score is supported by the
accumulated evidence. It constitutes the central notion underlying the development, administration,
and scoring of a test and the uses and interpretations of test scores (AERA, APA, & NCME, 2014, 1999;
Sireci, 2016 in SBAC Technical Report).
72
VALIDATION IN ESSENCE: RECAP
Validation is the process of accumulating evidence to support each proposed score interpretation or
use. This validation process does not rely on a single study or gathering one type of evidence. Rather,
validation involves multiple investigations and different kinds of supporting evidence (AERA, APA, &
NCME, 1999; 2014; ETS, 2002; Kane, 2006).
The test developer is responsible for furnishing relevant evidence and a rationale in support of the
intended use.
The test user is ultimately responsible for evaluating the evidence in the particular setting in which
the test is to be used (Standards, AERA, APA, & NCME, 1999, p. 11).
73
BACK TO OUTLINE based on OUR DISCUSSION SO FAR…
Brief History
Validity
Construct (convergent & discriminant)
Content
Criterion (predictive & concurrent)
Validation
5 Primary Sources of Validity Evidence from the Standards (AERA, APA, & NCME, 2014, 1999)
Test Content
Response Processes
Internal Structure
Relations with External Variables/Relationship to a Criterion
Consequences of Testing
Creating a Validity Argument
Factors Affecting Validity
Relationship between Validity & Reliability
Timeline: APA (1954) → Cronbach & Meehl (1955) → APA (1966) → Messick (1975, 1980, 1989) → APA, AERA, & NCME (1985) → Kane (1992, 2006, 2011, 2013) → APA, AERA, & NCME (1999) → APA, AERA, & NCME (2014)
74
Factors Affecting Validity
75
FACTORS AFFECTING VALIDITY
76
FACTORS AFFECTING VALIDITY (cont…)
77
FACTORS AFFECTING VALIDITY (cont…)
Test administration
whether the testing conditions are appropriate or not
whether unexpected disturbances occur or not
whether the test administrators administer the test according to the test manual/guidelines or not
whether the test instructions for examinees are clear or not
Scoring
whether the scoring system is objective and standardized or not
scoring fails to capture important qualities of task performance
undue emphasis on some criteria, forms, or styles of response
lack of intra-rater or inter-rater consistency
scoring that is too analytic
scoring that is too holistic
78
FACTORS AFFECTING VALIDITY (cont…)
Examinees
whether the sample is representative and heterogeneous or not
interest in and motivation for the test
emotional state and attitude during testing
state of physical health
prior experience with tests
assessment anxiety
79
Factors Affecting
Validity
80
Cont…
81
Cont…
82
Cont…
83
Factors Affecting Validity (Patricia’s slide, A172 Group B)
Structure of the items
Difficulty level of the test
Arrangement of items and correct responses
84
Relationship
between
Validity & Reliability
85
RELATIONSHIP BETWEEN VALIDITY & RELIABILITY
86
RELATIONSHIP BETWEEN VALIDITY & RELIABILITY (cont…)
87
RELATIONSHIP BETWEEN VALIDITY & RELIABILITY (cont…)
88
RELATIONSHIP BETWEEN VALIDITY & RELIABILITY (cont…)
89
Relationship Between Validity and Reliability
• Validity and reliability are closely related.
• A test score cannot be considered valid unless it is reliable.
• Likewise, results from a test can be reliable yet not necessarily valid.
90
Cont..
• A valid test score is always reliable (for a test score to be valid, it needs to be reliable in the first place).
• A reliable test score is not always valid.
• For test scores to be useful, a measuring instrument (test, scale) must be both reasonably reliable and valid.
• Aim for validity first, and then try to make the test more reliable little by little, rather than the other way around.
91
IF → THEN
• Unreliable → Validity is undermined.
• Reliable, but not valid → The test and its scores are not useful.
• Unreliable and invalid → The test and its scores are definitely NOT useful!
• Reliable and valid → The test will yield valid & reliable scores (trustworthy and of good quality).
92
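The table above has a classical-test-theory counterpart: measurement error caps how strongly scores can correlate with any criterion. A common approximation (an assumption of this sketch, not stated in the slides) bounds the validity coefficient by the square root of the reliability:

```python
# Classical test theory sketch: an (approximate) upper bound on the
# validity coefficient given the reliability of the scores. This is an
# illustrative assumption, not a result quoted from the slides.
def max_validity(reliability: float) -> float:
    """Theoretical ceiling on a validity coefficient, sqrt(reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be in [0, 1]")
    return reliability ** 0.5

# e.g. scores with reliability .64 cannot correlate with any criterion
# above roughly sqrt(.64) = .80, however well-chosen the criterion.
ceiling = max_validity(0.64)
```

This is why unreliable scores undermine validity: no amount of evidence-gathering can push a validity coefficient above the ceiling set by measurement error.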
OPTIMIZING RELIABILITY & VALIDITY
93
SELECTING & CREATING MEASURES
94
RELIABILITY & VALIDITY: PROPERTIES OF SCORES
A particular instrument may generate scores that are reliable and valid for one population or
individual, but not for another.
For example, a science test given in English may yield valid and reliable scores in the U.S., but not in China.
So, when we speak of reliability and validity, we talk about “scores” being reliable or valid, not
“instruments” or “tests”.
95
RECAP
Validity
Construct
Content
Criterion
Validation
Essential Validity Evidence from the Standards (AERA, APA, & NCME,
1999, p.17)
5 Primary Sources of Validity Evidence from the Standards (AERA,
APA, & NCME, 2014, 1999)
Test Content
Response Processes
Internal Structure
Relations with External Variables/Relationship to a
Criterion/External Relationship
Consequences of Testing
Creating a Validity Argument
Factors Affecting Validity
Relationship between Validity & Reliability 96
nurliyana@uum.edu.my
nurliyana.bukhari@alumni.uncg.edu
97