A221 SGDE4013 8 Validity&Validation VNB


1

SGDE4013 Assessment in Education

Validity & Validation (Briefly…)


Nurliyana Bukhari, Ph.D.

School of Education (SOE)

Semester A221 | First Semester 2022/2023

2
INTENDED LEARNING OUTCOMES

At the end of this lesson, students will be able to:

1. Describe the notions related to validity concepts and different types of validity (C2)
2. Make connections between the concepts of reliability and validity (C3)
3. Identify different sources of validity evidence based on analysis of different testing situations
(C4)
4. Develop arguments and justifications on the validity issues in educational assessments (C6, A5)
5. Elaborate on the reasons why certain factors may affect validity (C6)

3
http://www.aera.net/Home/tabid/10041/Default.aspx
http://www.apa.org/index.aspx

http://www.ncme.org/ncme/NCME/

4
Standards (2014)
PART I: FOUNDATIONS
1. Validity
2. Reliability/Precision and Errors of Measurement
3. Fairness

PART II: OPERATIONS
4. Test Design & Development
5. Scales, Norms, Score Linking, and Cut Scores
6. Test Administration, Scoring, Reporting, & Interpretation
7. Supporting Documentation for Tests
8. Test Takers’ Rights and Responsibilities
9. Test Users’ Rights and Responsibilities

PART III: TESTING APPLICATIONS
10. Psychological Testing and Assessment
11. Workplace Testing and Credentialing
12. Educational Testing and Assessment
13. Uses of Tests for Program Evaluation, Policy Studies, and Accountability

5
OUTLINE
Brief History
Validity
Construct (convergent & discriminant)
Content
Criterion (predictive & concurrent)
Validation
5 Primary Sources of Validity Evidence from the Standards (AERA, APA, & NCME, 2014, 1999)
Test Content
Response Processes
Internal Structure
Relations with External Variables/Relationship to a Criterion
Consequences of Testing
Creating a Validity Argument
Factors Affecting Validity
Relationship between Validity & Reliability

6
VALIDITY: THEN & NOW

“At first, validity was viewed as a characteristic of the test.

It was then recognized that a test might be put to multiple uses and that a
given [score from a test] might be valid for some uses but not for others.
That is, validity came to be understood as a characteristic of the
interpretation and use of test scores, and not of the test itself, because the
very same test (e.g., reading test) could be used to predict academic
performance, estimate the level of an individual’s proficiency, and diagnose
problems.

Today, validity theory incorporates both test interpretation and use


(e.g., intended and unintended social consequences). ”

(The National Research Council, 2002, p. 35)

7
The concept of validity
has historically gone through a series of iterations
that involved
“packing” different aspects into the concept
and subsequently “unpacking” some of them.

The concept of validity,
as described in the literature,
has changed over time to become
a broad and somewhat complex issue.

8
Brief History

9
VALIDITY THEORY: SOME HISTORY
APA (1954) in Technical Recommendations for Psychological Tests and
Diagnostic Techniques listed four types of validity
1. Construct validity
2. Content validity
3. Predictive validity
4. Concurrent validity

Cronbach & Meehl (1955) discussed the term nomological network


and grouped the four types of validity (APA, 1954) into three types
1. Construct validity
2. Content validity
3. Criterion-related validity
predictive validity
concurrent validity

10
Nomological Network
(Cronbach & Meehl, 1955)

"nomological" is derived from Greek—means "lawful“

nomological network can be thought of as the "lawful network"

“A network would include


the theoretical framework (construct: latent) for what we are trying to measure,
an empirical framework (observable attributes) for how we intend to measure it, and
specification of the linkages among and between these two frameworks”

(Messick, 1989, p. 23)

11
Nomological Network
(Cronbach & Meehl, 1955)

12
[Diagram: English Language Skills as the construct, with Reading, Grammar, Listening, Writing, and Speaking as subdomains, each measured by Items 1 to N]

[Diagram: Matematik Tahun 1: Nombor Bulat 1-100 (Year 1 Mathematics: Whole Numbers 1-100) as the construct, with Nilai Nombor (number value), Nilai Tempat (place value), Membundar (rounding), Menulis Nombor (writing numbers), and Menganggar (estimating) as subdomains, each measured by Items 1 to N]

13
VALIDITY THEORY: SOME HISTORY (cont…)

APA (1966), based on Cronbach & Meehl (1955), collapsed the predictive and
concurrent validity into criterion-related validity in Standards for Educational
and Psychological Tests and Manuals
1. Construct validity
2. Content validity
3. Criterion-related validity
predictive validity
concurrent validity

14
VALIDITY THEORY: SOME HISTORY (cont…)

Messick (1975) stated that “any brief discussion of the meaning of a


measure should center on the concept of validity and, specifically, on the
concept of construct validity, for that is the evidential basis for inferring a
measure's meaning” (p. 955).

[Diagram: a measure's meaning is linked to its construct through construct validity]
15
VALIDITY THEORY: SOME HISTORY (cont…)

Messick (1965, 1980) asked two questions and suggested how to answer them:

Two questions
1. Is the test any good as a measure of the characteristics it is interpreted to assess?
2. Should the test be used for the proposed purpose in the proposed way (ethical issues)?

How to answer
1. By assessing psychometric evidence, especially construct validity. This assessment provides an
evidential basis for test interpretation.
2. By assessing the potential social consequences of the testing. This assessment provides a
consequential basis for test use.

The gist is based on the two questions & two answers (Messick, 1965, 1980).
Figure adapted from Messick (1989, p. 17) 16
VALIDITY THEORY: SOME HISTORY (cont…)

APA, AERA, & NCME (1985) in Standards for Educational and Psychological Tests
redefined validity as the appropriateness, meaningfulness, and usefulness of the
specific inferences made from test scores. The unintended social consequences of the
test use (e.g., bias, adverse impact) were also included.

[Diagram: are the inferences from scores appropriate, meaningful, and useful? Is the use intended or unintended?]
17
VALIDITY THEORY: SOME HISTORY (cont…)

Messick (1989) stressed that the three types of validity are all related to the valid
interpretation and use of scores. It is not the type of validity but the relation
between the evidence and the inferences drawn that should determine the
validation focus. The varieties of evidence are not alternatives but rather
supplements to one another. This is the main reason that validity is now
recognized as a unitary concept.

He defines validity as “an integrated evaluative judgment of the degree to


which empirical evidence and theoretical rationales support the adequacy
and appropriateness of inferences and actions based on test scores” (p.1).

18
Unitary Concept of Validity
(Messick, 1975, 1980, 1989)

The Analogy…

[Diagram: under the unitary view, content validity and criterion-related validity are subsumed within construct validity]

19
VALIDITY THEORY: SOME HISTORY (cont…)

Kane (1992) introduced the argument-based approach to validation (the validity
argument, VA) and developed the interpretive argument (IA) for a test. He
asserted that validity is related to the interpretation assigned to test scores,
not to the test scores or the test. The interpretation involves an argument
leading from the scores to score-based statements or decisions (VA). The
validity of the interpretation depends on the plausibility of this IA.

validation → validity argument (VA) → interpretive argument (IA)

20
VALIDITY THEORY: SOME HISTORY (cont…)

“The target domains
of most interest in education
are not restricted to test items or test-like tasks,
although they may include this kind of formal performance as a subset.

A person’s level of literacy in a language depends on his or her ability
to perform a variety of tasks in a variety of contexts, ranging from
the casual reading of a magazine to
the careful study of a textbook or technical manual.

These performances can occur in a variety of
locations and social situations.”

(Kane, 2006, p. 31)

[Slide label: interpretive arguments (IAs)]


21
VALIDITY THEORY: SOME HISTORY (cont…)

APA, AERA, & NCME (1999) stated that “[v]alidity refers to the degree to
which evidence & theory support the interpretations of test scores entailed by
proposed uses of tests” (p.9).

22
VALIDITY THEORY: SOME HISTORY (cont…)

Kane (2011, p. 7) noted that the unified model of construct validity (Messick, 1989) was conceptually
elegant, but not very practical.

Kane (2013) introduced the interpretive/use argument (IUA) in addition to the IA. “In the past, I have talked
about IAs but this expression may give too much weight to interpretations and not enough to uses”
(p. 2). “Test scores can have multiple possible interpretations/uses, and it is the proposed
interpretation/use that is validated, not the test itself or the test scores” (Kane, 2013, p. 21).

validation → validity argument (VA) → interpretive argument (IA) & interpretive/use argument (IUA)

23
VALIDITY THEORY: SOME HISTORY (cont…)

APA, AERA, & NCME (2014) stated that “[v]alidity refers to the degree to which evidence
& theory support the interpretations of test scores for proposed uses of tests” (p.11).

24
VALIDITY IN ESSENCE

“… the most fundamental consideration in developing and evaluating tests” (AERA, APA, & NCME,
1999, p.9; 2014, p. 11).

“Validity is a theoretical notion that defines the scope and the nature of validation work” (Xi, 2008, p.
177).

Validity refers to the degree to which each interpretation or use of a test score is supported by the
accumulated evidence. It constitutes the central notion underlying the development, administration,
and scoring of a test and the uses and interpretations of test scores (AERA, APA, & NCME, 2014, 1999;
Sireci, 2016 in SBAC Technical Report).

“Validity refers to the degree to which evidence and theory


support the interpretations of test scores for proposed uses of
tests” (AERA, APA, & NCME, 2014, p.11)

25
Validity

26
3 TYPES OF VALIDITY

1. Construct validity
Convergent validity/evidence
Discriminant validity/evidence

2. Content validity

3. Criterion-related validity
Predictive validity
Concurrent validity

27
1. CONSTRUCT VALIDITY

This is evaluated by examining the degree to which certain explanatory concepts (constructs)
derived from theory account for performance on a measure.

It is a type of validity which depicts how a particular measure relates to other measures, consistent
with theoretically derived hypotheses concerning the concepts or constructs that are being
measured.

The process of construct validation is theory-laden.

The measure is logically related to another variable or is measuring the underlying construct as you
had conceptualized it.

Although not explicitly mentioned by Cronbach & Meehl (1955), construct validity is often
categorized into two types:
Convergent validity/evidence
Discriminant validity/evidence 28
1. CONSTRUCT VALIDITY (cont…)

Convergent validity/evidence
This is based on the idea that two instruments that are valid measures of the same concept should
correlate rather highly with one another or yield similar results even though they are different
instruments:
Different tests measuring the same construct
A test should correlate highly with other similar tests

Discriminant validity/evidence
This is based on the idea that two instruments, although similar to one another, should not correlate
highly if they measure different concepts:
A test should correlate poorly with tests that are very dissimilar.
This approach thus involves the simultaneous assessment of numerous instruments (multi-method)
and numerous concepts (multi-trait) through the computation of inter-correlations.

29
1. CONSTRUCT VALIDITY (cont…)
Convergent Validity

[Correlation matrix from the slide: several tests of the same construct correlate highly with one another (coefficients of about .83 to .91), illustrating convergent evidence]

30
1. CONSTRUCT VALIDITY (cont…)
Discriminant Validity

31
1. CONSTRUCT VALIDITY (cont…)

Convergent & Discriminant validity


For teachers, both convergent and discriminant validity provide important evidence in the case of
construct validity.

Example: A test of basic algebra should primarily measure algebra-related constructs and not
reading constructs. In order to determine the construct validity of a particular algebra test, one
would need to demonstrate that the correlations of scores on that test with scores on other
algebra tests are higher than the correlations of scores on reading tests.
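
To make that comparison concrete, here is a minimal, hypothetical sketch in Python (all data are simulated and every variable name is invented for illustration). It generates scores for two algebra tests and one reading test from separate latent abilities and shows that the correlation between the two algebra tests (convergent evidence) is higher than the correlation between an algebra test and the reading test (discriminant evidence).

import numpy as np

rng = np.random.default_rng(0)
n = 100
algebra_ability = rng.normal(size=n)   # latent construct (hypothetical)
reading_ability = rng.normal(size=n)   # a different latent construct

# observed test scores = latent ability + measurement error
algebra_test_a = algebra_ability + rng.normal(scale=0.4, size=n)
algebra_test_b = algebra_ability + rng.normal(scale=0.4, size=n)
reading_test = reading_ability + rng.normal(scale=0.4, size=n)

convergent_r = np.corrcoef(algebra_test_a, algebra_test_b)[0, 1]   # expected to be high
discriminant_r = np.corrcoef(algebra_test_a, reading_test)[0, 1]   # expected to be near zero

print(f"algebra A vs algebra B (convergent):  r = {convergent_r:.2f}")
print(f"algebra A vs reading   (discriminant): r = {discriminant_r:.2f}")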

32
2. CONTENT VALIDITY

This is the extent to which a measuring instrument (the test) reflects a specific domain of content,
content standards, skills, or knowledge that need to be mastered.

It can also be viewed as the sampling adequacy of the content of the phenomenon being measured.
Do the items fairly represent all possible questions to be asked?
Is there any overlap?
Is each domain covered properly?
A table of specifications/blueprint/JSU/JPU helps you to answer these questions (see the sketch below)

This type of validity is often used in the assessment of various educational and psychological tests.

Content validation is essentially judgmental (e.g., based on subject-matter experts).

The extent to which the test items are a sample of the universe in which the investigator is interested
(Cronbach & Meehl, 1955).
33
3. CRITERION-RELATED VALIDITY

This is an issue when the purpose is to use an instrument to estimate some important form of
behavior that is external to the measuring instrument itself, i.e., the criterion.

The validity of an instrument is assessed by comparing its scores with another criterion already known to be a
measure/score of the same trait or skill:
IQ & achievement test
IQ & job performance
Personality & job performance

The validity is classified into:


Concurrent validity
Predictive validity

34
3. CRITERION-RELATED VALIDITY (cont…)

Concurrent validity
The degree to which a test score correlates with an external criterion score that is measured at the same time.
The strength of the relationship between test scores and criterion scores/information that are obtained at
about the same time.
Refers to the ability of a measure to accurately predict the current situation/status of an individual.
The score from an instrument being assessed is compared to some already existing criterion, such as the scores
of another measuring device.

For example:
a language proficiency test score is related to a reading score obtained at the same time
scores on quizzes (formative assessments) are related to the overall course grade for that semester
a student who scores high on a Biology topic on cell structure & cell organization should also be able to use the microscope
accurately when assessed during the lab session

35
3. CRITERION-RELATED VALIDITY (cont…)

Predictive validity
The degree to which a test predicts (correlates with) an external criterion that is measured some time in the
future.
The strength of the relationship between test scores and criterion scores/information that are obtained at a
later time.
This is where a score from an instrument is used to predict some future state of affairs (SPM predicts GPA;
intention to drop out predicts actual dropout).
More examples here are the various educational tests used for selection purposes in different occupations and
schools: the SPM, MUET, MEdSI, etc.
If people who score high on the MEdSI are better educators, then the MEdSI score is presumably a valid measure
for selection into the teaching profession.
The prison system uses this to assess which criminals are less likely to reoffend (repeat crimes). They use factors
such as age, type of crime, family background, etc.
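
A minimal sketch, assuming simulated data, of how both forms of criterion-related evidence reduce to a correlation between test scores and a criterion: the "concurrent" criterion is collected at about the same time as the test, while the "predictive" criterion is observed later (e.g., a future CGPA). The variable names and noise levels are invented for illustration.

import numpy as np

rng = np.random.default_rng(1)
n = 200
ability = rng.normal(size=n)                               # latent trait (hypothetical)

test_score = ability + rng.normal(scale=0.5, size=n)       # the instrument being validated
criterion_now = ability + rng.normal(scale=0.5, size=n)    # e.g., a lab task scored the same week
criterion_later = ability + rng.normal(scale=0.9, size=n)  # e.g., CGPA observed a year later

concurrent_r = np.corrcoef(test_score, criterion_now)[0, 1]
predictive_r = np.corrcoef(test_score, criterion_later)[0, 1]
print(f"concurrent validity coefficient: r = {concurrent_r:.2f}")
print(f"predictive validity coefficient: r = {predictive_r:.2f}")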

36
LET’S DO SOME EXERCISES

Go to https://quizlet.com/live

37
Validation

38
VALIDATION

“… validation is the process of developing and evaluating evidence for a proposed score interpretation
and use” (Xi, 2008, p. 177).

Validation is the process of accumulating evidence to support each proposed score interpretation or
use. This validation process does not rely on a single study or gathering one type of evidence. Rather,
validation involves multiple investigations and different kinds of supporting evidence (AERA, APA, &
NCME, 1999; 2014; ETS, 2002; Kane, 2006).

It begins with test design and is connected throughout the entire assessment process, which includes
item development and field-testing, analyses of items, test scaling and linking, scoring, and reporting.

test design → item development → field testing → item analyses → test scaling & linking → scoring → reporting → score

(validation spans this entire process) 39
MORE ON VALIDATION based on
THE STANDARDS (AERA, APA, & NCME, 2014, 1999)
A process that involves accumulating relevant evidence to provide a sound scientific basis for the
proposed score interpretations.

It logically begins with an explicit statement of the proposed interpretations of test scores, along with
a rationale for the relevance of the interpretation to the proposed use.

A process of constructing and evaluating arguments for and against the intended interpretation of test
scores and their relevance to the proposed use.

Decisions about what types of evidence are important for the validation argument in each instance
can be clarified by developing a set of propositions or claims that support the proposed
interpretation for the particular purpose of testing.

The validation process evolves as these propositions are articulated and evidence is gathered to
evaluate their soundness.
40
WHY VALIDATION?

Validation is the joint responsibility (tanggungjawab bersama) of the test developer and the test user
(AERA, APA, & NCME, 2014, 1999).

The test developer is responsible for furnishing relevant evidence and a rationale in support of the
intended use.

The test user is ultimately responsible for evaluating the evidence in the particular setting in which
the test is to be used (Standards, AERA, APA, & NCME, 1999, p. 11).

41
5 SOURCES OF VALIDITY EVIDENCE

1. Evidence based on Test Content


2. Evidence based on Response Processes
3. Evidence based on Internal Structure
4. Evidence based on Relations to Other Variables
5. Evidence for Validity and Consequences of Testing

42
SOURCES OF VALIDITY EVIDENCE #1: TEST CONTENT

This source of evidence refers to traditional forms of content validity evidence, such as:
The rating of test specifications and test items (Crocker, Miller, & Franks, 1989; Sireci, 1998),
“Alignment” methods for educational tests that evaluate the interactions between curriculum frameworks, testing, and
instruction.

The alignment studies are conducted to ensure the items adequately represent the domains
delineated in the test specifications. For example, we assume that the knowledge, skills, and abilities
measured in SPM are consistent with the ones specified in the KSSM.

Administration and scoring can be considered as aspects of content-based evidence. With computer
adaptive testing, an extra dimension of test content is to ensure that the tests administered to
students conform to the test blueprint.

43
SOURCES OF VALIDITY EVIDENCE #1: TEST CONTENT (cont…)

Some Examples of Evidence based on Test Content

testing purposes (theory of action) clearly stated


clear construct definition
test specifications/blueprints sufficiently documented (item matches the construct/domain of interest)
item writers appropriately recruited and trained
items adhere to item writing style guidelines (wordings, item/task formats)
items reviewed for content quality and technical adequacy
test booklet conforms to test specification/blueprint
appropriate test administration (based on test administration manual, functioning test delivery
system)
scoring rubrics
rater’s/scorer’s justification for giving a particular score/mark to students
subject-matter experts’ judgment

44
#1: TEST CONTENT EVIDENCE IN MUET
Source of Evidence: Test Content

Evidence Type: Examples of Evidence
Test Construction Practices: MUET Regulations, Test Specifications, Test Format and Sample Questions (MEC, 2015)
Test Administration Procedure: Part 10 in MUET Regulations, Test Specifications, Test Format and Sample Questions (MEC, 2015, p. 8)
Interpretation of Scores: MUET Band Description (MEC, January 2016); http://portal.mpm.edu.my/documents/10156/f0ecb3db-fc56-4ec3-a817-bb836cc4294b
Parallel Test Administration: No evidence
Test Accommodations: Part 9 in MUET Regulations, Test Specifications, Test Format and Sample Questions (MEC, 2015, p. 7)
Revision(s) of the Test: Major revision of the test in 2006
45
SOURCES OF VALIDITY EVIDENCE #2: RESPONSE PROCESSES

This source of evidence refers to “evidence concerning the fit between the construct and the detailed
nature of performance or response actually engaged in by examinees” (AERA et al., 1999 p. 12).

Some Examples of Evidence based on Response Processes

students’ interview concerning their responses to test items (i.e., think aloud)
systematic observations of test response behaviour
students’ calculations/demonstration to get the answer
evaluation of the criteria used by judges when scoring performance tasks
analysis of student item-response-time data and distractor analysis
keystroke logging in computer-based testing
eye movement (eye-tracking) in computer-based testing
features scored by automated algorithms
evaluation of the reasoning processes students employ when solving test items (Embretson, 1983;
Messick, 1989; Mislevy, 2009)
sensitivity, bias and test fairness reviews
46
#2: RESPONSE PROCESSES EVIDENCE IN MUET

Source of Evidence: Response Processes

Evidence Type: Examples of Evidence
Systematic Observation: Students’ draft paper for speaking
Introspective Study: Evidence not found so far
Item Response Time: Part 17 (duration for each test component) in MUET Regulations, Test Specifications, Test Format and Sample Questions (MEC, 2015, p. 9)
Test Fairness: No statement found so far

47
SOURCES OF VALIDITY EVIDENCE #3: INTERNAL STRUCTURE
This source of evidence refers to statistical analyses of item and score subdomains to investigate the
primary and secondary (if any) dimensions measured by an assessment.

Procedures for gathering evidence on construct dimensionality include factor analysis or
multidimensional IRT scaling (both exploratory and confirmatory).

With a vertical scale, a consistent primary dimension or construct shift across the levels of the test
should be maintained.

Internal structure evidence also evaluates the “strength” or “salience” of the major dimensions
underlying an assessment using indices of measurement precision such as reliability, decision
accuracy and consistency, generalizability coefficients, conditional and unconditional standard errors
of measurement, and test information functions (based on item difficulty & item discrimination
indices).

In addition, analysis of item functioning using Item Response Theory (IRT) and differential item
functioning (DIF) fall under the internal structure category. 48
SOURCES OF VALIDITY EVIDENCE #3: INTERNAL STRUCTURE
(cont…)

Some Examples of Evidence based on Internal Structure

item analyses
scale reliability (internal consistency)
standard error of measurement (SEM) for CTT (see the sketch after this list)
IRT item fit
conditional SEM
IRT test information
generalizability studies
cut-score decision consistency and accuracy
inter-item correlations
DIF study
factor/dimensionality analysis
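
The sketch below shows two of these internal-structure statistics on simulated item data: Cronbach's alpha for internal consistency and the classical test theory SEM computed as SD * sqrt(1 - alpha). The examinee and item counts are hypothetical.

import numpy as np

def cronbach_alpha(items):
    """items: examinees-by-items matrix of item scores."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

rng = np.random.default_rng(2)
ability = rng.normal(size=(300, 1))                               # 300 simulated examinees
items = (ability + rng.normal(size=(300, 10)) > 0).astype(float)  # 10 dichotomous items

alpha = cronbach_alpha(items)
total_scores = items.sum(axis=1)
sem = total_scores.std(ddof=1) * np.sqrt(1 - alpha)               # CTT standard error of measurement
print(f"Cronbach's alpha = {alpha:.2f}, SEM = {sem:.2f} raw-score points")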

49
#3: INTERNAL STRUCTURE EVIDENCE IN MUET
Source of Evidence: Internal Structure

Evidence Type: Examples of Evidence
Pilot Test: No evidence so far
Reliability of Scores/Item Analysis (difficulty & discrimination): 2009 Reading Test score reliability is .78 with an SEM of 2.99 (Yusup, unpublished 2012 dissertation); the writing, reading, and speaking test scores are reliable measures of test-takers’ ability, Benchmarking Report (MEC, 2005) (see the note below)
Scaling/Linking/Equating Analyses: No evidence
Factor Analysis/Dimensionality: No evidence
DIF Study: No evidence
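
The reliability (.78) and SEM (2.99) reported above are linked by the classical test theory relation below; the implied raw-score standard deviation is a back-calculation added here for illustration only, not a figure reported in the MUET sources.

SEM = SD_X \sqrt{1 - r_{XX'}} \quad\Rightarrow\quad SD_X = \frac{SEM}{\sqrt{1 - r_{XX'}}} = \frac{2.99}{\sqrt{1 - 0.78}} \approx 6.4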

50
SOURCES OF VALIDITY EVIDENCE #4: RELATIONS TO EXTERNAL
VARIABLES
This source of evidence refers to traditional forms of criterion-related validity evidence such as
concurrent and predictive validity.

It also refers to comprehensive investigations of the relationships among test scores and other
variables (construct validity: convergent & discriminant evidence) such as multitrait-multimethod
studies (Campbell & Fiske, 1959).

These external variables can be used to evaluate


hypothesized relationships (e.g., test scores and teacher grades) between test scores and other measures of
student achievement → convergent & concurrent evidence
the degree to which different tests actually measure different skills → discriminant evidence
the utility of test scores for predicting specific criteria (e.g., university CGPA/grades) → predictive evidence

51
SOURCES OF VALIDITY EVIDENCE #4: RELATIONS TO EXTERNAL
VARIABLES (cont)

Some Examples of Evidence based on Relations to External Variables

sensitivity to instruction
criterion-related validation of on track
criterion-related studies of change in achievement/growth
criterion-related validation of readiness
predictive validity
group differences (test fairness)
classroom artifacts

52
#4: EXTERNAL RELATIONSHIP EVIDENCE IN MUET
Source of Evidence: Relations to External Variables

Evidence Type: Examples of Evidence
Relationships with Conceptually Related Constructs (convergent evidence):
• There was a high positive correlation between MUET and IELTS overall bands, Benchmarking Report (MEC, 2005)
• The relationship between MUET and a local public university EPT was not sufficiently high to enable them to be used interchangeably (Abu Kassim, Zubairi, & Mat Daud, 2007)
Relationships with Criteria (i.e., Criterion-Related Validation Effort):
Predictive validity of MUET & college grades (academic achievement in higher education): the results of the studies varied such that they did not offer conclusive evidence about the predictive validity of MUET (Abd Samad, Syed Abd Rahman, & Yahya, 2008; Othman & Nordin, 2013; Rethinasamy & Chuah, 2011)
Other Convergent/Discriminant Validity:
Correlation between MUET and academic performance of electrical engineering students (Norlida Buniyamin, 2015)

53
SOURCES OF VALIDITY EVIDENCE #5: CONSEQUENCES OF TESTING

This evidence refers to the evaluation of the intended and unintended consequences associated with
a testing program.

Some Examples of Evidence based on Testing Consequences


investigations of adverse impact
evaluation of the effects of testing on instruction,
evaluation of the effects of testing on issues such as high school dropout rates
analyses of students’ opportunity to learn
analyses of changes in textbooks and instructional approaches

Investigations of Unintended Consequences


changes in instruction
diminished morale among teachers and students
increased pressure on students leading to increased dropout rates
the pursuit of college majors and careers that are less challenging
54
#5: EVIDENCE BASED ON CONSEQUENCES OF TESTING IN MUET
Source of Evidence: Consequences of Testing
Evidence Type: Examples of Evidence
Pedagogical Consequences:
• Teaching and learning materials such as commercially produced books and commercially prepared MUET exam papers (Lee, 2004) for basic test preparation are widely published.
• A Study on the Washback Effect of MUET (Nambiar & Ransirini, 2012):
• High schools and pre-university programs have to allocate approximately 10 hours a week to prepare students for the test.
• Classroom observations demonstrated that “reading activities were modeled largely in MUET format, that is, teachers would discuss reading passages from textbooks or previous exam papers which were then followed by the multiple choice questions” (p. 8).
• Speaking skill was not taught in a manner that closely matches the use of language in the real-life target situation, particularly the university.

55
#5: EVIDENCE BASED ON CONSEQUENCES OF TESTING IN MUET
(cont…)
Source of Evidence: Consequences of Testing

Evidence Type: Examples of Evidence
Teacher Morale/Perception of Test Utility: Interviews and surveys among 9 high school teachers (Nambiar & Ransirini, 2012):
• Teachers were frustrated because they had been forced to teach to the test.
• Teachers’ creativity and innovation have been seriously hindered by MUET.
Students’ Perspectives: Interviews and surveys among 108 high school students (Nambiar & Ransirini, 2012):
• The majority of students think MUET has contributed towards an improvement in their language proficiency.
• Only half of the students perceived that their performance at the university would be enhanced because of MUET.
• MUET has a positive impact on speaking and reading, which are the two most useful skills at the university.

56
#5: EVIDENCE BASED ON CONSEQUENCES OF TESTING IN MUET
(cont…)
Source of Evidence: Consequences of Testing

Evidence Type: Examples of Evidence
Teachers’ Perception on Changes in Student Learning: Interviews and surveys among 9 high school teachers (Nambiar & Ransirini, 2012):
• Teachers viewed reading as the most important component of MUET and the most useful skill to survive at the tertiary level.
• Speaking skill is also important in order to succeed at the tertiary level.
Score Report Utility & Clarity: MUET Band Description (MEC, January 2016); MUET Regulations, Test Specifications, Test Format and Sample Questions (MEC, 2015)

57
MORE ON MUET
http://portal.mpm.edu.my/web/guest/home
http://portal.mpm.edu.my/web/guest/perkhidmatan-online
http://portal.mpm.edu.my/documents/10156/f0ecb3db-fc56-4ec3-a817-bb836cc4294b
http://apps.mpm.edu.my/mod/public/register
MUET_Test_Specification_2015VersiPortal.pdf

58
https://assess.com/
59
5 SOURCES OF VALIDITY EVIDENCE: RECAP

1. Evidence based on Test Content


2. Evidence based on Response Processes
3. Evidence based on Internal Structure
4. Evidence based on Relations to Other Variables
5. Evidence for Validity and Consequences of Testing

60
5 SOURCES OF VALIDITY EVIDENCE: EXERCISE #1

A researcher is constructing a scale to measure self-efficacy of University athletes. Although


other self-efficacy scales exist, the researcher is interested in developing a shorter version that
requires less time to complete than those currently available.

For each of the following situations, specify which validity evidence the researcher is gathering.
• content
• consequences
• response-process
• criterion-related
• internal structure

61
5 SOURCES OF VALIDITY EVIDENCE: EXERCISE #1 (cont…)

For each of the following situations, specify which validity evidence the researcher is gathering.
• content
• consequences
• response-process
• criterion-related
• internal structure

1. The researcher estimates the difficulty and discrimination of each item. internal structure

2. The researcher asks a panel of experts to rate each item according to how well its content
matches what the item is intended to measure. content

62
5 SOURCES OF VALIDITY EVIDENCE: EXERCISE #1 (cont…)
For each of the following situations, specify which validity evidence the researcher is gathering.
• content
• consequences
• response-process
• criterion-related
• internal structure

3. The researcher asks a group of five respondents to write down how they arrived at each response (e.g., what
processes they used, what they were thinking about, etc.). response-process

4. The researcher conducted a factor analysis and inter-item correlations to examine the pattern of
interrelationships between the items of the self-efficacy scale. internal structure

5. The researcher correlates the scores on the self-efficacy scale obtained for a sample of individuals with the
scores obtained from a long, comprehensive self-efficacy scale. criterion-related

63
5 SOURCES OF VALIDITY EVIDENCE: EXERCISE #2

For each of the following situation, specify which of the five sources of validity evidence it addresses.
• content
• consequences
• response-process
• criterion-related
• internal structure

1. The scores of a scale measuring job performance potential are not correlated with actual job
productivity. criterion-related

2. Patricia completes a psychological inventory measuring extroversion. The possible scores that
any individual can receive range from 0 (low extroversion) to 50 (high extroversion). On three
different occasions, Patricia receives scores of 35, 42, and 27 on the extroversion scale. internal structure

3. A researcher using a scale measuring fidelity notices that, for a particular item, individuals
scoring high on the item tended to score low on the other items. internal structure 64
Creating a Validity Argument

65
CREATING A VALIDITY ARGUMENT

In writing up reports or manuscripts where it is important to demonstrate validity, you should


describe the collection of each form of evidence (or the possible forms) and how this evidence was
used to improve the instrument and/or support the validity of the obtained scores.

“In developing this instrument


several forms of validity evidence were collected.

Content-based evidence of validity was obtained by….


internal structure evidence was obtained by…etc.”

One example of Validity Argument

66
VALIDITY ARGUMENT (cont…): SOME QUESTIONS TO BE ASKED…

Validity (Kesahan) concerns whether the obtained scores lead to the correct and intended
interpretations and decisions about people.

To create validity argument, we can start asking questions like:


Is the score reflective of the level of the trait intended to be measured?
Does a student's mathematics test score reflect his or her level of mathematical ability?
Or does the student's mathematics test score reflect both mathematical ability and English-language ability?
If the test is administered by computer, do students need to master computer skills to answer the
mathematics questions posed? If a student's mathematics score is low because he or she is not skilled at using a
computer, but is skilled at calculating with pencil, paper, and a calculator, is that mathematics
score valid?

Are the scores used in the intended fashion and lead to the intended outcomes and decisions?
Do we use the scores correctly?
Do we make the right decision for people based on the test scores?
Is the Biology test score used to make assumptions/predictions about students' ability to enter the medical
programme and graduate successfully?

67
VALIDITY ARGUMENT (cont…): The Bahasa Malaysia Pass Requirement Issue (Isu Lulus Bahasa Malaysia)

http://www.astroawani.com/berita-malaysia/isu-lulus-bm-doktor-inggeris-dahulu-pun-cakap-melayu-rais-148149 68
https://www.themalaysianinsight.com/bahasa/s/6734/
VALIDITY ARGUMENT (cont…): The Bahasa Malaysia Pass Requirement Issue (cont…)

How does an SPM Bahasa Malaysia score reflect doctors' ability to use Bahasa Malaysia fluently when
discussing medical terms with patients who prefer to speak Bahasa Malaysia?

Which important topics in the SPM examination (e.g., the oral test? the literature component?
grammar?) can serve as a yardstick of doctors' ability (examination scores) to use Bahasa
Malaysia when meeting patients?

Must government doctors pass the Bahasa Malaysia examination in order to converse in Bahasa
Malaysia or in its dialects?

Is the Bahasa Malaysia pass requirement a relaxation of the requirements or the enforcement of a new
requirement (previously, doctors did not need to pass Bahasa Malaysia)? Why has this issue arisen?
69
ANOTHER ISSUE: Leaked Exam Papers (Kertas Soalan Bocor)!
https://ohbulan.com/soalan-matematik-didakwa-bocor/

http://www.astroawani.com/berita-malaysia/dakwaan-kertas-spm-bahasa-melayu-matematik-baha
sa-cina-bocor-tidak-benar-191368

https://www.bharian.com.my/berita/nasional/2018/11/498975/kertas-peperiksaan-spm-bocor-tid
ak-benar

https://www.bharian.com.my/berita/nasional/2018/11/498933/kementerian-pendidikan-nafi-soal
an-matematik-spm-bocor

70
HYPOTHETICAL SITUATION:
LET’S DO A ROLE PLAY! THE NEWS IS TRUE!!!!

We are the…
1. Minister and Deputy Minister
2. Test takers from the involved school
3. Test takers from other schools
4. Parents
5. Math Teachers
6. Malaysian Examination Council Officials
7. Public & Netizen

Given your role for each group:


Discuss how you will react to this issue
What questions will you ask? (if any)
What arguments can you make? (if any)
What sources of evidence might you convey/show?
71
VALIDITY IN ESSENCE: RECAP

“… the most fundamental consideration in developing and evaluating tests” (AERA, APA, & NCME,
1999, p.9; 2014, p. 11).

“Validity is a theoretical notion that defines the scope and the nature of validation work” (Xi, 2008, p.
177).

Validity refers to the degree to which each interpretation or use of a test score is supported by the
accumulated evidence. It constitutes the central notion underlying the development, administration,
and scoring of a test and the uses and interpretations of test scores (AERA, APA, & NCME, 2014, 1999;
Sireci, 2016 in SBAC Technical Report).

“Validity refers to the degree to which evidence and theory


support the interpretations of test scores for proposed uses of
tests” (AERA, APA, & NCME, 2014, p.11)

72
VALIDATION IN ESSENCE: RECAP
Validation is the process of accumulating evidence to support each proposed score interpretation or
use. This validation process does not rely on a single study or gathering one type of evidence. Rather,
validation involves multiple investigations and different kinds of supporting evidence (AERA, APA, &
NCME, 1999; 2014; ETS, 2002; Kane, 2006).

WHY VALIDATION? RECAP


Validation is the joint responsibility (tanggungjawab bersama) of the test developer and the test user
(AERA, APA, & NCME, 2014, 1999).

The test developer is responsible for furnishing relevant evidence and a rationale in support of the
intended use.

The test user is ultimately responsible for evaluating the evidence in the particular setting in which
the test is to be used (Standards, AERA, APA, & NCME, 1999, p. 11).
73
BACK TO OUTLINE based on OUR DISCUSSION SO FAR…
Brief History
Validity
Construct (convergent & discriminant)
Content
Criterion (predictive & concurrent)
Validation
5 Primary Sources of Validity Evidence from the Standards (AERA, APA, & NCME, 2014, 1999)
Test Content
Response Processes
Internal Structure
Relations with External Variables/Relationship to a Criterion
Consequences of Testing
Creating a Validity Argument
Factors Affecting Validity
Relationship between Validity & Reliability

[Timeline overlaid on the slide: APA (1954); Cronbach & Meehl (1955); APA (1966); Messick (1975, 1980, 1989); APA, AERA, & NCME (1985); Kane (1992, 2006, 2011, 2013); APA, AERA, & NCME (1999); APA, AERA, & NCME (2014)] 74
Factors Affecting Validity

75
FACTORS AFFECTING VALIDITY

test-related factors (test itself)


the criterion to which you compare your instrument may not be well enough established
intervening events (test administration and scoring)
human factor (examinees, scorers)
test features/practices
item properties
score interpretation and uses

76
FACTORS AFFECTING VALIDITY (cont…)

Effects from the test itself


whether the statement/stem of the items is clear or not
whether the items represent the trait measured or not
whether the length of the test is adequate or not
whether the test and item difficulty is appropriate or not

77
FACTORS AFFECTING VALIDITY (cont…)

Test administration
whether the testing conditions are appropriate or not
whether unexpected disturbances occur or not
whether the test administrator administers the test according to the test manual/guideline or not
whether the test guides for examinees are clear or not

Scoring
whether the scoring system is objective and standardized or not
scoring fails to capture important qualities of task performance
undue emphasis on some criteria, forms or styles of response
lack of intra-rater or inter-rater consistency
scoring too analytic
scoring too holistic

78
FACTORS AFFECTING VALIDITY (cont…)

Examinees
whether the sample is representative, heterogeneous or not
interests and motivation on the test
emotional state and attitude during testing
state of physical health
experiences on test
assessment anxiety

79
Factors Affecting
Validity

• Factors in the Test Itself


 Each test contains items, and a close scrutiny of the test items will indicate whether the test
appears to measure the subject matter content and the mental functions that the teacher
wishes to test.
 The following factors in the test itself can prevent the test items from functioning as
desired and thereby lower the validity:
(a) Length of the test
(b) Unclear direction
(c) Reading vocabulary and sentence structures which are too difficult
(Referred to Your Article Library, The Next Generation Library)

80
Cont…

(d) Inappropriate level of difficulty of the test items


(e) Poorly constructed test items
(f) Ambiguity
(g) Test items inappropriate for the outcomes being measured
(h) Improper arrangement of items
(i) Identifiable pattern of answers

(Referred to Your Article Library, The Next Generation Library)

81
Cont…

• Functioning Content and Teaching Procedure


 In achievement testing, the functioning content of test items cannot be determined only by examining the
form and content of the test. The teacher has to teach fully how to solve a particular problem before
including it in the test.

• Factors in Test Administration and Scoring


 The test administration and scoring procedures may also affect the validity of the interpretations from the
results. For instance, in teacher-made tests, factors like insufficient time to complete the test, unfair help to
individual students, cheating during the examination, and the unreliable scoring of essay answers might
lower the validity.
(Referred to Your Article Library, The Next Generation Library)

82
Cont…

• Factors in Pupils’ Response


 There are certain personal factors which influence the pupils’ responses to the test situation and invalidate the
test interpretation. Emotionally disturbed students, students who lack motivation, and students who are afraid
of the test situation may not respond normally, and this may ultimately affect the validity.

• Nature of the Group and the Criterion


 It has already been explained to you that validity is always specific to a particular group.
 The nature of the criterion used is another important consideration while evaluating validity coefficient.
(Referred to Your Article Library, The Next Generation Library)

83
Factors Affecting Validity (PATRICIA’S SLIDE, A172 GROUP B)

[Diagram: three factors: Structure of the items; Difficulty level of the test; Arrangement of items and correct responses]

84
Relationship
between
Validity & Reliability

85
RELATIONSHIP BETWEEN VALIDITY & RELIABILITY

86
RELATIONSHIP BETWEEN VALIDITY & RELIABILITY (cont…)

• Women and Weight Scale: Imagine your weight scale is broken.

It doesn’t give you the right measure
of your weight (the measure is not valid).

But it gives you almost the same
measure every time you step on it
(the measure is reliable).

87
RELATIONSHIP BETWEEN VALIDITY & RELIABILITY (cont…)

 Validity is the more important feature.


 Reliability is a prerequisite to validity (in other words, if items accurately assess a
domain of content, the scores will also be consistent).
 Assessment results may be reliable (i.e., consistent) but not valid (i.e., accurate).
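
One standard psychometric way to formalize "reliability is a prerequisite to validity" (added here as a supplementary note, not stated on the slide) is the classical test theory bound: a score's correlation with any criterion cannot exceed the square root of its reliability.

r_{XY} \le \sqrt{r_{XX'}} \qquad \text{e.g., if } r_{XX'} = 0.49 \text{ then } r_{XY} \le 0.70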

88
RELATIONSHIP BETWEEN VALIDITY & RELIABILITY (cont…)

Validity and reliability are closely related.

A test score cannot be considered valid unless


the measurements resulting from it are reliable

On the other hand, results/scores from a test can


be reliable but not necessarily valid.

89
Relationship Between Validity and Reliability
• Validity and reliability are closely related
• A test score cannot be considered valid unless it is reliable.
• Likewise, result from a test can be reliable and not necessarily valid

90
Cont..
• A valid test score is always reliable (in order for a test score to be valid, it needs to be reliable in
the first place).
• A reliable test score is not always valid.
• Validity comes first to ensure reliability.
• For test scores to be useful, a measuring instrument (test, scale) must be both reasonably reliable
and valid.
• Aim for validity first, and then try to make the test more reliable little by little, rather than the other
way around.

91
IF the scores are…  THEN…
• Unreliable → Validity is undermined.
• Reliable, but not valid → The test and its scores are not useful.
• Unreliable and invalid → The test and its scores are definitely NOT useful!
• Reliable and valid → The test will yield valid & reliable scores (trustworthy and of good quality).

92
OPTIMIZING RELIABILITY & VALIDITY

The more questions the better (the number of test items)


Ask questions several times in slightly different ways (homogeneity)
Get as many people as you can in your program (sample size n)
Get different kinds of people in your program (sample heterogeneity)
Linear relationship between the test and the criterion (Pearson correlation coefficient)
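
For reference, the Pearson correlation coefficient mentioned in the last point is the standard product-moment formula (added here for completeness):

r_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}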

93
SELECTING & CREATING MEASURES

Define the construct(s) that you want to measure clearly.


Identify existing measures, particularly those with established reliability and validity.
Determine whether those measures will work for your purpose and identify any areas
where you may need to create a new measure or add new questions.
Create additional questions/measures.
Identify criteria that your measure should correlate with or predict, and develop
procedures for assessing those criteria.

94
RELIABILITY & VALIDITY: PROPERTIES OF SCORES

Reliability and validity are properties of scores, not instruments.

A particular instrument may generate scores that are reliable and valid for one population or
individual, but not for another.
For example, a science test given in English may yield valid and reliable scores in the U.S., but not in China.

So, when we speak of reliability and validity, we talk about “scores” being reliable or valid, not
“instruments” or “tests”.

95
RECAP

Validity
Construct
Content
Criterion
Validation
Essential Validity Evidence from the Standards (AERA, APA, & NCME,
1999, p.17)
5 Primary Sources of Validity Evidence from the Standards (AERA,
APA, & NCME, 2014, 1999)
Test Content
Response Processes
Internal Structure
Relations with External Variables/Relationship to a
Criterion/External Relationship
Consequences of Testing
Creating a Validity Argument
Factors Affecting Validity
Relationship between Validity & Reliability 96
nurliyana@uum.edu.my

nurliyana.bukhari@alumni.uncg.edu

97
