
PSYCHOLOGICAL MEASUREMENTS: Reliability, Validity and Utility

RELIABILITY – synonym for dependability or consistency; refers to the consistency of measurement.
● Reliability coefficient – an index that indicates the ratio between the true score variance on a test and the total variance.
● "Reliability estimates" in the range of .70 to .80 are good enough for most purposes in basic research.

THE CONCEPT OF RELIABILITY
CLASSICAL TEST THEORY – a score on an ability test is presumed to reflect not only the test taker's true score on the ability being measured but also error.

Observed score = True ability + Random error

Measurement error – all of the factors associated with the process of measuring some variable, other than the variable being measured.
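To make the classical test theory decomposition above concrete, the short simulation below (an illustration only; the score scale, the size of the error, and all variable names are assumptions rather than anything from these notes) generates true scores, adds random error, and recovers the reliability coefficient as the ratio of true-score variance to total observed-score variance.

import numpy as np

rng = np.random.default_rng(0)

n_examinees = 1_000
true_scores = rng.normal(loc=100, scale=15, size=n_examinees)   # hypothetical true ability
random_error = rng.normal(loc=0, scale=10, size=n_examinees)    # unsystematic measurement error

# Classical test theory: Observed score = True ability + Random error
observed_scores = true_scores + random_error

# Reliability coefficient: true-score variance divided by total (observed) variance
reliability = true_scores.var() / observed_scores.var()
print(f"estimated reliability: {reliability:.2f}")   # about 15**2 / (15**2 + 10**2) = .69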

Two forms of measurement error

1. Random error – this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores. The main reasons for random error are limitations of instruments, environmental factors, and slight variations in procedure.
2. Systematic error – this source of error is predictable and either constant or proportional to the measurement. Systematic error affects validity. It is typically caused by observational error, imperfect instrument calibration, and sampling bias.

ERROR VARIANCE – the element of variability in a score that is produced by extraneous factors, such as measurement imprecision, and is not attributable to the independent variable or other controlled experimental manipulation.

Sources of error variance:
1. Item sampling or content sampling – refers to variation among items within a test as well as to variation among items between tests.
2. Administration – the test environment: room temperature, level of lighting, and the amount of ventilation and noise.
● Test-taker variables – emotional problems, physical discomfort, lack of sleep.
● Scoring and interpretation – despite the rigorous scoring criteria set forth in many of the better-known tests of intelligence, the scorer or rater can be a source of error variance.

RELIABILITY ESTIMATES
1. Test-retest reliability – also known as time-sampling reliability because it uses the same instrument to measure the same thing at two points in time.
2. Parallel-forms and alternate-forms reliability – compares two equivalent forms of a test that measure the same attribute.
3. Internal consistency – assesses the correlation between multiple items in a test that are intended to measure the same construct.

Ways to compute internal consistency (a computational sketch follows at the end of this section)
● Split-half reliability – correlating two pairs of scores obtained from equivalent halves of a single test administered once.
● KR-20 (Kuder-Richardson formula 20) – a measure of internal consistency reliability for measures with dichotomous choices.
● Cronbach's alpha – the most famous and most commonly used of the reliability coefficients because it requires only one administration of the test.
● Average proportional distance (APD) – a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.

WHAT TO DO ABOUT LOW RELIABILITY
1. Increase the number of items – the larger the sample of items, the more likely it is that the test will represent the true characteristic.
2. Factor and item analysis – tests are most reliable if they are unidimensional, measuring a single ability, construct, or skill.
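The sketch below illustrates the internal-consistency estimates listed above (split-half with the Spearman-Brown correction, Cronbach's alpha, and KR-20) on a small matrix of dichotomously scored items. The responses and the odd/even item split are invented for illustration; the formulas are the standard ones, not anything specific to these notes.

import numpy as np

# Rows = test takers, columns = dichotomously scored items (1 = correct, 0 = incorrect).
# Invented responses for illustration only.
X = np.array([
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
])
k = X.shape[1]                      # number of items
total_var = X.sum(axis=1).var()     # variance of the total scores

# Split-half: correlate odd-item and even-item half scores, then apply the
# Spearman-Brown correction to estimate the reliability of the full-length test.
half_1 = X[:, ::2].sum(axis=1)
half_2 = X[:, 1::2].sum(axis=1)
r_halves = np.corrcoef(half_1, half_2)[0, 1]
split_half = (2 * r_halves) / (1 + r_halves)

# Cronbach's alpha: compares the sum of the item variances with the total-score variance.
alpha = (k / (k - 1)) * (1 - X.var(axis=0).sum() / total_var)

# KR-20: the special case of alpha for dichotomous items (p = proportion passing each item),
# so for 0/1 data it gives the same value as alpha.
p = X.mean(axis=0)
kr20 = (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

print(f"split-half (Spearman-Brown corrected): {split_half:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")
print(f"KR-20: {kr20:.2f}")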
CHAPTER 7: VALIDITY

VALIDITY
Validity is a term used in conjunction with the meaningfulness of a test score: what the test score truly means.

CONCEPT OF VALIDITY
Validity, as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context. More specifically, it is a judgment based on evidence about the appropriateness of inferences drawn from test scores. (An inference is a logical result or deduction.) Characterizations of the validity of tests and test scores are frequently phrased in terms such as "acceptable" or "weak." These terms reflect a judgment about how adequately the test measures what it purports to measure.
- How useful the instrument is for a particular purpose with a particular population of people (a "valid test").
- A test has been shown to be valid for a particular use with a particular population of testtakers at a particular time.
- No test or measurement technique is "universally valid" for all time, for all uses, with all types of testtaker populations.
- Further, to the extent that the validity of a test may diminish as the culture or the times change, the validity of a test may have to be re-established with the same as well as other testtaker populations.

Validation is the process of gathering and evaluating evidence about validity. Both the test developer and the test user may play a role in the validation of a test for a specific purpose. It is the test developer's responsibility to supply validity evidence in the test manual.

Validation in therapy includes active listening, mindfully responding, and other validation skills that encourage the understanding and acceptance of the client's experiences. This helps clients feel heard and understood, contributing to the therapeutic outcome.

It may sometimes be appropriate for test users to conduct their own validation studies with their own groups of testtakers. Such local validation studies may yield insights regarding a particular population of testtakers as compared to the norming sample described in a test manual. Local validation studies are absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test.
Example: the Adolescent Depression Inventory (ADI) in identifying symptoms of depression.

Local validation studies would also be necessary if a test user sought to use a test with a population of testtakers that differed in some significant way from the population on which the test was standardized.

One way measurement specialists have traditionally conceptualized validity is according to three categories:
1. Content validity. This is a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test.
2. Criterion-related validity. This is a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures.
3. Construct validity. This is a measure of validity that is arrived at by executing a comprehensive analysis of
a. how scores on the test relate to other test scores and measures, and
b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.

- The trinitarian approaches to validity assessment are not mutually exclusive.
- All three types of validity evidence contribute to a unified picture of a test's validity.
- A test user may not need to know about all three.
- Depending on the use to which a test is being put, one type of validity evidence may be more relevant than another.
Ecological validity refers to a judgment regarding how well a test measures what it purports to measure at the time and place that the variable being measured (typically a behavior, cognition, or emotion) is actually emitted. In essence, the greater the ecological validity of a test or other measurement procedure, the greater the generalizability of the measurement results to particular real-life circumstances.

FACE VALIDITY
- Face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures.
- If a test definitely appears to measure what it purports to measure "on the face of it," then it could be said to be high in face validity.
- A test's lack of face validity could contribute to a lack of confidence in the perceived effectiveness of the test, with a consequential decrease in the testtaker's cooperation or motivation to do his or her best.
- In a corporate environment, lack of face validity may lead to unwillingness of administrators or managers to "buy in" to the use of a particular test.
- In a similar vein, parents may object to having their children tested with instruments that lack ostensible validity. Such concern might stem from a belief that the use of such tests will result in invalid conclusions.
- In reality, a test that lacks face validity may still be relevant and useful. However, if the test is not perceived as relevant and useful by testtakers, parents, legislators, and others, then negative consequences may result. These consequences may range from poor testtaker attitude to lawsuits filed by disgruntled parties against a test user and test publisher. Ultimately, face validity may be more a matter of public relations than psychometric soundness. Still, it is important nonetheless, and (much like Rodney Dangerfield) deserving of respect.

CONTENT VALIDITY
Content validity describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample. It refers to the extent to which a test or assessment instrument evaluates all aspects of the topic, construct, or behavior that it is designed to measure.

CRITERION-RELATED VALIDITY
Criterion-related validity is an assessment of how well a test score can accurately predict or reflect an individual's likely performance on a specific measure of interest, which is referred to as the criterion.

Concurrent validity checks how well a test score aligns with a criterion measured at the same time. Predictive validity, on the other hand, measures how accurately a test score can predict future performance on a criterion.

An adequate criterion measure must also be valid for the purpose for which it is being used. If one test (X) is being used as the criterion to validate a second test (Y), then evidence should exist that test X is valid. If the criterion used is a rating made by a judge or a panel, then evidence should exist that the rating is valid.

The criterion must not be contaminated. When criterion contamination does occur, the results of the validation study cannot be taken seriously. There are no methods or statistics to gauge the extent to which criterion contamination has taken place, and there are no methods or statistics to correct for such contamination.

Concurrent Validity
If test scores are obtained at about the same time as the criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of concurrent validity. Statements of concurrent validity indicate the extent to which test scores may be used to estimate an individual's present standing on a criterion.

Predictive Validity
Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place. The intervening event may take varied forms, such as training, experience, therapy, medication, or simply the passage of time. Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test; that is, how accurately scores on the test predict some criterion measure.
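Both forms of criterion-related validity are usually reported as a validity coefficient, the correlation between test scores and the criterion. The sketch below uses invented scores (none of these numbers come from the notes); the only difference between the two designs is when the criterion is collected, at about the same time for concurrent validity and later for predictive validity.

import numpy as np

# Invented data: an aptitude test administered at the time of hiring.
test_scores = np.array([52, 61, 47, 70, 58, 66, 49, 73, 55, 64])

# Criterion measured at about the same time -> evidence of concurrent validity.
current_ratings = np.array([3.1, 3.8, 2.9, 4.4, 3.5, 4.0, 3.0, 4.6, 3.2, 3.9])

# Criterion measured six months later -> evidence of predictive validity.
later_ratings = np.array([3.4, 3.6, 2.7, 4.2, 3.3, 4.1, 3.1, 4.5, 3.5, 3.8])

concurrent_validity = np.corrcoef(test_scores, current_ratings)[0, 1]
predictive_validity = np.corrcoef(test_scores, later_ratings)[0, 1]

print(f"concurrent validity coefficient: {concurrent_validity:.2f}")
print(f"predictive validity coefficient: {predictive_validity:.2f}")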
Base rate – the extent to which a particular trait, behavior, characteristic, or attribute exists in the population (expressed as a proportion).

Hit rate – may be defined as the proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute.
Example: Imagine you have a test designed to identify individuals with high creativity. You administer the test to a group of people, and the hit rate is the proportion of those correctly identified by the test as highly creative among all the truly creative individuals in the group. For example, if the test correctly identifies 80 out of 100 genuinely creative individuals, the hit rate would be 80%.

Miss rate – may be defined as the proportion of people the test fails to identify as having, or not having, a particular characteristic or attribute. Misses take two forms: false positives and false negatives.

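These quantities can be read off a simple 2 x 2 table of test decisions against actual status. The counts below extend the creativity example with invented numbers for the people the test wrongly flags or clears; they are not from these notes.

# Invented counts for a creativity screening test.
true_positives = 80    # genuinely creative people the test identifies as creative (hits)
false_negatives = 20   # genuinely creative people the test misses
false_positives = 30   # non-creative people the test wrongly flags as creative
true_negatives = 170   # non-creative people the test correctly clears

total = true_positives + false_negatives + false_positives + true_negatives

# Base rate: proportion of the population that actually possesses the attribute.
base_rate = (true_positives + false_negatives) / total

# Hit rate: proportion of those who possess the attribute that the test identifies.
hit_rate = true_positives / (true_positives + false_negatives)

# Miss rate: proportion of those who possess the attribute that the test fails to identify.
miss_rate = 1 - hit_rate

print(f"base rate: {base_rate:.2f}")   # 100 / 300 = 0.33
print(f"hit rate:  {hit_rate:.2f}")    # 80 / 100 = 0.80
print(f"miss rate: {miss_rate:.2f}")   # 0.20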
Construct Validity
Construct validity is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct. A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior.

Constructs are unobservable, presupposed (underlying) traits that a test developer may invoke to describe test behavior or criterion performance.
- Hypotheses are made about the expected behavior of high scorers and low scorers on the test.
- These hypotheses give rise to a tentative theory about the nature of the construct the test was designed to measure.
- If the test is a valid measure of the construct, then high scorers and low scorers will behave as predicted by the theory.
- If high scorers and low scorers do not behave as predicted, the investigator will need to reexamine the nature of the construct itself or the hypotheses made about it, or conclude that the test simply does not measure the construct; one procedure may have been more appropriate than another, given the particular assumptions.

Although confirming evidence contributes to a judgment that a test is a valid measure of a construct, evidence to the contrary can also be useful.
- Contrary evidence can provide a stimulus for the discovery of new facets of the construct as well as alternative methods of measurement.

Evidence of Construct Validity
■ the test is homogeneous, measuring a single construct;
■ test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation as theoretically predicted;
■ test scores obtained after some event or the mere passage of time (that is, posttest scores) differ from pretest scores as theoretically predicted;
■ test scores obtained by people from distinct groups vary as predicted by the theory;
■ test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question.

Validity, Bias, and Fairness

Test Bias
Bias is a factor inherent in a test that systematically prevents accurate, impartial measurement. Test bias can arise more from the study design than from the test itself.

Rating error
A rating is a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a rating scale. Simply stated, a rating error is a judgment resulting from the intentional or unintentional misuse of a rating scale. Thus, for example, a leniency error (also known as a generosity error) is, as its name implies, an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading.

Severity error is a type of rating error in which the ratings are consistently overly negative, particularly with regard to the performance or ability of the participants. It is a type of error that can occur in psychometric assessments. Severity error is the opposite of leniency error, a type of rating error in which the ratings are consistently overly positive.

Central tendency error
Here the rater, for whatever reason, exhibits a general and systematic reluctance to give ratings at either the positive or the negative extreme.
Consequently, all of this rater's ratings would tend to cluster in the middle of the rating continuum.
Solution: use rankings.

Halo effect describes the fact that, for some raters, some ratees can do no wrong. More specifically, a halo effect may also be defined as a tendency to give a particular ratee a higher rating than he or she objectively deserves because of the rater's failure to discriminate among conceptually distinct and potentially independent aspects of a ratee's behavior.

Test Fairness
Unlike the technically complex nature of test bias, concerns about test fairness are often tied to values. While test bias can be addressed with precision, fairness is subjective and can lead to ongoing debates among people with different viewpoints. In the context of psychometrics, fairness is defined as the degree to which a test is employed impartially, justly, and equitably.
CHAPTER 7: UTILITY

UTILITY: the usefulness or practical value of testing to improve efficiency.

FACTORS THAT AFFECT A TEST'S UTILITY
• Psychometric soundness
o The reliability and validity of a test
o Gives us the practical value of the scores (reliability and validity)
o Tells us whether decisions are cost-effective
o A valid test is not always a useful test, especially if test takers do not follow test directions
• Costs
o Economic and non-economic
o (e.g.) using a less expensive and therefore less stringent application process for airline personnel
• Benefits
o Profits, gains, advantages
o (e.g.) a more stringent hiring policy leads to more productive employees
o (e.g.) maintaining a successful academic environment at a university

UTILITY ANALYSIS

What is Utility Analysis?
- A family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment.

How Is a Utility Analysis Conducted?
The objective dictates what sort of information will be required as well as the specific methods to be used.

Expectancy Data
o An expectancy table provides an indication of the likelihood that a test taker will score within some interval of scores on a criterion measure
o Used to weigh costs against benefits

Brogden-Cronbach-Gleser formula
o Utility gain: an estimate of the benefit of using a particular test or selection method
o At its simplest, it is benefits minus costs (a rough numerical sketch follows below)
o Productivity gain: the estimated increase in work output
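A rough numerical illustration of a Brogden-Cronbach-Gleser style utility gain follows. All of the figures and variable names are assumptions made for the example (they are not from these notes), and texts differ on details such as whether the cost term counts every applicant tested or only those hired; the sketch simply treats utility gain as estimated benefits minus testing costs.

# Assumed figures for illustration only.
n_applicants = 200               # people tested
n_selected = 20                  # people hired on the basis of the test
avg_tenure_years = 2.0           # expected time the selectees stay on the job
validity = 0.40                  # criterion-related validity coefficient of the test
sd_performance_dollars = 10_000  # standard deviation of job performance, in dollars
mean_z_of_selected = 1.2         # mean standardized test score of those selected
cost_per_applicant = 50          # cost of testing one applicant

# Utility gain as benefits minus costs (Brogden-Cronbach-Gleser style).
benefit = (n_selected * avg_tenure_years * validity
           * sd_performance_dollars * mean_z_of_selected)
cost = n_applicants * cost_per_applicant
utility_gain = benefit - cost

print(f"estimated benefit: ${benefit:,.0f}")
print(f"testing cost:      ${cost:,.0f}")
print(f"utility gain:      ${utility_gain:,.0f}")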
SOME PRACTICAL CONSIDERATIONS

The Pool of Job Applicants
o There is rarely a limitless supply of potential employees
o Dependent on many factors, including the economic environment
o We assume that top-scoring individuals will accept the job, but those individuals are more likely to be the ones being offered higher positions

The Complexity of the Job
o It is questionable whether the same utility analysis methods can be applied to jobs of varying complexity

The Cut Score in Use
o Relative cut score: may be defined as a reference point
- Based on norm-related considerations rather than on the relationship of test scores to a criterion
- Also called a norm-referenced cut score
- (e.g.) the top 10% of test scores get A's
o Fixed cut score: set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification
- Also called an absolute cut score
o Multiple cut scores: using two or more cut scores with reference to one predictor for the purpose of categorizing test takers
- (e.g.) cut scores that mark an A, B, C, etc., all measuring the same predictor
o Multiple hurdles: for success, requires an individual to complete many tasks, with elimination at each level (contrasted with the compensatory model in the sketch after this list)
- (e.g.) written application -> group interview -> personal interview, etc.
o Compensatory model of selection: the assumption is made that high scores on one attribute can compensate for low scores on another attribute

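The sketch below contrasts the multiple-hurdle strategy with the compensatory model described above. The stages, cut-offs, weights, and applicant scores are all invented for illustration.

# Invented applicant scores on three selection stages.
applicants = {
    "A": {"written": 78, "group_interview": 55, "personal_interview": 80},
    "B": {"written": 90, "group_interview": 72, "personal_interview": 68},
    "C": {"written": 60, "group_interview": 85, "personal_interview": 95},
}

# Multiple hurdles: a cut score at every stage; failing any one stage eliminates the applicant.
hurdles = {"written": 70, "group_interview": 65, "personal_interview": 65}
passes_all_hurdles = {
    name: all(scores[stage] >= cut for stage, cut in hurdles.items())
    for name, scores in applicants.items()
}

# Compensatory model: a weighted composite, so a high score on one attribute
# can make up for a low score on another.
weights = {"written": 0.5, "group_interview": 0.2, "personal_interview": 0.3}
composites = {
    name: sum(scores[stage] * weight for stage, weight in weights.items())
    for name, scores in applicants.items()
}

print("multiple hurdles:", passes_all_hurdles)  # only B clears every hurdle
print("compensatory composites:", composites)   # C stays competitive despite the low written score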
Methods for Setting Cut Scores

The Angoff Method
Judgments of experts (typically, item-by-item estimates of how a minimally competent test taker would perform) are averaged to arrive at the cut score.

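A minimal sketch of the Angoff procedure: each expert judges, item by item, the probability that a minimally competent test taker would answer correctly, and those judgments are averaged into a cut score. The three hypothetical experts and their probabilities below are invented.

# Each row = one expert; each value = judged probability that a minimally
# competent test taker answers that item correctly (invented numbers).
expert_judgments = [
    [0.60, 0.75, 0.50, 0.80, 0.65],
    [0.55, 0.70, 0.45, 0.85, 0.60],
    [0.65, 0.80, 0.55, 0.75, 0.70],
]

n_items = len(expert_judgments[0])

# Each expert's judgments sum to an expected score for a borderline test taker;
# averaging those expectations across experts gives the recommended cut score.
expected_scores = [sum(judgments) for judgments in expert_judgments]
cut_score = sum(expected_scores) / len(expected_scores)

print(f"recommended cut score: {cut_score:.1f} out of {n_items} items")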
The Known Groups Method
Collection of data on the predictor of interest from a group known to possess, and a group known not to possess, the trait, attribute, or ability. The cut score is based on which test score best discriminates the two groups' performance.

IRT-Based Method
Based on the test taker's performance across all items on a test: some portion of the total number of items on the test must be scored "correct" in order for the test taker to "pass" the test.

Item-mapping method: determining the difficulty level reflected by the cut score.

Bookmark method: test items are listed, one per page, in ascending level of difficulty. An expert places a bookmark to mark the divide that separates test takers who have acquired the minimal knowledge, skills, or abilities from those who have not.

Problems include the training of experts, possible floor and ceiling effects, and the optimal length of item booklets.

Other Methods
- Discriminant analysis: a family of statistical techniques used to shed light on the relationship between certain variables and two or more naturally occurring groups.
- (e.g.) the relationship between scores on tests and people judged to be successful or unsuccessful at a job.
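As a minimal illustration of the known groups idea (and, loosely, of what a discriminant analysis formalizes), the sketch below places a cut score midway between the mean test scores of two groups whose status is already known. The scores are invented, and a real study would also weigh group sizes, base rates, and the costs of the two kinds of misclassification.

import numpy as np

# Invented test scores from two groups whose status is already known.
successful = np.array([78, 82, 75, 88, 80, 85])
unsuccessful = np.array([55, 62, 58, 65, 60, 52])

# A simple known-groups cut score: midway between the two group means.
cut_score = (successful.mean() + unsuccessful.mean()) / 2
print(f"cut score: {cut_score:.1f}")

# How well the cut score separates the two groups in this sample.
correct = (successful >= cut_score).sum() + (unsuccessful < cut_score).sum()
total = successful.size + unsuccessful.size
print(f"correct classifications: {correct} of {total}")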