Chapter 8 Test Development

PSYCHOLOGICAL ASSESSMENT
Some basic Info from Sinag’s Reviewer: CHAPTER 8 TEST DEVELOPMENT (PART 1)
TEST DEVELOPMENT SCALOGRAM ANALYSIS
PROCESS used to analyze data. It is an analysis procedure and approach to test
I. TEST CONCEPTUALIZATION development that involves a graphic mapping of testtaker’s responses
stimulus for thoughts that sparks the conceptualization Objective: obtain an arrangement of items wherein endorsement of one item
could be almost anything automatically connotes endorsement of less extreme positions
review of the available literature a scaling
emerging social phenomenon method of the
response to a need to assess mastery in an emerging direct
occupation or profession estimation
NORM REFERENCED VS. CRITERION-REF TEST: ITEM variety. In here,
DEVELOPMENT ISSUES there is no need
NORM REFERENCED good item: high scores=respond to transform the
correctly; low scores= respond to that same item testtakers
incorrectly; insufficient and inappropriate when responses into
knowledge of mastery some other
CRITERION-REFERENCED good item: best discriminate scale.
between the 2 group, each item addresses the issue of depends on
whether the test taker has met certain criteria, and many variables:
conceptualization of the knowledge or skills to be variables being
measured measured,
PILOT WORK group for whom
preliminary research surrounding the creation of a it is intended,
prototype of the test and the
attempts to determine how best to measure a targeted preferences of
construct the test
literature reviews, experimentation, and creation, developer
revision, and deletion
necessity for publication and wide distribution
not necessary for developing teacher-made test WRITING ITEMS
II. TEST CONSTRUCTION what range of content should the items cover?
Scaling which of the many different types of item formats should be employed?
process of setting rules for assigning numbers in how many items should be written in total and for each content are covered?
measurement Multiple- choice format advisable that the first draft contain approx. twice the
numbers are assigned to different amounts of what was number of items
being measured Item Pool reservoir or well from which items will or will not be drawn for the
L.L. Thurstone “A method of scaling” introduced the people drawn for the final version of the tests
notion of absolute scaling good item sampling= good content validity
test developer needs to ensure that the final version also contains items that
TYPES OD SCALES adequately sample the domain. poorly written= rewritten or the developer should
Age-Based Scale create new items
Grade-Based Scale HOW TO DEVELOP ITEMS
Stanine Scales write a large number of items from personal experience or academic
Unidimensional and Multidimensional acquaintance
Comparative or Categorical help from experts
psychological tests (clinical): interview clinicians, patients, patient’s family
SCALING METHODS psychological test (personnel): interview w/ members of a targeted industry or
1. Rating Scale organization
judgement of the strength of a particular trait, attitude, psychological tests (school): interview w/ teachers, admin staff, etc.
or emotion are indicated by the test taker ITEM FORMAT
can be used to record judgements of oneself, others, selected and constructed response
experience, or objects form, plan, structure, arrangements, and layout
Summative Scale test score is obtained by summing the SELECTED-RESPONSE FORMAT
ratings across all the items select a response from a set of alternative responses
Likert Scale a summative rating scale; usually to scale achievement: examinees must select that is keyed as correct
attitudes trait: examinees must select the alternative that best answers the question w/
Unidimensional one dimension respect to themselves
Multidimensional more than one TYPES OF SRF
2. Method of Paired Comparisons Multiple Choice Format: has 3 elements: stem, a correct alternative, and
produces ordinal data distractors or fails
pairs of stimuli to compare
must choose one of the stimuli
higher score for selecting the option deemed more
justifiable by the majority of group judges
test score would reflect the number of times the choices
of the test taker agreed
it forces test takers to choose between items
3. Comparative Scaling
sorting that result into ordinal result
judgements of a stimulus in comparison w/ every other
stimulus on the scale
4. Categorical Scaling
sorting and placed into 2 or more alternative categories
5. Gutman Scale
ordinal-level measures
Matching Item: 2 columns: premises on the left and III. TEST TRYOUT
responses on the right; task: determine w/c response is tried out on people who have similar in critical respects to the people whom the
the best asscociated with which premise test was designed
homogenous consider the number of the testtaker
rule of thumb: no fewer than 5 subjects
a good test is reliable and valid; helps to discriminate testtakers
IV. ITEM ANALYSIS
statistical procedure used to analyze items
1. Item Difficulty Index
an item is not good if everyone gets it right or if everyone gets it wrong
p= item difficulty
subscript no. = item number
longer item difficulty index = the easier the item
optimal average difficulty: 0.5
0.3 below value= difficult item
0.8 above value= easy items
range of difficulty: 0.3 to 0.8
Chance Success Proportion
Binary- Choice Format true or false, agree or disagree, True or False
yes or no, right or wrong, fact or opinion Five-Option Multiple Choice Item
CONSTRUCTED RESPONSE FORMAT 2. Item Reliability Index
supply or create the answer internal consistency
TYPES OF CRF Factor Analysis whether items on a test appear to be measuring the same
Completion Item provide a word or phrase that thing(s)
completes the sentence if an item do not “load” on the factor, it might needed revision or be discarded
Essay Items respond to a question by writing a 3. Item Validity Index
composition, typically one that demonstrates recall of higher the item validity index, the grater the test’s criterion related validity
facts, understanding, analysis, and/or interpretations 4. Item Discrimination Index
how adequately an item separates of discriminates between high scores and low
WRITING ITEMS FOR COMPUTER ADMINISTRATION scores on an entire test
the ability to store items in an item bank and the ability
to individualize testing through a technique called item ITEM-CHARACTERISTIC CURVES
branching graphic representation of item difficulty and discrimination
Item Bank a relatively large and easily accessible the steeper the slope, the greater the item discrimination
collection of test questions easy item: shift the ICC to the left
difficult item: shift the ICC to the right
OTHER CONSIDERATIONS IN ITEM ANALYSIS
Guessing
Item Fairness degree, if any, a test item is biased

Speed Tests
COMPUTERIZED ADAPTIVE TESTING (CAT)

items presented are based in part on the testtaker’s
performance on previous items
reduce floor effects and ceiling effects
Floor Effect diminished utility of an assessment tool for
distinguishing testtaker’s at the low end
Ceiling Effect diminished utility of an assessment tool for
distinguishing testtaker’s at the high end QUALITATIVE ITEM ANALYSIS
ITEM BRANCHING Think Aloud
ability of the computer to tailor the content and order of the to shed light on the testtaker’s thought process during the administration of a
presentation of test items on the basis of responses to previous test
items one to one basis of administration
SCORING ITEMS achievement test: useful in accessing if certain students are misinterpreting a
1. Cumulative Scoring
particular item and why and how
items are summed and the higher the score, the higher the
Expert Panels
testtaker ability, trait, or offer characteristic
2. Class/ Category Scoring individually or in groups
responses earn credit toward placement in a particular
class w/ other testtakers
3. Ipsative Scoring
comparing testtakers score on one scale within a test to
another sclae w/in the same test
TR IN THE LIFE CYCLE OF AN EXISTING TEST
the APA offered the general suggestions that an existing test be kept in its
present form as long as it remains “useful” but that is should be revised “when
significant changes in the domain represented, or new conditions of test use and
interpretation, make the test inappropriate for its intended use”
CROSS VALIDATION
revalidation of a test on a sample of testtakers other than those on whom test
performance was originally found to be a solid predictor
SENSITIVITY REVIEW smaller item validates when administered to a second sample of testtakers
study of test items, in which items are examined for CO-VALIDATION
fairness to all prospective testtakers and for the a test validation process conducted on 2 or more tests using the same sample of
presence of offensive language, streotypes, or situations testtakers
standard part of test development Co-nornming used in conjunction w/ the creation of norms or the revision of
expert panels do this existing norms
QUALITY ASSURANCE DURING TEST REVISION
examinees adhere to standardized procedures
employ examiners who have experience in testing members of the population
targeted
all examiners will be trained to administer the instrument e
examiner or another scorer = resolver (resolves the discrepancy of the scores)
Anchor protocol for ensuring consistency in scoring; scored by a highly
authoritative scorer that is designed as a model of scoring and a mechanism for
resolving scoring discrepancies
Scoring drift discrepancy between scoring in an anchor protocol and the
scoring of another protocol
V. TEST REVISION
TR as a stage in new test development
some items may be eliminated and others may be
rewritten
APPROACHES:
1. Characterized each item according to its strengths
and weakness
many weaknesses, making them prime candidates for
deletion or revision
very easy and very difficult items have a restricted
range; all or almost all testtaker get them wrong. These
items tend to lack reliability and validity
1. EVALUATING THE PROCESS OF EXISTING TESTS
AND GUIFING TEST REVISION
2. DETERMINING MEASUREMENT EQUIVALENCE
ACROSS TESTTAKER POPULATIONS
IRT information curves can help test developers
evaluate how well an item is working to measure
used to weed out uninformative questions or to
eliminate redundant items
allows test developers to tailor an instrument to
provide high information
Differential Item Functioning am item functions
differently in one group of testtakers as compared to
another are known to have the same (or similar) level
of the underlying trait
DIF Analysis test developers scrutinize group by group
item response curves looking for DIF items
DIF items items that respondents from different
groups at the same level of underlying trait have
different probabilities of endorsing as a function of
their group membership
3. DEELOPING ITEM BANKS
begin with the collection of appropriate items from
existing instruments
new items may also be written
the items that make the cut after such scoring

constitute the preliminary item bank. Administered to
a large and representative sample
can be done by group administration by computer or
by administering individually using paper-and-pencil
methods
responses are evaluated w/ regard to variables such as
R, V, domain coverage, and DIF
final item bank will consist of a large set of items all
measuring a single domain.
IB can also be used for CAT
INSTRUCTOR-MADE TESTS FIR IN CLASS USE

informal methods of psychometric evaluation are
frequently used because the use of formal methods is
impractical
content validity are routinely addressed
criterion validity is difficult to establish
construct validity is also often assessed informally
test reliability can also be informally assessed
professors also attempt to reduce the administration
error by eliminating items or discussing / explaining
items that the students cannot understand
can also ask for 2nd opinion to reduce rater error

Chapter 8 Test Development

Uploaded by

Copyright:

Available Formats

You might also like

Chapter 8 Test Development

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chapter 8 Test Development

Uploaded by

Copyright:

Available Formats

PSYCHOLOGICAL ASSESSMENT

Item Fairness degree, if any, a test item is biased

COMPUTERIZED ADAPTIVE TESTING (CAT)

the items that make the cut after such scoring

INSTRUCTOR-MADE TESTS FIR IN CLASS USE

You might also like