Chapter 8 Test Development
Some basic Info from Sinag’s Reviewer: CHAPTER 8 TEST DEVELOPMENT (PART 1)
I. TEST CONCEPTUALIZATION
- the stimulus that sparks the conceptualization of a test could be almost anything:
- a review of the available literature
- an emerging social phenomenon
- a response to a need to assess mastery in an emerging occupation or profession

NORM-REFERENCED VS. CRITERION-REFERENCED TESTS: ITEM DEVELOPMENT ISSUES
- Norm-referenced: a good item is one that high scorers respond to correctly and low scorers respond to incorrectly; this approach is insufficient and inappropriate when knowledge of mastery is what is needed
- Criterion-referenced: a good item is one that best discriminates between the two groups; each item addresses the issue of whether the testtaker has met certain criteria; requires conceptualization of the knowledge or skills to be measured

PILOT WORK
- preliminary research surrounding the creation of a prototype of the test
- attempts to determine how best to measure a targeted construct
- entails literature reviews, experimentation, and the creation, revision, and deletion of items
- a necessity for tests intended for publication and wide distribution
- not necessary for developing a teacher-made test

II. TEST CONSTRUCTION
Scaling: the process of setting rules for assigning numbers in measurement; numbers are assigned to different amounts of whatever is being measured
- L.L. Thurstone's "A Method of Scaling" introduced the notion of absolute scaling

TYPES OF SCALES
- Age-Based Scale
- Grade-Based Scale
- Stanine Scale
- Unidimensional and Multidimensional
- Comparative or Categorical

SCALING METHODS
The choice of scaling method depends on many variables: the variables being measured, the group for whom the test is intended, and the preferences of the test developer.
1. Rating Scale
- judgements of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
- can be used to record judgements of oneself, others, experiences, or objects
- Summative Scale: the test score is obtained by summing the ratings across all the items
- Likert Scale: a summative rating scale, usually used to scale attitudes; a scaling method of the direct estimation variety, since there is no need to transform the testtakers' responses into some other scale
- Unidimensional: measures one dimension; Multidimensional: measures more than one
2. Method of Paired Comparisons
- produces ordinal data
- testtakers are presented with pairs of stimuli to compare and must choose one stimulus from each pair
- a higher score results from selecting the option deemed more justifiable by the majority of a group of judges; the test score reflects the number of times the testtaker's choices agreed with those of the judges
- forces testtakers to choose between items
3. Comparative Scaling
- entails judgements of a stimulus in comparison with every other stimulus on the scale
- a sorting task that yields ordinal data
4. Categorical Scaling
- stimuli are sorted and placed into two or more alternative categories
5. Guttman Scale
- yields ordinal-level measures

SCALOGRAM ANALYSIS
- the process used to analyze the resulting data; an item-analysis procedure and approach to test development that involves a graphic mapping of testtakers' responses
- objective: obtain an arrangement of items wherein endorsement of one item automatically connotes endorsement of less extreme positions

WRITING ITEMS
- What range of content should the items cover?
- Which of the many different item formats should be employed?
- How many items should be written in total and for each content area covered?
- For the multiple-choice format, it is advisable that the first draft contain approximately twice the number of items as the final version.
Item Pool: the reservoir or well from which items will or will not be drawn for the final version of the test
- good item sampling = good content validity
- the test developer needs to ensure that the final version also contains items that adequately sample the domain; poorly written items should be rewritten, or the developer should create new items

HOW TO DEVELOP ITEMS
- write a large number of items from personal experience or academic acquaintance
- seek help from experts
- psychological tests (clinical): interview clinicians, patients, and patients' families
- psychological tests (personnel): interviews with members of a targeted industry or organization
- psychological tests (school): interviews with teachers, administrative staff, etc.

ITEM FORMAT
- the form, plan, structure, arrangement, and layout of individual test items
- two broad types: selected-response and constructed-response

SELECTED-RESPONSE FORMAT
- the testtaker selects a response from a set of alternative responses
- achievement tests: examinees must select the alternative that is keyed as correct
- trait measures: examinees must select the alternative that best answers the question with respect to themselves

TYPES OF SRF
- Multiple-Choice Format: has three elements: a stem, a correct alternative, and distractors or foils
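The summative (Likert) scoring described above can be sketched in a few lines. This is a minimal illustration, not code from any particular test library; the function name and the idea of a set of reverse-keyed items are my own assumptions for the example.

```python
# Sketch of summative (Likert) scale scoring, assuming 5-point
# ratings (1-5) and a hypothetical set of reverse-keyed item indices.

def score_likert(ratings, reverse_keyed=(), scale_max=5):
    """Sum ratings across all items; reverse-keyed items are flipped
    so that a higher total always means 'more' of the attitude."""
    total = 0
    for i, rating in enumerate(ratings):
        if i in reverse_keyed:
            rating = scale_max + 1 - rating  # e.g., 5 -> 1, 2 -> 4 on a 1-5 scale
        total += rating
    return total

# A testtaker's ratings on a 4-item attitude scale (item index 2 reverse-keyed):
print(score_likert([4, 5, 2, 3], reverse_keyed={2}))  # 4 + 5 + (6-2) + 3 = 16
```

Because the final score is just the sum of (possibly reversed) ratings, no transformation of the testtaker's responses into another scale is needed, which is what makes this a direct-estimation method.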
PSYCHOLOGICAL ASSESSMENT
Some basic Info from Sinag’s Reviewer: CHAPTER 8 TEST DEVELOPMENT (PART 2)
- Matching Item: two columns: premises on the left and responses on the right; the task is to determine which response is best associated with which premise; entries should be homogeneous
- Binary-Choice Format: true/false, agree/disagree, yes/no, right/wrong, fact/opinion

CONSTRUCTED-RESPONSE FORMAT
- the testtaker supplies or creates the answer

TYPES OF CRF
- Completion Item: provide a word or phrase that completes a sentence
- Essay Item: respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation

WRITING ITEMS FOR COMPUTER ADMINISTRATION
- offers the ability to store items in an item bank and the ability to individualize testing through a technique called item branching
- Item Bank: a relatively large and easily accessible collection of test questions

III. TEST TRYOUT
- the test is tried out on people who are similar in critical respects to the people for whom the test was designed
- consider the number of testtakers; rule of thumb: no fewer than 5 subjects per item
- a good test is reliable and valid and helps to discriminate between testtakers

IV. ITEM ANALYSIS
- the statistical procedures used to analyze items
1. Item Difficulty Index
- an item is not good if everyone gets it right or if everyone gets it wrong
- p = item difficulty; the subscript number = the item number
- the larger the item-difficulty index, the easier the item
- optimal average difficulty: 0.5; values below 0.3 = difficult items; values above 0.8 = easy items; preferred range of difficulty: 0.3 to 0.8
- Chance Success Proportion: the proportion of testtakers expected to answer correctly by guessing alone; the optimal average difficulty adjusted for guessing lies midway between the chance success proportion and 1.00 (for a five-option multiple-choice item, chance success is .20, so the optimum is about .60)
2. Item Reliability Index
- an indication of the internal consistency of the test
- Factor Analysis: used to determine whether items on a test appear to be measuring the same thing(s); if an item does not "load" on the intended factor, it may need revision or be discarded
3. Item Validity Index
- the higher the item-validity index, the greater the test's criterion-related validity
4. Item Discrimination Index
- indicates how adequately an item separates or discriminates between high scorers and low scorers on an entire test

ITEM-CHARACTERISTIC CURVES
- a graphic representation of item difficulty and discrimination
- the steeper the slope, the greater the item discrimination
- an easy item shifts the ICC to the left; a difficult item shifts the ICC to the right

OTHER CONSIDERATIONS IN ITEM ANALYSIS
- Guessing

CROSS-VALIDATION
- the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor
- item validities tend to become smaller when the items are administered to a second sample of testtakers (validity shrinkage)

CO-VALIDATION
- a test validation process conducted on two or more tests using the same sample of testtakers
- Co-norming: co-validation used in conjunction with the creation of norms or the revision of existing norms

QUALITY ASSURANCE DURING TEST REVISION
- examiners must adhere to standardized procedures
- employ examiners who have experience in testing members of the targeted population
- all examiners should be trained to administer the instrument
- examiner or another scorer = resolver (resolves discrepancies between scores)
- Anchor Protocol: a protocol for ensuring consistency in scoring; scored by a highly authoritative scorer; designed as a model of scoring and a mechanism for resolving scoring discrepancies
- Scoring Drift: a discrepancy between the scoring in an anchor protocol and the scoring of another protocol

V. TEST REVISION
- test revision as a stage in new test development: some items may be eliminated and others may be rewritten
- one approach: characterize each item according to its strengths and weaknesses; items with many weaknesses are prime candidates for deletion or revision
- very easy and very difficult items have a restricted range (all or almost all testtakers get them right or get them wrong); these items tend to lack reliability and validity
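The item difficulty and item discrimination statistics above can be sketched as simple proportions. This is an illustrative sketch, assuming responses scored 1 (correct) / 0 (incorrect); the function names are mine, not from any psychometrics library.

```python
# Sketch of two classical item-analysis statistics, assuming scored
# responses coded 1 (correct) / 0 (incorrect).

def item_difficulty(item_responses):
    """p = proportion of testtakers answering the item correctly.
    A larger p means an easier item."""
    return sum(item_responses) / len(item_responses)

def optimal_difficulty(n_options):
    """Optimal average difficulty adjusted for guessing: the midpoint
    between the chance success proportion (1/n_options) and 1.00."""
    chance = 1 / n_options
    return (chance + 1.0) / 2

def discrimination_index(upper_responses, lower_responses):
    """d = p(upper group) - p(lower group): how well the item separates
    high scorers from low scorers on the entire test."""
    return item_difficulty(upper_responses) - item_difficulty(lower_responses)

print(item_difficulty([1, 1, 0, 1, 0]))                  # 3/5 = 0.6
print(optimal_difficulty(5))                             # 0.6 for five-option items
print(discrimination_index([1, 1, 1, 0], [1, 0, 0, 0]))  # 0.75 - 0.25 = 0.5
```

Note how the five-option optimum (.60) falls inside the preferred 0.3 to 0.8 difficulty range, and how a discrimination index near zero would flag an item that fails to separate high and low scorers.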
Some basic Info from Sinag’s Reviewer: CHAPTER 8 TEST DEVELOPMENT (PART 4)
1. EVALUATING THE PROPERTIES OF EXISTING TESTS AND GUIDING TEST REVISION
2. DETERMINING MEASUREMENT EQUIVALENCE
ACROSS TESTTAKER POPULATIONS
IRT information curves can help test developers evaluate how well an item works to measure different levels of the underlying construct
can be used to weed out uninformative questions or to eliminate redundant items
allows test developers to tailor an instrument to provide high information (measurement precision) where it is most needed
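The item-characteristic and information curves that IRT works with can be sketched with the two-parameter logistic (2PL) model. This is a minimal illustration; the parameter values are made up for the example.

```python
# Sketch of a two-parameter logistic (2PL) item characteristic curve
# and its item information function; a and b values are illustrative.
import math

def icc(theta, a, b):
    """P(correct | theta) under the 2PL model: a = discrimination
    (slope of the curve), b = difficulty (location on the trait axis)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * P * (1 - P).
    Information peaks where theta equals the item difficulty b."""
    p = icc(theta, a, b)
    return a * a * p * (1 - p)

# At theta == b, P = 0.5, so information is at its maximum (a^2 / 4):
print(icc(0.0, a=1.5, b=0.0))               # 0.5
print(item_information(0.0, a=1.5, b=0.0))  # 1.5^2 * 0.25 = 0.5625
```

The steeper-slope rule from the ICC section shows up here directly: a larger `a` gives a steeper curve and more information, while shifting `b` up or down slides the curve right (harder) or left (easier).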
Differential Item Functioning (DIF): an item functions differently in one group of testtakers as compared to another group whose members are known to have the same (or similar) level of the underlying trait
DIF Analysis: test developers scrutinize, group by group, the item-response curves, looking for DIF items
DIF Items: items that respondents from different groups, at the same level of the underlying trait, have different probabilities of endorsing as a function of their group membership
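The group-by-group comparison behind a DIF analysis can be sketched crudely: at each matched level of the underlying trait, compare the proportion of each group endorsing the item. The data, helper names, and flagging threshold below are all illustrative assumptions, not a standard DIF statistic.

```python
# Crude DIF screen (illustrative): at each matched trait level, compare
# each group's endorsement proportion. A persistent gap suggests the
# item may function differently by group membership.

def endorsement_by_level(records):
    """records: list of (trait_level, endorsed) pairs for one group.
    Returns {trait_level: proportion endorsing the item}."""
    totals, hits = {}, {}
    for level, endorsed in records:
        totals[level] = totals.get(level, 0) + 1
        hits[level] = hits.get(level, 0) + int(endorsed)
    return {lvl: hits[lvl] / totals[lvl] for lvl in totals}

def flag_dif(group_a, group_b, threshold=0.25):
    """Flag trait levels where the two groups' endorsement rates differ
    by more than the (illustrative) threshold."""
    pa, pb = endorsement_by_level(group_a), endorsement_by_level(group_b)
    return {lvl: abs(pa[lvl] - pb[lvl])
            for lvl in pa.keys() & pb.keys()
            if abs(pa[lvl] - pb[lvl]) > threshold}

group_a = [(1, 0), (1, 0), (2, 1), (2, 1)]  # endorses at trait level 2
group_b = [(1, 0), (1, 0), (2, 0), (2, 0)]  # same trait levels, rarely endorses
print(flag_dif(group_a, group_b))           # {2: 1.0}
```

Matching on trait level first is the key point: a raw difference in endorsement rates is not DIF unless it persists among testtakers with the same level of the underlying trait.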
3. DEVELOPING ITEM BANKS
begin with the collection of appropriate items from
existing instruments
new items may also be written