Test Development
Constructing Scales
Piloting the Test
Standardizing the Test
Collecting Norms
Validation & Reliability Studies
Manual Writing
Test Revision
Defining the Test Universe,
Audience, & Purpose
Defining the test universe.
– prepare a working definition of the
construct
– locate studies that explain the construct
– locate current measures of the construct
Defining the Test Universe,
Audience, & Purpose
Defining the target audience.
– make a list of characteristics of persons
who will take the test--particularly those
characteristics that will affect how test
takers will respond to the test questions
(e.g., reading level, disabilities, honesty,
language)
Defining the Test Universe,
Audience, & Purpose
Defining the purpose.
– includes not only what the test will
measure, but also how scores will be used
– e.g., will scores be used to compare test
takers (normative approach) or to indicate
achievement (criterion approach)?
– e.g., will scores be used to test a theory or
to provide information about an individual?
Developing a Test Plan
A test plan includes a definition of the
construct, the content to be measured
(test domain), the format for the
questions, and how the test will be
administered and scored
Defining the Construct
Define construct after reviewing
literature about the construct and any
available measures
Operationalize in terms of observable
and measurable behaviours
Provides boundaries for the test domain
(what should and shouldn’t be included)
Specify approximate number of items
needed
Choosing the Test Format
Test format refers to the type of
questions the test will contain (usually
one format per test for ease of test
takers and scoring)
Test formats have two elements:
– stimulus (e.g., a question or phrase)
– mechanism for response (e.g., multiple
choice, true-false). May be objective or
subjective test format
Composing the Test Items
test items are the stimuli presented to
the test taker (may or may not take the
form of questions)
the form chosen depends on decisions
made in the test plan (e.g., purpose,
audience, method of administration,
scoring)
Test Types
Structured Response
– Multiple Choice
– True/False, Forced Choice
– Likert Scales
Free Response
– Essay, Short Answer
– Interview Questions
– Fill in the Blank
– Projective Techniques
Multiple Choice
Multiple choice most common in educational
testing (and also some personality and
employment testing)
– consists of a stem and a number of responses--
should only be one right answer
– the wrong answers are called distractors because
they may appear correct--should be realistic
enough to appeal to uninformed test taker
– easy scoring but downside is that test takers can
get some correct by guessing
Multiple Choice
Pros
• more answer options (4-5) reduce the chance of
guessing the correct answer
• many items can aid in student comparison and
reduce ambiguity, increase reliability
Cons
• measures narrow facets of performance
• reading time increased with more answers
• transparent clues (e.g., verb tense or use of
“a” vs. “an”) may help test takers guess correctly
• difficult to write four or five reasonable choices
• takes more time to write questions
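The guessing problem noted above can be made concrete with a little arithmetic. A minimal sketch (the item and option counts are hypothetical) of the expected score from blind guessing:

```python
# Expected score from blind guessing on a multiple-choice test.
# Item counts and option counts below are illustrative, not from the text.

def expected_guessing_score(n_items: int, n_options: int) -> float:
    """Expected number of items answered correctly by pure guessing."""
    return n_items * (1 / n_options)

# On a hypothetical 40-item test, guessing gains less as options increase.
print(expected_guessing_score(40, 2))  # true/false: 20.0
print(expected_guessing_score(40, 4))  # four options: 10.0
print(expected_guessing_score(40, 5))  # five options: 8.0
```

With four or five options a blind guesser earns only a fifth to a quarter of the items, which is why realistic distractors matter.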
True/False
True/False is also used in educational
testing and some personality testing
– in educational testing the test taker can
again gain some advantage by guessing
True/False (cont.)
Ideally a true/false question should be
constructed so that an incorrect response
indicates something about the student's
misunderstanding of the learning objective.
– Disadvantages
Limited Depth
Difficult to assess higher levels of skills
Guessing/Memorization vs. Knowledge
Subjective Items
subjective items are less easily scored
but provide the test taker with fewer
cues and open wider areas for
response--often used in education
– essay questions - responses can vary in
breadth and depth and scorer must
determine to what extent the response is
correct (often by examining match with
predetermined correct response)
Essay Questions
Provide a freedom of response that
facilitates assessing higher cognitive
behaviors (e.g., analysis and evaluation)
– Disadvantages
Difficult to Grade
Judgement error (e.g., interrater reliability)
Require an objective scoring key prepared in advance
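Judgement error in essay grading is often quantified with an interrater agreement statistic. A minimal sketch of Cohen's kappa for two raters (the raters and pass/fail scores below are made up for illustration):

```python
# Cohen's kappa: agreement between two raters, corrected for chance.
# Hypothetical data: two raters score the same five essays pass/fail.

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters over the same items (nominal codes)."""
    assert len(r1) == len(r2)
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    categories = set(r1) | set(r2)
    expected = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

rater_a = ["pass", "pass", "fail", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.62
```

Here the raters agree on 4 of 5 essays, but kappa discounts the agreement expected by chance, giving a more conservative 0.62.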
Writing Good Items
Basic building block of test construction
Little attention given to writing items
an art that requires originality and creativity combined
with knowledge of the test domain and good item-writing
practices
not all items will perform as expected--may be too
easy or difficult, may be misinterpreted, etc.
rule of thumb to write at least twice as many items as
you expect to use
Broad vs. Narrow items
Writing Good Items (cont.)
Suggestions:
– identify item topics by consulting test plan
(increases content validity)
– ensure that each item presents a central
idea or problem
– write items drawn only from testing
universe
– write each item in clear and direct manner
Writing Good Items (cont..)
Suggestions:
– use vocabulary and language appropriate for
the target audience (e.g., age, culture)
– take into account sexist or racist language
(e.g., mailman, fireman)
– make all items independent (e.g., the answer to
one item should not depend on another item)
– ask an expert to review items to reduce
ambiguity and inaccuracy
Writing Administration
Instructions
specify the testing environment to
decrease variation or error in test scores
should address:
– group or individual administration
– requirements for location (e.g., quiet)
– required equipment
– time limits or approximate completion time
– script for administrator and answers to
questions test takers may ask
Specifying Administration and
Scoring Methods
determine such things as how test will
be administered (e.g., orally, written,
computer--individually or in groups)
method of scoring--whether scored by hand by
the test administrator, by accompanying
scoring software, or by the test publisher
Scoring Methods
Cumulative model: most common
– assumes that the more a test taker responds in a
particular fashion the more he/she has of the
attribute being measured (e.g., more “correct”
answers, or endorses higher numbers on a Likert
scale)
– correct responses or responses on Likert scale are
summed
– yields interval data that can be interpreted with
reference to norms
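The cumulative model described above amounts to summing item responses. A minimal sketch (the item names and the reverse-keyed set are hypothetical), which also flips reverse-keyed Likert items before summing:

```python
# Cumulative scoring: sum Likert responses into a total score.
# Item names and the reverse-keyed set are hypothetical examples.

def score_cumulative(responses: dict, reverse_keyed: set,
                     scale_max: int = 5) -> int:
    """Sum Likert responses, flipping reverse-keyed items first."""
    total = 0
    for item, value in responses.items():
        if item in reverse_keyed:
            value = (scale_max + 1) - value  # e.g., 5 -> 1, 4 -> 2
        total += value
    return total

responses = {"q1": 4, "q2": 2, "q3": 5, "q4": 1}
print(score_cumulative(responses, reverse_keyed={"q4"}))  # 4+2+5+5 = 16
```

The resulting total is the interval-level score that is then interpreted against norms.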
Scoring Methods (cont.)
Categorical model: place test takers in a group
Detecting faking and response bias:
– Reasons test takers may “fake bad”:
Cry for help
Want to plead insanity in court
Want to avoid being drafted into the military
Want to show psychological damage
Duplicate items (check response consistency):
“I love my mother.”
“I hate my mother.”
Infrequency scales (items rarely true of anyone):
“I’ve never had hair on my head.”
“I have not seen a car in 10 years.”
Random Responding
– May occur for several reasons:
People are not motivated to participate
Reading or language difficulties
Do not understand instructions / item content
Too confused or disturbed to respond
appropriately
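The checks above--duplicate items, infrequency scales, random responding--can be automated when screening protocols. A sketch with hypothetical item names and arbitrary flag thresholds (not from the text):

```python
# Screening one respondent's protocol for biased or random responding.
# Item names, pairs, and thresholds are hypothetical illustrations.

# Opposite-item pairs: consistent respondents should answer these in
# roughly opposite directions (e.g., "I love/hate my mother").
OPPOSITE_PAIRS = [("love_mother", "hate_mother")]
INFREQUENCY_ITEMS = ["never_had_hair", "no_car_10_years"]  # rarely true

def flag_protocol(responses: dict, scale_max: int = 5) -> list:
    """Return a list of validity flags for one respondent."""
    flags = []
    for a, b in OPPOSITE_PAIRS:
        # Opposite items should sum to roughly scale_max + 1.
        if abs((responses[a] + responses[b]) - (scale_max + 1)) > 2:
            flags.append(f"inconsistent: {a}/{b}")
    endorsed = sum(responses[i] >= 4 for i in INFREQUENCY_ITEMS)
    if endorsed >= 2:
        flags.append("high infrequency-scale score")
    return flags

resp = {"love_mother": 5, "hate_mother": 5,
        "never_had_hair": 5, "no_car_10_years": 4}
print(flag_protocol(resp))
```

A flagged protocol is not discarded automatically; it signals that the scores may not reflect the attribute being measured.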
Piloting and Revising Tests
can’t assume the test will perform as
expected
pilot test scientifically investigates the
test’s reliability and validity
administer test to sample from target
audience
analyze data and revise test to fix any
problems uncovered--many aspects to
consider
Setting Up the Pilot Test
test situation should match actual
circumstances in which test will be used
(e.g., in sample characteristics, setting)
developers must follow the American
Psychological Association’s codes of
ethics (e.g., strict rules of confidentiality
and publish only aggregate results)
Conducting the Pilot Test
depth and breadth depend on the size
and complexity of the target audience
adhere strictly to test procedures
outlined in test administration
instructions
generally require large sample
may ask participants about the testing
experience
Analyzing the Results
can gather both quantitative and
qualitative information
use quantitative information for such
things as item characteristics, internal
consistency, convergent and
discriminant validity, and in some
instances predictive validity
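Two of the quantitative checks mentioned--item difficulty and internal consistency--can be computed directly from a 0/1 response matrix. A minimal sketch using made-up pilot data:

```python
# Item difficulty and Cronbach's alpha from pilot data.
# The 0/1 response matrix below is hypothetical.
from statistics import pvariance

# rows = test takers, columns = items (1 = correct, 0 = incorrect)
data = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

def item_difficulty(matrix):
    """Proportion answering each item correctly (higher = easier)."""
    n = len(matrix)
    return [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]

def cronbach_alpha(matrix):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(matrix[0])
    item_vars = [pvariance([row[j] for row in matrix]) for j in range(k)]
    total_var = pvariance([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

print(item_difficulty(data))  # [0.8, 0.6, 0.2, 0.8]
print(round(cronbach_alpha(data), 2))  # 0.41
```

Here item 3 (difficulty 0.2) may be too hard, and the low alpha suggests the items are not yet measuring one construct consistently--exactly the kind of problem pilot analysis is meant to uncover.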
Revising the Test
Choosing the final items requires
weighing each item’s content validity,
item difficulty and discrimination, inter-
item correlation, and bias
when new items need to be added or
items need to be revised, the items
must again be pilot tested to ensure
that the changes produced the desired
results
Validation and Cross-Validation
Validation is the process of obtaining
evidence that the test effectively measures
what it is supposed to measure (i.e.,
reliability and validity)
the first part--establishing content validity--is
carried out as the test is developed; whether the
test measures the construct (construct validity)
and predicts an outside criterion (criterion-related
validity) is determined in subsequent data collection
Validation and Cross-Validation
when the final revision of a test yields
scores with sufficient evidence of reliability
and validity, test developers then conduct
cross-validation--a final round of test
administration to another sample
because of chance factors the reliability
and validity coefficients will likely be
smaller in the new sample--referred to as
shrinkage
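One common way to anticipate shrinkage before cross-validating is the Wherry adjustment to R-squared, which estimates how much a validity coefficient is inflated by chance in the derivation sample. The sample values below are hypothetical:

```python
# Wherry adjustment: estimated population R^2 given sample R^2,
# sample size n, and number of predictors k. Inputs are hypothetical.

def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Wherry formula: R^2_adj = 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2_pilot = 0.40  # hypothetical R^2 from the derivation sample
print(round(adjusted_r_squared(r2_pilot, n=60, k=5), 3))  # 0.344
```

The drop from 0.40 to about 0.34 is the expected shrinkage; a cross-validation sample provides a direct empirical check of the same phenomenon.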