
Test Development

Why Develop a New Test?


A new test may be developed to:
– meet the needs of a special group of test takers
– sample behaviours from a newly defined test domain
– improve the accuracy of test scores for their intended purpose
Existing tests also need periodic revision
First Four Steps
Defining the test universe, audience,
and purpose
Developing a test plan
Composing the test items
Writing the administration instructions
Continued Steps of Test Construction
Diagram of Test Construction (p. 234)

Constructing Scales
Piloting the Test
Standardizing the Test
Collecting Norms
Validation & Reliability Studies
Manual Writing
Test Revision
Defining the Test Universe,
Audience, & Purpose
Defining the test universe.
– prepare a working definition of the
construct
– locate studies that explain the construct
– locate current measures of the construct
Defining the Test Universe,
Audience, & Purpose
Defining the target audience.
– make a list of characteristics of persons
who will take the test--particularly those
characteristics that will affect how test
takers will respond to the test questions
(e.g., reading level, disabilities, honesty,
language)
Defining the Test Universe,
Audience, & Purpose
Defining the purpose.
– includes not only what the test will
measure, but also how scores will be used
– e.g., will scores be used to compare test
takers (normative approach) or to indicate
achievement (criterion approach)?
– e.g., will scores be used to test a theory or
to provide information about an individual?
Developing a Test Plan
A test plan includes a definition of the
construct, the content to be measured
(test domain), the format for the
questions, and how the test will be
administered and scored
Defining the Construct
Define construct after reviewing
literature about the construct and any
available measures
Operationalize in terms of observable
and measurable behaviours
Provides boundaries for the test domain
(what should and shouldn’t be included)
Specify approximate number of items
needed
Choosing the Test Format
Test format refers to the type of
questions the test will contain (usually
one format per test for ease of test
takers and scoring)
Test formats have two elements:
– a stimulus (e.g., a question or phrase)
– a mechanism for response (e.g., multiple choice, true/false); the format may be objective or subjective
Composing the Test Items
test items are the stimuli presented to
the test taker (may or may not take the
form of questions)
the form chosen depends on decisions
made in the test plan (e.g., purpose,
audience, method of administration,
scoring)
Test Types
Structured Response
– Multiple Choice
– True/False, Forced Choice
– Likert Scales

Free Response
– Essay, Short Answer
– Interview Questions
– Fill in the Blank
– Projective Techniques
Multiple Choice
Multiple choice most common in educational
testing (and also some personality and
employment testing)
– consists of a stem and a number of responses; there should be only one right answer
– the wrong answers are called distractors because they may appear correct; they should be realistic enough to appeal to an uninformed test taker
– easy to score, but the downside is that test takers can get some items correct by guessing
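
To see why the number of options matters, here is a minimal sketch (the item counts are hypothetical) of the score a test taker could expect from blind guessing alone:

```python
# Expected score from blind guessing: with k options per item,
# the chance of a correct guess is 1/k (item counts hypothetical).
def expected_guessing_score(n_items: int, n_options: int) -> float:
    return n_items / n_options

print(expected_guessing_score(40, 4))  # 10.0 -- a 40-item, 4-option test
print(expected_guessing_score(40, 5))  # 8.0  -- a 5th option lowers chance success
```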
Multiple Choice
Pros
• more answer options (4-5) reduce the chance of guessing an item correctly
• many items can aid in comparing students, reduce ambiguity, and increase reliability
Cons
• measures narrow facets of performance
• reading time increases with more answer options
• transparent grammatical clues (e.g., verb tense or the articles “a”/“an”) may enable guessing
• difficult to write four or five reasonable choices
• takes more time to write questions
True/False
True/False is also used in educational
testing and some personality testing
– in educational testing the test taker can
again gain some advantage by guessing
True/False (cont.)
Ideally, a true/false question should be constructed so that an incorrect response indicates something about the student's misunderstanding of the learning objective.
This may be a difficult task, especially when constructing a true statement.
Forced Choice Items
Forced-Choice is similar to multiple-choice but
is used in personality and attitude tests (e.g.,
MBTI)
– test taker must choose between unrelated but
equally acceptable responses
Forced Choice Items (cont.)
Example

Place an “X” in the space to the left of the word in each pair that best describes your personality.

1. ____ Sunny      2. ____ Outgoing
   ____ Friendly      ____ Loyal
Likert Scales
Likert scales are usually reliable and
highly popular (e.g., personality and
attitude tests)
– item is presented with an array of response
options (e.g., 1 to 5 or 1 to 7 scale), usually
on an agree/disagree or
approve/disapprove continuum
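
As a concrete illustration, here is a minimal sketch of a 5-point agree/disagree scale and its summed score (the anchors, item count, and responses are hypothetical):

```python
# A minimal sketch of a 5-point Likert item and its summed scale score
# (anchors and responses are hypothetical).
anchors = {1: "strongly disagree", 2: "disagree",
           3: "neither agree nor disagree", 4: "agree", 5: "strongly agree"}

responses = [4, 5, 3, 4]       # one test taker's answers to four items
print(anchors[responses[0]])   # "agree"
print(sum(responses))          # 16 out of a possible 20
```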
Test Types
Structured Response
– Advantages
Great Breadth
Quick Scoring

– Disadvantages
Limited Depth
Difficult to assess higher levels of skills
Guessing/Memorization vs. Knowledge
Subjective Items
subjective items are less easily scored
but provide the test taker with fewer
cues and open wider areas for
response--often used in education
– essay questions - responses can vary in
breadth and depth and scorer must
determine to what extent the response is
correct (often by examining match with
predetermined correct response)
Essay Questions
Provide a freedom of response that facilitates assessing higher cognitive behaviours (e.g., analysis and evaluation)
Allow respondents to focus on what they have learned rather than limiting them to specific questions
Interview Questions
– interview questions are often used in organizational settings--the interviewer decides what is a good or poor answer
– the test plan should be based on the knowledge, skills, abilities, and other characteristics required to perform the job
– this information can be obtained from a job description, a job analysis, or a current job incumbent
Projective Techniques
Projective techniques are often
employed in clinical settings
– uses a highly ambiguous stimulus to elicit
an unstructured response (i.e., the test
taker “projects” his or her perception and
perspective onto a neutral stimulus)
– a variety of stimuli may be used (e.g., pictures, words), and responses may be verbal or drawn
Sentence Completion
Sentence-Completion format presents an incomplete sentence that the test taker completes (e.g., “I feel happiest when …”)
Subjective tests are at risk for judgment error, so inter-rater reliability is of particular importance--scoring keys and rater training are important
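
Because subjective formats depend on scorer judgment, developers often quantify inter-rater agreement. One common index (our choice here, not named in the slides) is Cohen's kappa; the ratings below are hypothetical:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: two-rater agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical scores two raters gave the same ten essays (0-2 scale).
a = [2, 1, 0, 2, 1, 1, 2, 0, 1, 2]
b = [2, 1, 0, 1, 1, 1, 2, 0, 2, 2]
print(round(cohens_kappa(a, b), 2))  # ~0.69
```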
Test Types
Subjective Items
– Advantages
Can test higher cognitive skills
Encourage test takers to organize and develop their thoughts

– Disadvantages
Difficult to grade
Judgment error (e.g., inter-rater reliability)
Require an objective scoring key prepared in advance
Writing Good Items
Basic building block of test construction
Little attention is typically given to writing items
An art that requires originality and creativity combined with knowledge of the test domain and good item-writing practices
Not all items will perform as expected--they may be too easy or too difficult, may be misinterpreted, etc.
Rule of thumb: write at least twice as many items as you expect to use
Broad vs. Narrow items
Writing Good Items (cont.)
Suggestions:
– identify item topics by consulting test plan
(increases content validity)
– ensure that each item presents a central
idea or problem
– write items drawn only from testing
universe
– write each item in clear and direct manner
Writing Good Items (cont.)
Suggestions:
– use vocabulary and language appropriate for the target audience (e.g., age, culture)
– avoid sexist or racist language (e.g., mailman, fireman)
– make all items independent (e.g., one question per item)
– ask an expert to review items to reduce ambiguity and inaccuracy
Writing Administration
Instructions
specify the testing environment to
decrease variation or error in test scores
should address:
– group or individual administration
– requirements for location (e.g., quiet)
– required equipment
– time limits or approximate completion time
– script for administrator and answers to
questions test takers may ask
Specifying Administration and
Scoring Methods
determine how the test will be administered (e.g., orally, in writing, or by computer--individually or in groups)
also determine the method of scoring--whether the test is scored by hand by the test administrator, by accompanying scoring software, or by the test publisher
Scoring Methods
Cumulative model: most common
– assumes that the more a test taker responds in a
particular fashion the more he/she has of the
attribute being measured (e.g., more “correct”
answers, or endorses higher numbers on a Likert
scale)
– correct responses or responses on Likert scale are
summed
– yields interval data that can be interpreted with
reference to norms
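
A minimal sketch of cumulative scoring, summing keyed responses against an answer key (the key and responses are hypothetical):

```python
# Cumulative scoring: the raw score is the sum of keyed responses
# (answer key and responses are hypothetical).
answer_key = ["b", "a", "d", "c", "a"]
responses  = ["b", "a", "d", "a", "a"]

raw_score = sum(r == k for r, k in zip(responses, answer_key))
print(raw_score)  # 4 correct out of 5
```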
Scoring Methods (cont.)
Categorical model: places test takers in a group
– e.g., a particular pattern of responses may suggest a diagnosis of a certain psychological disorder
– typically yields nominal data because it places test takers in categories
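
A minimal sketch of categorical scoring: a response pattern maps to a nominal group label rather than a numeric score (the scale names, cut-off, and labels are hypothetical):

```python
# Categorical scoring: a response pattern yields a nominal group label,
# not a number (scale names, cut-off, and labels are hypothetical).
def categorize(scale_scores: dict) -> str:
    if scale_scores["anxiety"] >= 70 and scale_scores["mood"] >= 70:
        return "refer for clinical follow-up"
    return "no elevation"

print(categorize({"anxiety": 74, "mood": 71}))  # refer for clinical follow-up
```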
Scoring Methods (cont.)
Ipsative model: a test taker's scores are not compared to those of other test takers but rather to one another WITHIN the test taker (which scales are high and which are low)
– e.g., a test taker may complete a measure of interpersonal problems of various types, and the test administrator may want to determine which of the types the test taker feels is most problematic for him or her
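
A minimal sketch of ipsative scoring, ranking scale scores within a single test taker (the scale names and scores are hypothetical):

```python
# Ipsative scoring: compare scale scores WITHIN one test taker
# (scale names and scores are hypothetical).
scores = {"assertiveness": 12, "intimacy": 21, "control": 15}

ranked = sorted(scores, key=scores.get, reverse=True)
print("most problematic area:", ranked[0])    # intimacy (highest within person)
print("least problematic area:", ranked[-1])  # assertiveness
```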

The cumulative model may be combined with the categorical or ipsative models
Response Bias
In preparing an item review, each question can be evaluated from two perspectives: Is the item fair? Is the item biased?
Tests are subject to error, and one form of error comes from the test takers themselves
Response Sets/Styles
Patterns of responding that result in misleading information and limit the accuracy and usefulness of the test scores
Reasons for misleading information:
1. The information requested is too personal
2. Test takers distort their responses
3. Test takers answer items carelessly
4. Test takers may feel coerced into completing the test
Response Style

– People always agree (acquiescence) or always disagree (criticalness) with statements without attending to the actual content
– Usually occurs when items are ambiguous

Solution: use both positively- and negatively-keyed items
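
A minimal sketch of how negatively keyed items are reverse-scored on a 1-5 scale so that acquiescent responding does not simply inflate the total (item positions and responses are hypothetical):

```python
# Reverse-score negatively keyed items on a 1-5 Likert scale
# (responses and keyed positions are hypothetical).
responses = [5, 4, 2, 5, 1]
negatively_keyed = {2, 4}            # 0-based indices of reverse-keyed items

scored = [6 - r if i in negatively_keyed else r
          for i, r in enumerate(responses)]
print(scored)  # [5, 4, 4, 5, 5] -- items at index 2 and 4 flipped
```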
Social Desirability
Some test takers choose socially acceptable answers or present themselves in a favourable light
People often attend less to the trait being measured than to the social acceptability of the statement
This represents unwanted variance
Social Desirability (cont.)
Example items:
– Friends would call me spontaneous.
– People I know can count on me to finish what I
start.
– I would rather work in a group than by myself.
– I often get stressed-out in many situations.
Faking
Faking: some test takers may respond in a particular way to cause a desired outcome
– may “fake good” (e.g., in employment settings) to create a favourable impression
– may “fake bad” (e.g., in clinical or forensic settings) as a cry for help or to appear mentally disturbed
– developers may include subtle questions that are difficult to fake because they aren't clearly face valid
“Faking Bad”
– People try to look worse than they really are
A common problem in clinical settings

– Reasons:
Cry for help
Want to plead insanity in court
Want to avoid the military draft
Want to show psychological damage

– Most people who fake bad overdo it
Impression Management
– Mitigating IM:
Use positive and negative impression scales (endorsed by 10% of the population)
Use lie scales to “flag” those who score high (e.g., “I get angry sometimes”)
Use inconsistency scales (e.g., two different responses to two similar questions)
Use multiple assessment methods (other than self-report)
Random Responding
Random responding may occur when test takers are unwilling or unable to respond accurately
– likely to occur when the test taker lacks the skills (e.g., reading), does not want to be evaluated, or lacks attention to the task
– try to detect it by embedding a scale that yields clear results from the vast majority of test takers, such that a different result suggests the test taker wasn't cooperating
Random Responding
– Detection:

Duplicate items:
“I love my mother.”
“I hate my mother.”

Infrequency scales:
“I’ve never had hair on my head.”
“I have not seen a car in 10 years.”
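
A minimal sketch of how duplicate items and an infrequency scale might be combined into a flag (the item keys and decision rule are hypothetical):

```python
# Flag possible random responding from duplicate/opposite items plus an
# infrequency scale (item keys and decision rule are hypothetical).
def flag_random(r: dict) -> bool:
    # Opposite duplicate items: endorsing both suggests inattention.
    inconsistent = r["love_mother"] and r["hate_mother"]
    # Infrequency items are virtually never true; endorsing any is a flag.
    infrequent = r["never_had_hair"] or r["no_car_in_10_years"]
    return inconsistent or infrequent

print(flag_random({"love_mother": True, "hate_mother": True,
                   "never_had_hair": False, "no_car_in_10_years": False}))  # True
```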
Random Responding
– May occur for several reasons:
People are not motivated to participate
Reading or language difficulties
Do not understand instructions / item content
Too confused or disturbed to respond
appropriately
Piloting and Revising Tests
can’t assume the test will perform as
expected
pilot test scientifically investigates the
test’s reliability and validity
administer test to sample from target
audience
analyze data and revise test to fix any
problems uncovered--many aspects to
consider
Setting Up the Pilot Test
test situation should match actual
circumstances in which test will be used
(e.g., in sample characteristics, setting)
developers must follow the American
Psychological Association’s codes of
ethics (e.g., strict rules of confidentiality
and publish only aggregate results)
Conducting the Pilot Test
the depth and breadth of the pilot depend on the size
and complexity of the target audience
adhere strictly to test procedures
outlined in test administration
instructions
generally require large sample
may ask participants about the testing
experience
Analyzing the Results
can gather both quantitative and
qualitative information
use quantitative information for such things as item characteristics, internal consistency, convergent and discriminant validity, and in some instances predictive validity
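
For internal consistency, one common statistic is Cronbach's alpha (our choice of index; the pilot data below are hypothetical):

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha from a list of item-score columns."""
    k = len(items)
    item_vars = sum(statistics.pvariance(col) for col in items)
    totals = [sum(vals) for vals in zip(*items)]  # each person's total score
    total_var = statistics.pvariance(totals)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical pilot data: 3 items scored for 5 test takers.
items = [[4, 5, 3, 4, 2],
         [4, 4, 3, 5, 2],
         [3, 5, 2, 4, 3]]
print(round(cronbach_alpha(items), 2))  # ~0.87
```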
Revising the Test
Choosing the final items requires
weighing each item’s content validity,
item difficulty and discrimination, inter-
item correlation, and bias
when new items need to be added or
items need to be revised, the items
must again be pilot tested to ensure
that the changes produced the desired
results
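
A minimal sketch of two of these item statistics: difficulty as the proportion answering correctly, and discrimination as the item-total correlation (the pilot data are hypothetical; statistics.correlation requires Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

def item_difficulty(item_scores):
    # p-value of the item: 0 (everyone wrong) to 1 (everyone right)
    return sum(item_scores) / len(item_scores)

def item_discrimination(item_scores, total_scores):
    # item-total correlation: higher means the item separates high and
    # low scorers better
    return correlation(item_scores, total_scores)

# Hypothetical pilot data: 1 = correct, 0 = incorrect, six test takers.
item   = [1, 1, 0, 1, 0, 0]
totals = [38, 35, 22, 30, 18, 25]
print(item_difficulty(item))                      # 0.5
print(round(item_discrimination(item, totals), 2))  # ~0.9
```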
Validation and Cross-Validation
Validation is the process of obtaining
evidence that the test effectively measures
what it is supposed to measure (i.e.,
reliability and validity)
the first part, establishing content validity, is carried out as the test is developed--whether the test measures the construct (construct validity) and predicts an outside criterion is determined in subsequent data collection
Validation and Cross-Validation
when the final revision of a test yields
scores with sufficient evidence of reliability
and validity, test developers then conduct
cross-validation--a final round of test
administration to another sample
because of chance factors the reliability
and validity coefficients will likely be
smaller in the new sample--referred to as
shrinkage
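
A minimal sketch of shrinkage: the test-criterion correlation is recomputed in a fresh cross-validation sample and typically comes out lower (all data below are hypothetical):

```python
# Shrinkage illustrated: the validity coefficient (test-criterion
# correlation) is recomputed in a new sample (all data hypothetical).
from statistics import correlation  # Python 3.10+

dev_test, dev_criterion = [12, 15, 9, 20, 17], [50, 58, 45, 75, 62]
new_test, new_criterion = [11, 16, 10, 19, 18], [52, 58, 49, 70, 60]

r_dev = correlation(dev_test, dev_criterion)
r_new = correlation(new_test, new_criterion)
print(round(r_dev, 2), round(r_new, 2))  # e.g. 0.98 vs 0.92 -- r shrinks
```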
