
PSY358-Week 13

Test Development

© McGraw Hill 1
Test Development

Test development is an umbrella term for all that goes into the process of creating a test.

The five stages of test development.


© McGraw Hill 2
Test Conceptualization

• The thought that “there ought to be a test for…” is the impetus for developing a new test.
• The stimulus could be knowledge of psychometric problems with existing tests, a new social phenomenon, or any number of other things.
• There may be a need to assess mastery in an emerging occupation.

© McGraw Hill 3
Some preliminary questions include:

• What is the test designed to measure?
• What is the objective of the test?
• Is there a need for this test?
• Who will use this test?
• Who will take this test?
• What content will the test cover?
• How will the test be administered?
• What is the ideal format of the test?
• Should more than one form of the test be developed?
• What special training will be required of test users for administering or interpreting the test?
• What types of responses will be required of test takers?
• Who benefits from an administration of this test?
• Is there any potential for harm as the result of an administration of this test?
• How will meaning be attributed to scores on this test?

© McGraw Hill 4
Norm-referenced versus criterion-referenced tests: Item development issues
Generally, a good item on a norm-referenced achievement test is an item
for which high scorers on the test respond correctly and low scorers
respond incorrectly.

Ideally, each item on a criterion-oriented test addresses the issue of whether the respondent has met certain criteria.
• Development of a criterion-referenced test may entail exploratory work
with at least two groups of test takers.
• The first group known to have mastered the knowledge or skill being
measured.
• The second group known not to have mastered it.

Test items may be pilot studied to evaluate whether they should be included
in the final form of the instrument.

© McGraw Hill 5
What is a pilot study?

• Pilot work refers to preliminary research conducted during the creation of a test prototype.

• During pilot work, the test developer aims to determine the most effective way to measure the targeted construct.

• Pilot work is essential when constructing tests or measuring instruments for publication and widespread distribution.

© McGraw Hill 6
Test Construction

Scaling: The process of setting rules for assigning numbers in measurement.

L. L. Thurstone was very influential in the development of sound scaling methods.
• He adapted psychophysical scaling methods to the study of psychological variables such as attitudes and values.
• He is widely considered to be one of the primary architects of modern factor analysis.

© McGraw Hill 7
Types of scales

• Age-based scale
• Grade-based scale
• Stanine scale
• Unidimensional & multidimensional scales
• Comparative & categorical scales

• There is no single best type of scale!

© McGraw Hill 8
Scaling Methods

• Numbers can be assigned to responses to calculate test scores using a number of methods.

• Rating scales: A grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the test taker.
  • If the final test score is obtained by summing the ratings across all the items, it is termed a summative scale.
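As a minimal sketch of summative scoring (the item names, ratings, and reverse-keyed set below are hypothetical, not from the slides), ratings on Likert-type items are simply summed after flipping any reverse-worded items:

```python
# Minimal sketch of summative (Likert) scoring; item names and responses
# are hypothetical, not from the slides.

# Responses on a 1-5 agree-disagree continuum.
responses = {"item1": 4, "item2": 2, "item3": 5, "item4": 1}

# Items worded in the opposite direction are reverse-keyed before summing.
reverse_keyed = {"item2", "item4"}
SCALE_MAX = 5  # highest possible rating

def summative_score(responses, reverse_keyed, scale_max=SCALE_MAX):
    """Sum ratings across items, flipping reverse-keyed items."""
    total = 0
    for item, rating in responses.items():
        if item in reverse_keyed:
            rating = (scale_max + 1) - rating  # e.g., 1 -> 5, 5 -> 1
        total += rating
    return total

print(summative_score(responses, reverse_keyed))  # 4 + 4 + 5 + 5 = 18
```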

© McGraw Hill 9
© McGraw Hill 10
Scaling Methods

Likert scale: Each item presents the test taker with five alternative responses (sometimes seven), usually on an agree–disagree or approve–disapprove continuum.
• The use of rating scales of any type results in ordinal-level data.

© McGraw Hill 11
An example of a Likert scale:


© McGraw Hill 12
Method of paired comparisons: Test takers must choose between two
alternatives according to some rule.
• Example: Select the behavior that you think would be more justified:

• Cheating on taxes if one has a chance.
• Accepting a bribe in the course of one’s duties.
• For each pair of options, test takers receive a higher score for
selecting the option deemed more justifiable by the majority of a group
of judges.
• The test score would reflect the number of times the choices of a test
taker agreed with those of the judges.
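A minimal sketch of how paired-comparison responses might be scored against a judges’ key; the pair labels and keyed answers below are hypothetical:

```python
# Minimal sketch of scoring the method of paired comparisons; the pairs
# and the judges' key are hypothetical.

# For each pair, the option a majority of judges deemed more justified.
judge_key = {("cheat_taxes", "accept_bribe"): "accept_bribe",
             ("cheat_taxes", "keep_wallet"): "keep_wallet"}

# The test taker's choice for each pair.
choices = {("cheat_taxes", "accept_bribe"): "accept_bribe",
           ("cheat_taxes", "keep_wallet"): "cheat_taxes"}

# Score = number of pairs on which the test taker agreed with the judges.
score = sum(1 for pair, keyed in judge_key.items() if choices.get(pair) == keyed)
print(score)  # 1
```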

© McGraw Hill 13
Comparative scaling: Entails judgments of a stimulus in comparison
with every other stimulus on the scale.

© McGraw Hill 14
Categorical scaling: Stimuli are placed into one of two or more
alternative categories that differ quantitatively with respect to some
continuum.

Guttman scale: Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured.
• Respondents who agree with the stronger statements of the attitude
will also agree with milder statements.
• The method of equal-appearing intervals can be used to obtain data
that are supposed to be interval in nature.

© McGraw Hill 15
If the following scale were a perfect Guttman scale, then all respondents who agree with “a” (the most extreme position) should also agree with “b,” “c,” and “d.”
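A minimal sketch of checking response patterns for consistency with a perfect Guttman scale, assuming items are listed from mildest to most extreme; the patterns below are hypothetical:

```python
# Minimal sketch: checking response patterns against a perfect Guttman scale.
# Items are assumed ordered from mildest (left) to most extreme (right).
# 1 = agree, 0 = disagree; the patterns are hypothetical.

patterns = [
    [1, 1, 1, 1],  # agrees with everything, including the extreme item
    [1, 1, 0, 0],  # agrees only with the two mildest items
    [0, 1, 0, 1],  # agrees with a stronger item but not a milder one
]

def is_guttman_consistent(pattern):
    """True if every endorsed item is preceded only by endorsed (milder) items."""
    seen_zero = False
    for response in pattern:
        if response == 0:
            seen_zero = True
        elif seen_zero:      # a 1 after a 0 breaks the cumulative ordering
            return False
    return True

for p in patterns:
    print(p, is_guttman_consistent(p))  # True, True, False
```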

© McGraw Hill 16
Writing Items

• Item pool: The reservoir or well from which items will or will not be
drawn for the final version of the test.
• Comprehensive sampling provides a basis for content validity of the
final version of the test.
• When devising a standardized test using a multiple-choice format, it is
usually advisable that the first draft contain approximately twice the
number of items that the final version of the test will contain.
• How does one develop items for the item pool?
• The test developer may write a large number of items from
personal experience or academic acquaintance with the subject
matter.
• Help may also be sought from others, including experts.
• Searches through the academic research literature may prove
fruitful, as may searches through other databases.

© McGraw Hill 17
Writing Items

• Item format: Includes variables such as the form, plan, structure, arrangement, and layout of individual test items.
• Selected-response format: Items require test takers to select a response from a set of alternative responses.
• Constructed-response format: Items require test takers to supply or to create the correct answer, not merely to select it.

© McGraw Hill 18
Selected-response format:

1. Multiple-choice
2. Matching
3. True-false

© McGraw Hill 19
Multiple-choice format has three elements:
(1) a stem
(2) a correct alternative or option
(3) several incorrect alternatives or options variously referred to as
distractors or foils.

© McGraw Hill 20
Other commonly used selected-response formats include matching and true–false items.
❖ In a matching item, the test taker is presented with two columns: premises on the left and responses on the right. The test taker’s task is to determine which response is best associated with which premise.
❖ A format that contains only two possible responses is called a binary-choice item. Perhaps the most familiar binary-choice item is the true–false item.

© McGraw Hill 21
Constructed-response format: Items require test takers to supply or to create the correct answer, not merely to select it.
1. Completion item
2. Short answer
3. Essay
• A completion item requires the examinee to provide a word or phrase that completes a sentence, as in the following example:
The standard deviation of a data set indicates the spread of scores from the _________
❖ A good completion item should be worded so that the correct answer is specific.
❖ A completion item may also be referred to as a short-answer item.
❖ There are no hard-and-fast rules for how short an answer must be to be considered a short answer.
❖ Beyond a paragraph or two, however, the item is more properly referred to as an essay item, which requires writing a composition, typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation.

© McGraw Hill 22
Writing items for computer administration.
• Item bank: A relatively large and easily accessible collection of test questions.
• Computerized adaptive testing (CAT): An interactive, computer-administered test-taking process wherein items presented to the test taker are based in part on the test taker’s performance on previous items.
  • CAT provides economy in testing time and number of items presented.
  • CAT tends to reduce floor effects and ceiling effects.
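A minimal sketch of the adaptive idea behind CAT: step up to harder items after a correct answer and down to easier items after an error. The item bank, answer rule, and five-level difficulty scale below are hypothetical; operational CAT programs typically select items using item response theory rather than this simple rule:

```python
# Minimal sketch of the adaptive idea behind CAT: harder items follow correct
# answers, easier items follow errors. The item bank and step rule are
# hypothetical; operational CATs usually select items with IRT.
import random

# Hypothetical item bank: difficulty level -> item ids at that level.
item_bank = {1: ["a1", "a2"], 2: ["b1", "b2"], 3: ["c1", "c2"],
             4: ["d1", "d2"], 5: ["e1", "e2"]}

def run_cat(answer_is_correct, n_items=5, start_level=3):
    """Administer n_items, adapting difficulty to the test taker's answers."""
    level, administered = start_level, []
    for _ in range(n_items):
        item = random.choice(item_bank[level])  # items may repeat in this toy version
        administered.append((item, level))
        if answer_is_correct(item):
            level = min(5, level + 1)           # correct -> try a harder item next
        else:
            level = max(1, level - 1)           # incorrect -> back off to an easier item
    return administered

# Example: a strong test taker who answers everything correctly climbs toward
# level 5, so a fixed, too-easy item set (a ceiling) is avoided.
print(run_cat(lambda item: True))
```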

© McGraw Hill 23
Ceiling or floor effects occur when a test or scale is so easy or so difficult that a substantial proportion of individuals obtain the maximum or minimum score, so the true extent of their abilities cannot be determined.
Ceiling effect
❖ Occurs when scores on a variable reach a level that cannot be exceeded.
Floor effect
❖ A value below which a response cannot be made.

© McGraw Hill 24
Scoring Items

In a cumulatively scored test, it is assumed that the higher the score on the test, the higher the test taker is on the ability, trait, or other characteristic that the test purports to measure.

Class scoring: Responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is presumably similar in some way.
• Example: Diagnostic testing.

© McGraw Hill 25
Scoring Items

Ipsative scoring: Comparing a test taker’s score on one scale within a test to another scale within that same test.

• In the case of an ipsative scale, an employee’s strengths in different areas may be assessed by assigning points. For instance, the supervisor may initially assign 20 points for communication, 30 points for timeliness, and 50 points for work quality. However, after a few months, the supervisor may assign different point values, such as 30 for communication, 30 for timeliness, and 40 for work quality. Despite the changes, the total number of points allocated remains the same (100).
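A minimal sketch of the ipsative logic in that example: each area is read as a share of the same person’s fixed point total, so scales are compared within the individual rather than across people (the dictionaries mirror the slide’s 100-point example):

```python
# Minimal sketch of an ipsative comparison: each scale is interpreted relative
# to the same person's other scales, not relative to other people.
# The ratings below mirror the slide's 100-point example.

time1 = {"communication": 20, "timeliness": 30, "work_quality": 50}
time2 = {"communication": 30, "timeliness": 30, "work_quality": 40}

def ipsative_profile(scores):
    """Express each scale as a share of the person's own (fixed) total."""
    total = sum(scores.values())
    return {scale: round(points / total, 2) for scale, points in scores.items()}

print(ipsative_profile(time1))  # {'communication': 0.2, 'timeliness': 0.3, 'work_quality': 0.5}
print(ipsative_profile(time2))  # the total is still 100; only the relative mix changes
```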

© McGraw Hill 26
Test Tryout

• A test should be tried out on a sample similar to the population for which it was designed.
• Use 5–10 subjects per item.
• The tryout should be administered in the same manner, and with the same instructions, as the final product.

© McGraw Hill 27
What is a good item?

• A good item is reliable and valid.
• A good item helps discriminate among test takers.
• A good item is answered correctly by high scorers on the test as a whole.

© McGraw Hill 28
Item Analysis

The nature of item analysis will vary depending on the goals of the test
developer.

Among the tools test developers might employ to analyze and select
items are:

• An index of the item’s difficulty.


• An index of the item’s reliability.
• An index of the item’s validity.
• An index of item discrimination.

© McGraw Hill 29
• Item-difficulty index: The proportion of respondents answering an item correctly.
• An index of an item’s difficulty (denoted p) is obtained by calculating the proportion of the total number of test takers who answered the item correctly.
  • If 50 of 100 examinees answered the item correctly, p = .50; if 75 of 100 examinees answered the item correctly, p = .75.
• Note that the larger the item-difficulty index, the easier the item.
• For maximum discrimination among the abilities of the test takers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8.
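A minimal sketch of computing p for each item from a scored response matrix (the 0/1 data below are hypothetical):

```python
# Minimal sketch of the item-difficulty index p: the proportion of test takers
# answering each item correctly. The response matrix is hypothetical.

# Rows = test takers, columns = items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

def item_difficulty(responses):
    n_takers = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_takers for j in range(n_items)]

print(item_difficulty(responses))  # [0.75, 0.75, 0.25]; a larger p means an easier item
```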

© McGraw Hill 30
Item-reliability index: An indication of the internal consistency of the scale.
❖ Inter-item consistency and factor analysis provide evidence for internal consistency.
❖ Factor analysis can also provide an indication of whether items that are supposed to be measuring the same thing load on a common factor.
(Figure: example factor structures showing one factor, three factors, and nine factors in which all items are independent.)
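The slide mentions inter-item consistency; one common way to summarize it is Cronbach’s coefficient alpha, shown in a minimal sketch below. The rating data are hypothetical, and alpha is only one of several internal-consistency estimates:

```python
# Minimal sketch of one common inter-item consistency summary,
# Cronbach's coefficient alpha; the response matrix is hypothetical.
from statistics import pvariance

# Rows = test takers, columns = items (e.g., 1-5 ratings).
data = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]

def cronbach_alpha(data):
    k = len(data[0])                                   # number of items
    item_vars = [pvariance([row[j] for row in data]) for j in range(k)]
    total_var = pvariance([sum(row) for row in data])  # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

print(round(cronbach_alpha(data), 2))  # 0.96 for this toy data set
```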

© McGraw Hill 31
Item-validity index: Allows test developers to evaluate the validity of
items in relation to a criterion measure.
The higher the item-validity index, the greater the test’s criterion-related
validity.
Item-discrimination index (d): Indicates how adequately an item
separates or discriminates between high scorers and low scorers on an
entire test.
• A measure of the difference between the proportion of high scorers
answering an item correctly and the proportion of low scorers
answering the item correctly.
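A minimal sketch of computing the item-discrimination index d from upper- and lower-scoring groups; the groups below (and the common upper/lower 27% convention mentioned in the comments) are illustrative, not taken from the slides:

```python
# Minimal sketch of the item-discrimination index d: the difference between the
# proportion of high scorers and the proportion of low scorers answering an
# item correctly. The groups below are hypothetical.

# 1 = answered the item correctly, 0 = incorrect.
high_scorers = [1, 1, 1, 0, 1]   # e.g., upper 27% on the total test
low_scorers  = [1, 0, 0, 0, 1]   # e.g., lower 27% on the total test

def discrimination_index(high, low):
    p_high = sum(high) / len(high)
    p_low = sum(low) / len(low)
    return p_high - p_low

print(round(discrimination_index(high_scorers, low_scorers), 2))  # 0.8 - 0.4 = 0.4
```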

© McGraw Hill 32
Analysis of item alternatives: The quality of each alternative within a
multiple-choice item can be readily assessed with reference to the
comparative performance of upper and lower scorers.

© McGraw Hill 33
Item characteristic curves (ICC): A graphic representation of item
difficulty and discrimination.

• On Item A, test takers of low ability tend to do better.
• On Item B, test takers of moderate ability tend to do better.

Ghiselli et al. (1981).


© McGraw Hill 34
• On Item C, test takers of high ability tend to do better; it is a good item.
• Item D shows a high level of discrimination; it might be good if cut scores are being used.

Ghiselli et al. (1981).
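The slides treat ICCs graphically; one common parametric form (not given on the slides) is the two-parameter logistic model from item response theory, in which a difficulty parameter shifts the curve along the ability axis and a discrimination parameter controls its steepness. A minimal sketch with illustrative parameter values:

```python
# Minimal sketch of one common parametric item characteristic curve, the
# two-parameter logistic (2PL) model. The slides describe ICCs only
# graphically, so this form and the parameter values are illustrative.
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response at ability theta.
    a = discrimination (slope), b = difficulty (location)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A steep (highly discriminating) item versus a flatter one, both with difficulty 0.
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc_2pl(theta, a=2.0, b=0.0), 2), round(icc_2pl(theta, a=0.5, b=0.0), 2))
```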

© McGraw Hill 35
Other Considerations in Item Analysis
Guessing: Test developers and users must decide whether they wish to
correct for guessing, but to date, there has not been any satisfactory
solution to the problem of guessing.
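One traditional correction that is sometimes applied (and, as the slide notes, is not a fully satisfactory solution) subtracts a penalty for wrong answers based on the number of alternatives per item; a minimal sketch with made-up numbers:

```python
# Minimal sketch of one traditional correction for guessing (not endorsed as a
# solution on the slide): corrected score = rights - wrongs / (k - 1),
# where k is the number of alternatives per multiple-choice item.

def corrected_score(rights, wrongs, k):
    return rights - wrongs / (k - 1)

print(corrected_score(rights=40, wrongs=12, k=4))  # 40 - 12/3 = 36.0
```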

Item fairness: The degree, if any, to which a test item is biased.
• A biased test item is an item that favors one particular group of examinees in relation to another when differences in group ability are controlled.

Speed tests: Item analyses of tests taken under speed conditions yield misleading or uninterpretable results.
• The closer an item is to the end of the test, the more difficult it may appear to be.
• In a similar vein, measures of item discrimination may be artificially high for late-appearing items. This is so because test takers who know the material better may work faster and are thus more likely to reach and answer the later items.
© McGraw Hill 36
Qualitative Item Analysis

• Qualitative methods: Techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
• Qualitative item analysis is a general term for various nonstatistical procedures designed to explore how individual test items work.
  • Think-aloud test administration: Respondents are asked to verbalize their thoughts as they occur during testing.
  • Expert panels: Experts may be employed to conduct a qualitative item analysis.
  • Sensitivity review: Items are examined for fairness to all prospective test takers and for offensive language, stereotypes, or situations.

© McGraw Hill 37
Test Revision

Revision as a stage in new test development.
• Items are evaluated based on their strengths and weaknesses; some items may be eliminated.
• Some items may be replaced by others from the item pool.
• Revised tests will then be administered under standardized conditions to a second sample.
• Once a test has been finalized, norms may be developed from the data, and the test is said to be standardized.

© McGraw Hill 38
Revision in the life cycle of a test
• Existing tests may be revised if the stimulus material or verbal
material is dated, some outdated words are not readily understood or
are offensive, norms no longer represent the population, psychometric
properties could be improved, or the underlying theory behind the test
has changed.
• In test revision, the same steps are followed as with new tests (i.e.,
test conceptualization, construction, item analysis, tryout, and
revision).

© McGraw Hill 39
Revision in the Life Cycle of a Test

Cross-validation and co-validation.
• Cross-validation refers to the revalidation of a test on a sample of test takers other than those on whom test performance was originally found to be a valid predictor of some criterion.
  • Item validities inevitably become smaller when the test is administered to a second sample; this is known as validity shrinkage.
• Co-validation: A test validation process conducted on two or more tests using the same sample of test takers.
  • Co-validation is economical for test developers.

© McGraw Hill 40
ANY QUESTIONS?

© McGraw Hill 41
