Test Development

TEST DEVELOPMENT
TEST DEVELOPMENT
Test Development is an umbrella term for all that goes into the process of creating a test. The five stages of developing
a test are;
Test Conceptualization – an early stage of test development process wherein the idea for a particular test or test
revision is first conceived.
Test Construction – a stage in the process of test development that entails writing test items (or rewriting or otherwise
revising existing items), as well as formatting items, setting scoring rules, and otherwise designing and building a test.
Test Tryout – a stage in the process of test development that entails administering a preliminary version of a test to a
representative sample of test takers under conditions that simulate conditions under which the final version of the test
will be administered.
Item Analysis – a general term to describe various procedures, usually statistical, designed to explore how individual
test items work as compared to other items in the test and in the context of the whole test (e.g., to explore the level of
difficulty of individual items on an achievement test or the reliability of a personality test); contrast with qualitative
item analysis.
Test Revision – action taken to modify a test’s content or format for the purpose of improving the test’s effectiveness
as a tool of measurement.
TEST CONCEPTUALIZATION
Some Preliminary Questions;
 What is the test designed to measure?
 What is the objective of the test?
 Is there a need for this test?
 Who will use this test?
 Who will take this test?
 What content will the test cover?
 How till the test be administered?
 What is the ideal formant of the test?
 Should more than one form of the test be developed?
 What special training will be required of test users for administering or interpreting the test?
 What types of responses will be required of test takers?
 Who benefits from an administration of this test?
 Is there any potential harm as the result of an administration of this test?
 How will meaning be attributed to scores on this test?
Norm referenced test versus criterion referenced tests: Item development issues.
Norm referenced tests derive meaning from test scores by evaluating an individual test taker’s score and comparing it to scores of
a group of test takers on the same test.
Norm referenced tests are specifically designed to rank test takers on a “bell curve,” or a distribution of scores that resembles,
when graphed, the outline of a bell, a small percentage of students performing well, most performing average, and a small
percentage performing poorly. To produce a bell-curve each time, test questions are carefully designed to accentuate, or to stress or
emphasize, performance differences among test takers.
IQ tests are among the most well-known norm referenced test, as are developmental-screening tests, which are used to identify
learning disabilities in young children or determine eligibility for special-education services.
Criterion referenced tests derives meaning from test scores by evaluating an individual’s score with reference to a set standard, or
criterion. Also referred to as domain referenced test and content referenced test.
Criterion referenced testing and assessment are commonly employed in licensing contexts, and educational contexts, in which
mastery of a particular material must be demonstrated. The development of criterion referenced instruments derives from a
conceptualization of knowledge or skills to be mastered. A good criterion referenced test must be able to distinguish between a
group of test takers who display the minimal competence on a specified subject matter, and those group or test takers who do not
display the minimal competence required for a specific field.
Pilot work, refers to the preliminary research surrounding the creation of a
prototype test; a general objective of pilot work is to determine how best to
measure, gauge, assess, or evaluate to target constructs. This could also be
referred to as pilot study and pilot research.
TEST CONSTRUCTION
Scaling (1) in test construction, the process of setting rules for assigning
numbers in measurement; (2) the process by which a measuring device is
designed and calibrated and the way numbers (or other indices that are scale
values) are assigned to different amounts of the trait, attribute, or characteristic
measured; (3) assigning numbers in accordance with empirical properties of
objects or traits.
There is no one method of scaling. There is no best type of scale. Test

developers scale a test in a manner they believe is optimally suited to their
conception of the measurement of the trait that is being measured.
TEST CONSTRUCTION
Scaling Methods;
Rating scale is a system of ordered numerical or verbal descriptors on which judgements about the presence/absence or magnitude of a
particular trait, attitude, emotion, or other variable are indicated by raters, judges, examiners, or (when the rating scale reflects self-report) the
assessee.
Summative Scale is an index derived from the summing of selected scores on a test or subtest.
Likert Scale is a summative rating scale with five alternative responses ranging on a continuum from, for example, “strongly agree” to
“strongly disagree”.
Method of paired comparisons is a scaling method whereby one of a pair of stimuli (such as photos) is selected according to a rule (such as
“select the one that is more appealing”)
Comparative Scaling is a method of developing ordinal scales through the use of a sorting task that entails judging a stimulus in
comparison with every other stimulus used on the test.
Categorical Scaling is a system of scaling in which stimuli are placed into one of two or more alternative categories that differ
quantitatively with respect to some continuum.
Guttman Scale is a scale wherein items range sequentially from weaker to stronger expressions of the attitude or belief being measured.
Scalogram Analysis is an item-analysis procedure and approach to test development that entails a graphic mapping of a test taker’s
responses.
TEST CONSTRUCTION
Writing items
The test developer or item writer faces three questions to the test blueprint;
 What range of content should the items cover?
 Which of the many different types of item formats should be employed?
 How many items should be written in total and for each content area covered?
When devising a standardized test using a multiple-choice format, it is usually
advisable that the first draft contain approximately twice the number of items that the
final version of the test will contain. An item pool refers to the reservoir or well from
which items will or will not be drawn from the final version of the test; the collection
of item to be further evaluated for possible selection for use in an item bank.
TEST CONSTRUCTION
Item format refers to the form, plan, structure, arrangement, or layout of
individual test items, including whether the items require test takers to select a
response from existing alternative responses or to construct a response.
The two types of item format are the selected-response format and the
constructed-response format. Selected-response format refers to a form of
test item requiring test takers to select a response (e.g., true-false, multiple
choice, and matching items). Constructed-response format refers to a form
of test item requiring the test taker to construct or create a response, as
opposed to simply selecting a response (e.g., items on essay examinations, fill
in the blank, and short-answer tests).
TEST CONSTRUCTION
The three types of selected-response item formats are (1) multiple-choice; (2) matching items; and (3) true-false.
Multiple-choice format has three elements; (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or
options referred to as distractors or foils.
Matching item format refers to when a test taker is presented with two columns: premises and responses, and must determine
which response is best associated with which premise.
If the number of items in the two columns are the same, then a person unsure about one of the premises or response could merely
deduce it by matching all the other options first. A perfect score would then result even though the test taker did not actually know
all the answers. Providing more options that needed minimizes such a possibility. Another way to lessen the probability or chance
or guessing as a factor in the test score is to state in the directions that each response may be a correct answer once, more than once,
or not at all.
A multiple-choice item that contains only two possible responses is referred to as a binary-choice item. The most familiar binary-
choice item is the true-false item.
A true-false item is a sentence that requires the test taker to indicate whether the statement is or is not a fact.
Other varieties of binary-choice items include sentences to which the test taker responds with one of two responses, such as agree
or disagree, yes or no, right or wrong, or fact or opinion. A good binary choice contains a single idea, is not excessively long, and
is not subject to debate. Binary-choice items cannot contain distractor alternatives, unlike with multiple-choice items.
TEST CONSTRUCTION
Constructed response format
Completion item requires the examinee to provide a word or phrase that
completes a sentence. Completion items may also be referred to as a short-
answer item.
Essay item is a test item that requires the test taker to respond to a question
by writing a composition, typically one that demonstrates recall of facts,
understanding, analysis, and/or interpretation.
TEST CONSTRUCTION
Writing items for computerized administration. A number of computer programs are
designed to facilitate the construction, administration, scoring, and interpretation of
tests. These programs make use of two advantages of digital media; the ability to store
items in an item bank and the ability to individualize testing through a technique called
item branching.
Item bank refers to a collection of questions to be used in the construction of tests.
Computerized adaptive testing (CAT) refers to an interactive, computer-administered
test-taking process wherein items presented to the test taker are based in part on the test
taker’s performance on previous items.
Using CAT, the test administered may be different for each test taker. CAT also tends to
reduce floor effects and ceiling effects.
TEST CONSTRUCTION
Floor effect refers to a phenomenon arising from the diminished utility of a tool of assessment
in distinguishing test takers at the low end of the ability, trait, or other attribute being measured.
Ceiling effect refers to the diminishing utility of an assessment tool for distinguishing test
takers at the high end of the ability, trait, or other attribute being measured.
Item branching in the context of computerized adaptive testing, refers to the individualized
presentation of test items drawn from an item bank based on the test taker’s previous responses.
A computer that has stored a bank of achievement test items of different difficulty levels can be
programed to present items according to an algorithm or rule. For example, one rule might be
“don’t present and item of the next difficulty until two consecutive items of the current difficulty
level are answered correctly.” Another rule might be to “terminate the test when five consecutive
items of a given level of difficulty have been answered incorrectly.”
TEST CONSTRUCTION
Scoring items
Class scoring or category scoring is a method of evaluation in which test responses earn credit toward placement in a
particular class or category with other test takers. Sometimes test takers must meet a set number of responses
corresponding to a particular criterion in order to be placed in a specific category or class
Cumulative scoring is a method of scoring whereby points or scored accumulated on individual items or subtests are
tallied and then, the higher the total sum, the higher the individual is presumed to be on the ability, trait, or other
characteristic being measured.
Ipsative scoring is an approach to test scoring and interpretation wherein the test taker’s response and the presumed
strength of a measured trait are interpreted relative to the measured strength of other traits for that test taker.
When it comes to an Ipsative scoring test, it would be possible to draw only intra-individual conclusions about the test
taker. It would not be appropriate to draw inter-individual comparisons on the basis of an ipsatively scored test. It
would be inappropriate to compare two different test takers with each other.
Once the test developer has decided on a scoring model and has done everything else necessary to prepare the first draft
of the test administration, the next step is test tryout.
TEST TRYOUT
The test should be tried out on people who are similar in critical respects to the people for whom the
test was designed. An informal rule of thumb is that there should be no fewer than 5 subjects and
preferably as many as 10 for each item on the test.
What is a Good Item?
Item analysis is a general term to describe various procedures, usually statistical, designed to explore
how individual test items work as compared to other items in the test and in the context of the whole
test (e.g., to explore the level of difficulty of individual items on an achievement test or the reliability of
a personality test).
Qualitative item analysis is a general term for various non-statistical procedures designed to explore
how individual test items work, both compared to other items in the test, and in the context of the whole
test; in contrast to statistically based procedures, qualitative methods involve exploration of issues by
verbal means such as interviews and group discussions conducted with test takers and other relevant
parties.
TEST TRYOUT
Item analysis
Among the tools test developers might employ to analyze and select items are;
 An index of the item’s difficulty
 An index of the item’s reliability
 An index of the item’s validity
 An index of item discrimination
Item-difficulty index, in the context of achievement or ability testing and other contexts in which responses are keyed correct, a statistic
indicating how many test takers responded correctly to an item.
Item-endorsement index, in the context of personality assessment and other contexts in which the nature of the test is such that responses
are not keyed correct or incorrect, a statistic indicating how many test takers responded to an item in a particular direction.
Item-reliability index is a statistic designed to provide an indication of a test\s internal consistency; the higher the item-reliability index,
the greater the test’s internal consistency.
Item-validity index is a statistic indicating the degree to which a test measures what it purports to measure; the higher the item-validity
index, the greater the test’s criterion-related validity.
Item-discrimination index is a statistic designed to indicate how adequately a test item discriminates between high and low scorers.
TEST TRYOUT
Item-characteristic curves
Item-characteristic curve (ICC) is a graphic representation of the probabilistic relationship between a person’s level on a
trait (or ability or other characteristic being measured) and the probability for responding to an item in a predicted way; also
known as a category response curve, and item response curve, or an item trace line.
Other considerations in Item Analysis
Guessing. Methods designed to detect guessing, minimize the effect of guessing, and statistically correct for guessing have
been proposed, but no such method has achieved universal acceptance. The following three criteria that any correction for
guessing must meet, as well as the other interaction issues must be addressed:
1. A correction for guessing must recognize that, when a respondent guesses an answer on an achievement test, the guess
is not typically made on a totally random basis. It is more reasonable to assume that the test taker’s guess is based on
some knowledge of the subject matter and the ability to rule out one or more of the distractor alternatives.
2. A correction for guessing must also deal with the problem of omitted items. Sometimes, instead of guessing, the test
taker will simply omit a response to an item.
3. Some test takers may be luckier than others in guessing the choices that are keyed correct. Any correction for guessing
may seriously underestimate or overestimate the effects of guessing for lucky and unlucky test takers.
TEST TRYOUT
Item fairness is a reference to the degree of bias, if any, in a test item
Biased test item is a test item that favors one particular group of examinees
in relation to another when differences in group ability are controlled.
Speed test is a test, usually of achievement or ability, with a time limit: speed
tests usually contain items of uniform difficulty level. Item analyses of test
taken under speed conditions yield misleading or uninterpretable results. The
closer an item to the end of the test, the more difficult it may appear to be
because test takers simply may not get to items near the end of the test before
the time runs out.
TEST TRYOUT
Qualitative methods are techniques of data generation and analysis that rely primarily on verbal rather than
mathematical or statistical procedures.
Qualitative item analysis is a general term for various non-statistical procedures designed to explore how
individual test items work, both compared to other items in the test, and in the context of the whole test; in
contrast to statistically based procedures, qualitative methods involve exploration of issues by verbal means such
as interviews and group discussions conducted with test takers and other relevant parties.
“think aloud” test administration refers to a method of qualitative item analysis requiring examinees to
verbalize their thoughts as they take a test: useful in understanding how individual items function in a test and how
test takers interpret or misinterpret the meaning of individual items.
Expert panels in the context of test development process, is a group of people knowledgeable about the subject
matter being tested and/or the population for whom the test was designed who can provide input to improve the
test’s content, fairness and other related ways.
Sensitivity review is a study of test items, usually during test development, in which items are examined for
fairness to all prospective test takers and for the presence of offensive language, stereotypes, or situations.
TEST REVISION
Test revision
Much of the discussion of test revision in the development of a band-new test
may also apply to the development of subsequent editions of existing tests,
depending on just how “revised” the revision is.
Test revision as a Stage in New Test Development
A tremendous amount of information is generated at the item-analysis stage,
particularly given that a developing test may have hundreds of items. On the
basis of that information, some items from the original item pool will be
eliminated and others will be rewritten.
TEST REVISION
Test Revision in the Life Cycle of an Existing Test
There comes a time in the life of most test when the test will be revised in
some way or its publication will be discontinued. No hard-and-fast rule exists
for when to revise a test. The American Psychological Association offered the
general suggestions that an existing test be kept in its present form as long as
it remains useful but that it should be revised when new research data,
significant changes in the domain represented, or newly recommended
conditions of test use may reduce the validity of the test score interpretation.
TEST REVISION
Practically speaking, many tests are deemed to be due for revision when any of the
following conditions exist.
1. The stimulus materials look dated and current testtakers cannot relate to them.
2. The verbal content of the test, including the administration instructions and the test
items, contains dated vocabulary that is not readily understood by current testtakers.
3. As popular culture changes and words take on new meanings, certain words or expressions in the test items or directions may
be perceived as inappropriate or even offensive
to a particular group and must therefore be changed.
4. The test norms are no longer adequate as a result of group membership changes in the
population of potential testtakers.
5. The test norms are no longer adequate as a result of age-related shifts in the abilities
measured over time, and so an age extension of the norms (upward, downward, or in
both directions) is necessary.
6. The reliability or the validity of the test, as well as the effectiveness of individual test
items, can be significantly improved by a revision.
7. The theory on which the test was originally based has been improved significantly, and
these changes should be reflected in the design and content of the test.
TEST REVISION
Cross-validation is a revalidation on a sample of test takers other than the test takers on whom
test performance was originally found to be a valid predictor of some criterion
Validity shrinkage refers to the decrease in item validities that occurs after cross-validation
Co-validation the test validation process conducted on two or more tests using the same sample
of test takers; when used in conjunction with the creation of norms or the revision of existing
norms, this process may also be referred to as co-norming
Quality assurance during test revision
Anchor protocol is a test protocol scored by a highly authoritative scorer that is designed as a
model for scoring and a mechanism for resolving scoring discrepancies
Scoring drift is a discrepancy between the scoring in an anchor protocol and the scoring of
another protocol
TEST REVISION
Determining measurement equivalence across test taker populations
Item response theory (IRT) also referred to as latent-trait theory or the latent trait model, is a system of
assumptions about measurement (including the assumption that a trait being measured by a test is
unidimensional) and the extent to which each item measured that trait.
Differential item functioning (DIF) in item response theory IRT, it is a phenomenon in which the same
test item yield one result for members of one group and a different result for members of another group,
presumably as a result of group differences that are not associated with group differences in the construct
being measured.
DIF analysis in IRT, a process of group-by-group analysis of item response curves for the purpose of
evaluating measurement instrument or item equivalence across different groups of test takers.
DIF items in IRT, test items that respondents from different groups, who are presumably at the same level
of the underlying construct being measured, have different probabilities of endorsing as a function of their
group membership.
CREATING AND VALIDATING A TEST
OF ASEXUALITY
Human Asexuality may be defined as an absence of sexual attraction to anyone at
all.
A test designed to screen for Human asexuality, the Asexuality Identification Scale
(AIS) was developed. The AIS is a 12-item, sex- and gender- neutral, self-report
measure of asexuality. The AIS was developed in a series of stages.
Stage 1 included development and administration of eight open-ended questions to
sexual (n = 70) and asexual (n = 139) individuals. Subjects responded in writing to a
series of questions focused on definitions of asexuality, sexual attraction, sexual
desire, and romantic attraction. Participant responses were examined to identify
prevalent themes, and this information was used to generate 111 multiple-choice
items.
OF ASEXUALITY
In Stage 2, these 111 items were administered to another group of asexual (n = 165)
and sexual (n = 752) participants. The resulting data were then factor- and item-
analyzed in order to determine which items should be retained. The decision to retain
an item was made on the basis of our judgment as to which items best differentiated
asexual from sexual participants. Thirty-seven items were selected based on the
results of this item selection process.
In Stage 3, these 37 items were administered to another group of asexual (n = 316)
and sexual (n = 926) participants. As in Stage 2, the items were analyzed for the
purpose of selecting those items that best loaded on the asexual versus the sexual
factors. Of the 37 original items subjected to item analysis, 12 items were retained,
and 25 were discarded.
OF ASEXUALITY
In order to determine construct validity, psychometric validation on the 12-

item AIS was conducted using data from the same participants in Stage 3.
Known-groups validity was established as the AIS total score showed
excellent ability to distinguish between asexual and sexual subjects.
Specifically, a cutoff score of 40/60 was found to identify 93% of self-
identified asexual individuals, while excluding 95% of sexual individuals.
In order to assess whether the measure was useful over and above already-
available measures of sexual orientation, we compared the AIS to an
adaptation of a previously established measure of sexual orientation.
OF ASEXUALITY
Incremental validity was established, as the AIS showed only moderate

correlations with the Klein Scale, suggesting that the AIS is a better predictor
of asexuality compared to an existing measure. To determine whether the AIS
correlates with a construct that is thought to be highly related to asexuality (or,
lack of sexual desire), convergent validity was assessed by correlating total
AIS scores with scores on the Sexual Desire Inventory. As we expected, the
AIS correlated only weakly with Solitary Desire subscale of the SDI, while
the Dyadic Desire subscale of the SDI had a moderate negative correlation
with the AIS.
OF ASEXUALITY
Finally, we conducted discriminant validity analyses by comparing the AIS

with the Childhood Trauma Questionnaire, the Short-Form Inventory of
Interpersonal Problems-Circumplex scales, and the Big-Five Inventory in
order to determine whether the AIS was actually tapping into negative sexual
experiences or personality traits. Discriminant validity was established, as the
AIS was not significantly correlated with scores on the CTQ, IIP-SC, or the
BFI.
OF ASEXUALITY
Sexual and asexual participants significantly differed in their AIS total scores
with a large effect size. Further, the AIS passed tests of known-groups,
incremental, convergent, and discriminant validity. This suggests that the AIS
is a useful tool for identifying asexuality, and could be used in future research
to identify individuals with a lack of sexual attraction.

Test Development

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Test Development

Uploaded by

Copyright:

Available Formats

TEST DEVELOPMENT

There is no one method of scaling. There is no best type of scale. Test

In order to determine construct validity, psychometric validation on the 12-

Incremental validity was established, as the AIS showed only moderate

Finally, we conducted discriminant validity analyses by comparing the AIS

You might also like