TEST DEVELOPMENT
Test Development is an umbrella term for all that goes into the process of creating a test. The five stages of developing
a test are:
Test Conceptualization – an early stage of the test development process wherein the idea for a particular test or test
revision is first conceived.
Test Construction – a stage in the process of test development that entails writing test items (or rewriting or otherwise
revising existing items), as well as formatting items, setting scoring rules, and otherwise designing and building a test.
Test Tryout – a stage in the process of test development that entails administering a preliminary version of a test to a
representative sample of test takers under conditions that simulate conditions under which the final version of the test
will be administered.
Item Analysis – a general term to describe various procedures, usually statistical, designed to explore how individual
test items work as compared to other items in the test and in the context of the whole test (e.g., to explore the level of
difficulty of individual items on an achievement test or the reliability of a personality test); contrast with qualitative
item analysis.
Test Revision – action taken to modify a test’s content or format for the purpose of improving the test’s effectiveness
as a tool of measurement.
TEST CONCEPTUALIZATION
Some Preliminary Questions:
What is the test designed to measure?
What is the objective of the test?
Is there a need for this test?
Who will use this test?
Who will take this test?
What content will the test cover?
How will the test be administered?
What is the ideal format of the test?
Should more than one form of the test be developed?
What special training will be required of test users for administering or interpreting the test?
What types of responses will be required of test takers?
Who benefits from an administration of this test?
Is there any potential harm as the result of an administration of this test?
How will meaning be attributed to scores on this test?
Norm-referenced versus criterion-referenced tests: item development issues.
Norm-referenced tests derive meaning from test scores by comparing an individual test taker's score to the scores of
a group of test takers on the same test.
Norm-referenced tests are specifically designed to rank test takers on a "bell curve," a distribution of scores that, when
graphed, resembles the outline of a bell: a small percentage of test takers performing well, most performing in the
average range, and a small percentage performing poorly. To produce a bell curve each time, test questions are carefully
designed to accentuate (stress or emphasize) performance differences among test takers.
IQ tests are among the most well-known norm-referenced tests, as are developmental screening tests, which are used to
identify learning disabilities in young children or to determine eligibility for special-education services.
Criterion-referenced tests derive meaning from test scores by evaluating an individual's score with reference to a set
standard, or criterion; they are also referred to as domain-referenced and content-referenced tests.
Criterion-referenced testing and assessment are commonly employed in licensing and educational contexts in which
mastery of particular material must be demonstrated. The development of criterion-referenced instruments derives from
a conceptualization of the knowledge or skills to be mastered. A good criterion-referenced test must be able to
distinguish between test takers who display minimal competence in the specified subject matter and those who do not.
Pilot work refers to the preliminary research surrounding the creation of a
prototype test; a general objective of pilot work is to determine how best to
measure, gauge, assess, or evaluate the targeted construct. Pilot work may
also be referred to as a pilot study or pilot research.
TEST CONSTRUCTION
Scaling is (1) in test construction, the process of setting rules for assigning
numbers in measurement; (2) the process by which a measuring device is
designed and calibrated and by which numbers (or other indices that serve as
scale values) are assigned to different amounts of the trait, attribute, or
characteristic being measured; and (3) the assigning of numbers in accordance
with the empirical properties of objects or traits.
Item-difficulty index, in the context of achievement or ability testing and other contexts in which responses are keyed correct, is a
statistic indicating the proportion of test takers who responded correctly to an item.
Item-endorsement index, in the context of personality assessment and other contexts in which responses are not keyed correct or
incorrect, is a statistic indicating the proportion of test takers who responded to an item in a particular direction.
Item-reliability index is a statistic designed to provide an indication of a test's internal consistency; the higher the item-reliability index,
the greater the test's internal consistency.
Item-validity index is a statistic indicating the degree to which a test measures what it purports to measure; the higher the item-validity
index, the greater the test’s criterion-related validity.
Item-discrimination index is a statistic designed to indicate how adequately a test item discriminates between high and low scorers.
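The sketch below shows, on simulated data, one standard way these three indices are computed: the extreme-groups difference for the discrimination index, and the item standard deviation multiplied by the item-total (or item-criterion) correlation for the reliability and validity indices. The data and names are illustrative, not from any particular test:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 10
ability = rng.normal(size=n)
# Simulated right/wrong responses whose odds improve with ability.
responses = (rng.random((n, k)) <
             1 / (1 + np.exp(-(ability[:, None] - rng.normal(size=k))))).astype(int)
total = responses.sum(axis=1)
criterion = ability + rng.normal(scale=0.5, size=n)  # external criterion

# Item-discrimination index d: p(upper 27%) - p(lower 27%) on total score.
cut = int(0.27 * n)
order = np.argsort(total)
lower, upper = order[:cut], order[-cut:]
d = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)

# Item-reliability index: item SD times item-total correlation.
# Item-validity index: item SD times item-criterion correlation.
s = responses.std(axis=0)
r_it = np.array([np.corrcoef(responses[:, j], total)[0, 1] for j in range(k)])
r_ic = np.array([np.corrcoef(responses[:, j], criterion)[0, 1] for j in range(k)])
item_reliability = s * r_it
item_validity = s * r_ic

print(np.round(d, 2))
print(np.round(item_reliability, 2))
print(np.round(item_validity, 2))
```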
TEST TRYOUT
Item-characteristic curves
Item-characteristic curve (ICC) is a graphic representation of the probabilistic relationship between a person’s level on a
trait (or ability or other characteristic being measured) and the probability for responding to an item in a predicted way; also
known as a category response curve, an item response curve, or an item trace line.
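A brief sketch of a three-parameter logistic ICC, one common functional form in IRT; the parameter values below are arbitrary choices for illustration:

```python
import numpy as np

def icc(theta, a, b, c=0.0):
    """3PL item-characteristic curve: probability of a keyed response
    given trait level theta, discrimination a, difficulty b, and
    pseudo-guessing parameter c."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
# An item of moderate difficulty (b=0), good discrimination (a=1.5),
# and a 20% floor from guessing (c=0.2).
print(np.round(icc(theta, a=1.5, b=0.0, c=0.2), 2))
```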
Other considerations in Item Analysis
Guessing. Methods designed to detect guessing, minimize the effect of guessing, and statistically correct for guessing have
been proposed, but no such method has achieved universal acceptance. Any correction for guessing must meet the following
three criteria, and the related issues must also be addressed (a classic correction is sketched after the list):
1. A correction for guessing must recognize that, when a respondent guesses an answer on an achievement test, the guess
is not typically made on a totally random basis. It is more reasonable to assume that the test taker’s guess is based on
some knowledge of the subject matter and the ability to rule out one or more of the distractor alternatives.
2. A correction for guessing must also deal with the problem of omitted items. Sometimes, instead of guessing, the test
taker will simply omit a response to an item.
3. Some test takers may be luckier than others in guessing the choices that are keyed correct. Any correction for guessing
may seriously underestimate or overestimate the effects of guessing for lucky and unlucky test takers.
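The classic "formula score" correction, R − W/(k − 1) for k-option items, is one widely cited correction for guessing; the sketch below shows it, with comments noting where it collides with the criteria above (the numbers are invented):

```python
def formula_score(right, wrong, options):
    """Classic correction for guessing: R - W/(k - 1).
    Omitted items are neither rewarded nor penalized (criterion 2),
    and the formula assumes purely random guessing, which criterion 1
    reminds us is rarely true in practice."""
    return right - wrong / (options - 1)

# A test taker answers 60 of 80 four-option items correctly,
# misses 12, and omits 8.
print(formula_score(right=60, wrong=12, options=4))  # 56.0
```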
Item fairness is a reference to the degree of bias, if any, in a test item.
Biased test item is a test item that favors one particular group of examinees
in relation to another when differences in group ability are controlled.
Speed test is a test, usually of achievement or ability, with a time limit; speed
tests usually contain items of uniform difficulty level. Item analyses of tests
taken under speed conditions yield misleading or uninterpretable results: the
closer an item is to the end of the test, the more difficult it may appear to be,
because test takers simply may not reach items near the end of the test before
time runs out.
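A minimal simulation of the distortion just described: every item is equally easy, yet items near the end look hard when unreached items are scored as wrong. One common safeguard, computing difficulty only over examinees who reached the item, is also shown (data are fabricated):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 300, 20
# All items are equally easy (true p = .8), but each examinee only
# reaches a random number of items before time runs out.
reached = rng.integers(10, k + 1, size=n)           # items attempted
correct = (rng.random((n, k)) < 0.8).astype(float)
attempted = np.arange(k)[None, :] < reached[:, None]
correct[~attempted] = 0.0                           # not reached = scored wrong

naive_p = correct.mean(axis=0)                      # treats unreached as wrong
fair_p = np.nanmean(np.where(attempted, correct, np.nan), axis=0)

print(np.round(naive_p[-5:], 2))  # late items look spuriously hard
print(np.round(fair_p[-5:], 2))   # ~0.8 across the board
```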
Qualitative methods are techniques of data generation and analysis that rely primarily on verbal rather than
mathematical or statistical procedures.
Qualitative item analysis is a general term for various non-statistical procedures designed to explore how
individual test items work, both compared to other items in the test and in the context of the whole test; in
contrast to statistically based procedures, qualitative methods explore issues through verbal means such as
interviews and group discussions conducted with test takers and other relevant parties.
"Think aloud" test administration refers to a method of qualitative item analysis requiring examinees to
verbalize their thoughts as they take a test; it is useful in understanding how individual items function in a test and
how test takers interpret or misinterpret the meaning of individual items.
An expert panel, in the context of the test development process, is a group of people knowledgeable about the
subject matter being tested and/or the population for whom the test was designed, who can provide input to
improve the test's content, fairness, and other qualities.
Sensitivity review is a study of test items, usually during test development, in which items are examined for
fairness to all prospective test takers and for the presence of offensive language, stereotypes, or situations.
TEST REVISION
Much of the discussion of test revision in the development of a brand-new test
may also apply to the development of subsequent editions of existing tests,
depending on just how “revised” the revision is.
Test revision as a Stage in New Test Development
A tremendous amount of information is generated at the item-analysis stage,
particularly given that a developing test may have hundreds of items. On the
basis of that information, some items from the original item pool will be
eliminated and others will be rewritten.
Test Revision in the Life Cycle of an Existing Test
There comes a time in the life of most tests when the test will be revised in
some way or its publication will be discontinued. No hard-and-fast rule exists
for when to revise a test. The American Psychological Association offered the
general suggestions that an existing test be kept in its present form as long as
it remains useful but that it should be revised when new research data,
significant changes in the domain represented, or newly recommended
conditions of test use may reduce the validity of the test score interpretation.
Practically speaking, many tests are deemed to be due for revision when any of the
following conditions exist:
1. The stimulus materials look dated and current test takers cannot relate to them.
2. The verbal content of the test, including the administration instructions and the test items, contains dated
vocabulary that is not readily understood by current test takers.
3. As popular culture changes and words take on new meanings, certain words or expressions in the test items
or directions may be perceived as inappropriate or even offensive to a particular group and must therefore be
changed.
4. The test norms are no longer adequate as a result of group membership changes in the population of
potential test takers.
5. The test norms are no longer adequate as a result of age-related shifts in the abilities measured over time,
and so an age extension of the norms (upward, downward, or in both directions) is necessary.
6. The reliability or the validity of the test, as well as the effectiveness of individual test items, can be
significantly improved by revision.
7. The theory on which the test was originally based has been improved significantly, and these changes
should be reflected in the design and content of the test.
Cross-validation is the revalidation of a test on a sample of test takers other than those on whom
test performance was originally found to be a valid predictor of some criterion.
Validity shrinkage refers to the decrease in item validities that occurs after cross-validation.
Co-validation is the test validation process conducted on two or more tests using the same sample
of test takers; when used in conjunction with the creation of norms or the revision of existing
norms, this process may also be referred to as co-norming.
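A minimal sketch of cross-validation and the validity shrinkage it typically reveals, using simulated data and a simple least-squares predictor (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    test = rng.normal(size=n)
    criterion = 0.5 * test + rng.normal(size=n)   # true validity ~ .45
    return test, criterion

x_dev, y_dev = simulate(100)      # development sample
x_new, y_new = simulate(100)      # fresh cross-validation sample

# Fit a one-predictor regression on the development sample only.
slope, intercept = np.polyfit(x_dev, y_dev, 1)

r_dev = np.corrcoef(slope * x_dev + intercept, y_dev)[0, 1]
r_new = np.corrcoef(slope * x_new + intercept, y_new)[0, 1]
print(f"validity on development sample: {r_dev:.2f}")
print(f"validity on new sample:         {r_new:.2f}")  # usually smaller
```

Shrinkage grows with the number of predictors or items selected on the development sample; the single-predictor case shown here understates it.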
Quality assurance during test revision
Anchor protocol is a test protocol scored by a highly authoritative scorer that is designed as a
model for scoring and a mechanism for resolving scoring discrepancies.
Scoring drift is a discrepancy between the scoring in an anchor protocol and the scoring of
another protocol.
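A sketch of a quality-assurance check against an anchor protocol (the scores are invented): exact agreement flags discrepancies, and the mean signed difference shows whether, and in which direction, a scorer is drifting:

```python
import numpy as np

# Item-by-item scores assigned to the SAME protocol by the anchor
# (authoritative) scorer and by a scorer being monitored.
anchor = np.array([2, 1, 0, 2, 2, 1, 0, 2, 1, 2])
scorer = np.array([2, 1, 1, 2, 2, 2, 0, 2, 1, 2])

agreement = (anchor == scorer).mean()
drift = (scorer - anchor).mean()   # > 0: scoring too leniently

print(f"exact agreement: {agreement:.0%}")   # 80%
print(f"mean drift:      {drift:+.2f}")      # +0.20
```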
Determining measurement equivalence across test taker populations
Item response theory (IRT), also referred to as latent-trait theory or the latent-trait model, is a system of
assumptions about measurement (including the assumption that a trait being measured by a test is
unidimensional) and about the extent to which each item measures that trait.
Differential item functioning (DIF), in IRT, is a phenomenon in which the same test item yields one result
for members of one group and a different result for members of another group, presumably as a result of
group differences that are not associated with group differences in the construct being measured.
DIF analysis, in IRT, is a process of group-by-group analysis of item response curves for the purpose of
evaluating measurement instrument or item equivalence across different groups of test takers.
DIF items, in IRT, are test items that respondents from different groups, who are presumably at the same
level of the underlying construct being measured, have different probabilities of endorsing as a function of
their group membership.
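The document frames DIF in IRT terms; a closely related and widely used screen is the logistic-regression DIF procedure, sketched below on simulated data (statsmodels and scipy are assumed to be available):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 500
group = rng.integers(0, 2, n)                     # 0 = reference, 1 = focal
theta = rng.normal(size=n)                        # underlying trait
total = theta + rng.normal(scale=0.5, size=n)     # matching variable

# Build one item with uniform DIF: group 1 gets an extra advantage
# unrelated to the trait itself.
p = 1 / (1 + np.exp(-(1.2 * theta - 0.3 + 0.8 * group)))
y = (rng.random(n) < p).astype(int)

# Nested logistic models: trait only; + group (uniform DIF);
# + trait-by-group interaction (non-uniform DIF).
X1 = sm.add_constant(np.column_stack([total]))
X2 = sm.add_constant(np.column_stack([total, group]))
X3 = sm.add_constant(np.column_stack([total, group, total * group]))
ll = [sm.Logit(y, X).fit(disp=0).llf for X in (X1, X2, X3)]

lr_uniform = 2 * (ll[1] - ll[0])
lr_nonuniform = 2 * (ll[2] - ll[1])
print("uniform DIF:     LR =", round(lr_uniform, 1),
      "p =", round(chi2.sf(lr_uniform, df=1), 4))
print("non-uniform DIF: LR =", round(lr_nonuniform, 1),
      "p =", round(chi2.sf(lr_nonuniform, df=1), 4))
```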
CREATING AND VALIDATING A TEST OF ASEXUALITY
Human asexuality may be defined as an absence of sexual attraction to anyone at
all.
A test designed to screen for human asexuality, the Asexuality Identification Scale
(AIS), was developed. The AIS is a 12-item, sex- and gender-neutral, self-report
measure of asexuality. The AIS was developed in a series of stages.
Stage 1 included development and administration of eight open-ended questions to
sexual (n = 70) and asexual (n = 139) individuals. Subjects responded in writing to a
series of questions focused on definitions of asexuality, sexual attraction, sexual
desire, and romantic attraction. Participant responses were examined to identify
prevalent themes, and this information was used to generate 111 multiple-choice
items.
In Stage 2, these 111 items were administered to another group of asexual (n = 165)
and sexual (n = 752) participants. The resulting data were then factor- and item-
analyzed in order to determine which items should be retained. The decision to retain
an item was made on the basis of our judgment as to which items best differentiated
asexual from sexual participants. Thirty-seven items were selected based on the
results of this item selection process.
In Stage 3, these 37 items were administered to another group of asexual (n = 316)
and sexual (n = 926) participants. As in Stage 2, the items were analyzed for the
purpose of selecting those items that best loaded on the asexual versus the sexual
factors. Of the 37 original items subjected to item analysis, 12 items were retained,
and 25 were discarded.
In order to assess whether the measure was useful over and above already-
available measures of sexual orientation, we compared the AIS to an
adaptation of a previously established measure of sexual orientation.
Sexual and asexual participants significantly differed in their AIS total scores
with a large effect size. Further, the AIS passed tests of known-groups,
incremental, convergent, and discriminant validity. This suggests that the AIS
is a useful tool for identifying asexuality, and could be used in future research
to identify individuals with a lack of sexual attraction.
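As a sketch of what the known-groups comparison amounts to statistically, here is Cohen's d computed on invented AIS-style totals; these numbers are not from the actual study:

```python
import numpy as np

rng = np.random.default_rng(5)
asexual = rng.normal(48, 6, 300)   # invented AIS total scores
sexual = rng.normal(30, 6, 900)

# Cohen's d with a pooled standard deviation.
n1, n2 = len(asexual), len(sexual)
pooled_sd = np.sqrt(((n1 - 1) * asexual.var(ddof=1) +
                     (n2 - 1) * sexual.var(ddof=1)) / (n1 + n2 - 2))
d = (asexual.mean() - sexual.mean()) / pooled_sd
print(f"Cohen's d = {d:.2f}")   # ~3.0, a very large effect
```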