
Chapter 08
Test Development

Multiple Choice Questions


 

1. Human asexuality is generally defined as 


A. the absence of sexual attraction to anything at all.
B. a sexual attraction only to other asexual people.
C. an unwillingness or inability to experience sexual arousal.
D. the absence of sexual attraction to anyone at all.

Accessibility: Keyboard Navigation


 

2. Estimates suggest that approximately __% of the population might be asexual. 


A. 1
B. 2
C. 3
D. 4

Accessibility: Keyboard Navigation


 

3. The concept of asexuality was first introduced by 


A. William Masters.
B. Alfred Kinsey.
C. Virginia Johnson.
D. William Masters and Virginia Johnson.

Accessibility: Keyboard Navigation


 


4. Asexuality 
A. is a sexual orientation.
B. is not a sexual orientation.
C. is considered by some to be a sexual orientation and by others not.
D. was de-listed as a sexual orientation in DSM-5.

Accessibility: Keyboard Navigation


 

5. It is an online community of asexual individuals which has become a source of recruitment
of subjects for asexuality research. It is called the 
A. Asexual Visibility and Education Network.
B. Friends of Asexuality.
C. League of Asexual and Non-Sexual Individuals.
D. American Society of Affiliated Individuals for Asexuality.

Accessibility: Keyboard Navigation


 

6. A disadvantage of recruiting asexual research subjects from a single online community is
that 
A. the persons belonging to the online community may constitute a unique group within the
asexual population.
B. the persons belonging to the online community have already acknowledged their asexuality
as an identity.
C. asexual individuals who do not belong to the community will be systematically omitted.
D. All of these.

Accessibility: Keyboard Navigation


 


7. In response to the need for an instrument to help identify individuals who have experienced
a lifelong lack of sexual attraction, but who have never heard the term "asexual," Yule et al.
(2015) developed a test called the 
A. Asexuality Evaluation Schedule.
B. Asexuality Identification Scale.
C. Asexual Research Subject Selector.
D. None of these

Accessibility: Keyboard Navigation


 

8. Many asexual individuals refer to themselves as 


A. "selfies".
B. "ace".
C. "lone rangers".
D. "gender-neutral".

Accessibility: Keyboard Navigation


 

9. The test of asexuality developed by Yule et al. (2015) contains ___ items. 
A. 12
B. 18
C. 36
D. 48

Accessibility: Keyboard Navigation


 

10. Brotto and Yule reported that their measure of asexuality was developed in four stages.
Which best characterizes Stage 1? 
A. literature search for definitions of asexuality
B. development of open-ended questions
C. literature search for correlates of asexuality
D. writing and submission of a research grant request

Accessibility: Keyboard Navigation


 


11. Brotto and Yule reported that their measure of asexuality was developed in four stages.
Which best characterizes what they did during Stages 2 and 3? 
A. analysis of variance
B. regression analysis
C. factor analysis
D. meta-analysis

Accessibility: Keyboard Navigation


 

12. In the course of developing their asexuality measure, Brotto and Yule were able to identify
about ____% of self-identified asexual individuals. 
A. 88
B. 93
C. 94
D. 97

Accessibility: Keyboard Navigation


 

13. In order to determine whether their new measure of asexuality was useful over and above
already-available measures of sexual orientation, Brotto and Yule compared it to a previously
established measure of sexual orientation called the 
A. Sexual Desire Inventory.
B. Solitary Desire subscale of the Sexual Desire Inventory.
C. Abernathy Measure of Sexual Orientation.
D. Klein Scale.

Accessibility: Keyboard Navigation


 


14. Brotto and Yule established the discriminant validity of their measure of asexuality by
comparing scores on it with scores on 
A. the Childhood Trauma Questionnaire.
B. the Short-Form Inventory of Interpersonal Problems-Circumplex scales.
C. the Big-Five Inventory.
D. All of these

Accessibility: Keyboard Navigation


 


15. According to Brotto and Yule, their new measure of asexuality performed satisfactorily
on 
A. a measure of incremental validity.
B. a measure of convergent validity.
C. a measure of discriminant validity.
D. All of these

Accessibility: Keyboard Navigation


 

16. Brotto and Yule expressed their belief that their new measure of asexuality 
A. does not depend on one's self-identification as asexual.
B. is not capable of identifying the individual who exhibits characteristics of a lifelong lack of
sexual attraction in the absence of personal distress.
C. should be used with caution as a tool of recruitment with members of the asexuality
population.
D. All of these

Accessibility: Keyboard Navigation


 

17. An analysis of a test's items may take many forms. Thinking of the descriptions cited in
your text, which is NOT one of those forms? 
A. item validity analysis
B. item discrimination analysis
C. item tryout analysis
D. item reliability analysis

Accessibility: Keyboard Navigation


 


18. As illustrated in the sample item-characteristic curve published in your textbook, the
vertical axis on the graph lists the 
A. values of the score on the test ranging from 0 to 100.
B. values of the characteristic of the items on a scale of 1 to 10.
C. heteroscedasticity of the item curve in values ranging from 0 to infinity.
D. probability of correct response in values ranging from 0 to 1.

Accessibility: Keyboard Navigation


 


19. Which statement is TRUE regarding test development and testtaker guessing? 


A. Methods have been designed to detect guessing.
B. Methods have been designed to statistically correct for guessing.
C. Methods have been designed to minimize the effects of guessing.
D. All of these

Accessibility: Keyboard Navigation


 

20. Item banks 
A. were once a profit center for the Wells Fargo Company.
B. originated as a result of investments made by Morgan-Stanley.
C. originated as a result of investments made by Morgan Freeman.
D. None of these

Accessibility: Keyboard Navigation


 

21. An item bank is 


A. a computerized system whereby test items "pay dividends" only when used.
B. the optimum combination of reliability and validity in an item.
C. a set of items from which a test can be constructed.
D. a statistical "IRA" for data relating to high and low scorers on a test.

Accessibility: Keyboard Navigation


 

22. Item branching refers to 


A. administering certain test items on a test depending on the testtakers' responses to previous
test items.
B. the creation of alternate and parallel forms of tests based on a group of testtakers' responses
to the original test.
C. statistical efforts to ensure that items translated into foreign languages are of the same
difficulty.
D. re-using items in an original test that were originally developed for use in a parallel test.

Accessibility: Keyboard Navigation
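
Illustrative note for item 22: a minimal Python sketch of the response-contingent logic behind item branching. The item names and routing rules below are entirely hypothetical, not from any published test.

    # Item branching: the next item administered depends on the response to the
    # current item (hypothetical routing table, for illustration only).
    routing = {
        ("item_1", "correct"):   "item_3_harder",
        ("item_1", "incorrect"): "item_2_easier",
    }

    def next_item(current_item, response):
        # Fall back to ending the test if no branch is defined for this response.
        return routing.get((current_item, response), "end_of_test")

    print(next_item("item_1", "correct"))     # item_3_harder
    print(next_item("item_1", "incorrect"))   # item_2_easier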


 


23. An anchor protocol is 


A. a previously developed test with known validity that can be used as a comparison for
newly developed tests.
B. a statistical procedure in which weights are assigned to each item of a model test to
maximize predictive validity.
C. a list of guidelines for a standardized test used to ensure that all testtakers are similar in key
ways to the population of the original standardization sample.
D. a model for scoring and a mechanism for resolving scoring discrepancies.

Accessibility: Keyboard Navigation


 

24. Scoring drift refers to 


A. the tendency of scorers to give higher scores to testtakers with certain characteristics (such
as age and gender) that are similar to their own.
B. differences between the typical scoring of an item during standardization and subsequent,
more authoritative scoring of an item.
C. a gradual decline in inter-scorer reliability after 95% of the examinations have been scored
due to scorer fatigue.
D. a flexible method of scoring test items for populations other than that of the
standardization sample.

Accessibility: Keyboard Navigation


 

25. Item analysis is conducted to evaluate 


A. item reliability.
B. item validity.
C. item difficulty.
D. All of these

Accessibility: Keyboard Navigation


 


26. The idea for a new test may come from 


A. social need.
B. review of the available literature.
C. common sense appeal.
D. All of these

Accessibility: Keyboard Navigation


 

27. According to the text, which statement is TRUE of scaling? 


A. There is only one best approach to scaling and only one best type of scale.
B. Ratio scaling leads to the least scoring drift.
C. Ratio scaling was first developed in the Republic of Samoa.
D. None of these

Accessibility: Keyboard Navigation


 

28. Guttman scales 
A. are typically used with nominal categories.
B. typically are constructed so that agreement with one statement may predict agreement with
another statement.
C. typically are constructed so that agreement with one statement should not be correlated
with agreement with any other statement.
D. were originally developed by a Peace Corps task force.

Accessibility: Keyboard Navigation


 

29. Sorting techniques can be employed to develop 


A. nominal scales.
B. ordinal scales.
C. interval scales.
D. All of these

Accessibility: Keyboard Navigation


 


30. Test items that contain alternatives with five points ranging from "strongly agree" to
"strongly disagree" are characterized as using this approach to scaling: 
A. Guttman scaling.
B. Likert scaling.
C. Nielson scaling.
D. Opinion scaling.

Accessibility: Keyboard Navigation


 

31. Ideally, the first draft of a test should include at least how many items as compared with
the final version of the test? 
A. about twice the number of the final version
B. about half the number of the final version
C. about three times the number of the final version
D. roughly the same number as the final version

Accessibility: Keyboard Navigation


 

32. The elements of a multiple-choice item include 


A. a stem.
B. a distractor.
C. a foil.
D. All of these

Accessibility: Keyboard Navigation


 

33. Which is an example of the selected-response item format? 


A. a multiple-choice item
B. a fill-in-the-blank item
C. Both a multiple-choice item and a fill-in-the-blank item
D. None of these

Accessibility: Keyboard Navigation


 


34. A well-written true-false item 


A. includes multiple ideas.
B. has a correct response that is either true or false, and not subject to debate.
C. typically contains irrelevant information as a distracter.
D. Both includes multiple ideas and has a correct response that is either true or false, and not
subject to debate.

Accessibility: Keyboard Navigation


 

35. Multiple-choice items draw primarily on which testtaker ability? 


A. recognition.
B. organization.
C. planning.
D. perceptual-motor skills.

Accessibility: Keyboard Navigation


 

36. An example of a selected-response type of item is 


A. a multiple-choice item.
B. an essay item.
C. a matching item.
D. Both a multiple-choice item and a matching item.

Accessibility: Keyboard Navigation


 

37. With regard to the test tryout phase of test development, 


A. test conditions should be as similar to the actual administration as possible.
B. at least 500 subjects should be included to ensure accurate results.
C. the sample used must be nationally representative.
D. All of these

Accessibility: Keyboard Navigation


 


38. According to your textbook, the minimum sample for a test tryout is 
A. one-half of the number of testtakers in the standardization sample.
B. 25 testtakers.
C. 50 testtakers.
D. 500 testtakers.

Accessibility: Keyboard Navigation


 

39. An ADVANTAGE of applying item response theory (IRT) in test development is that 
A. the principles underlying IRT make its application easy and appealing.
B. sample sizes used to test the utility of test items can be relatively small.
C. assumptions underlying IRT usage are weak.
D. item statistics are independent of the samples administered the test.

Accessibility: Keyboard Navigation


 

40. If 100 people take a test and 20 of those testtakers answer a particular item correctly, then
the p value of the item is 
A. .25.
B. .20.
C. .40.
D. .04.

Accessibility: Keyboard Navigation
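
Illustrative note for item 40: the arithmetic behind the item-difficulty (p) value, as a short Python sketch using the counts given in the item.

    # Item-difficulty index (p): the proportion of testtakers who answer the item correctly.
    n_testtakers = 100
    n_correct = 20
    p = n_correct / n_testtakers
    print(p)   # 0.2 -- p ranges from 0 to 1, and the higher p is, the easier the item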


 

41. Which statement best describes the relationship between item difficulty and a "good"
item? 
A. The difficulty level is not a factor in determining a "good" item.
B. An item with a high difficulty level is likely to be "good."
C. An item with a mid-range difficulty level is likely to be "good."
D. An item with a low difficulty level is likely to be "good."

Accessibility: Keyboard Navigation


 


42. An item-difficulty index can range from 


A. 0 to 1.
B. .10 to .99.
C. .25 to .75.
D. 0 to 100.

Accessibility: Keyboard Navigation


 

43. An item-difficulty index of 1 occurs when 


A. all examinees answer the item incorrectly.
B. all examinees answer the item correctly.
C. examinees are evenly divided between correct and incorrect responses.
D. None of these

Accessibility: Keyboard Navigation


 

44. The higher the item-difficulty index, the ________ the item. 


A. easier
B. harder
C. more robust
D. less robust

Accessibility: Keyboard Navigation


 

45. An item-endorsement index is most likely to be used in which type of test? 


A. a cognitive test
B. an achievement test
C. a vocational aptitude test
D. a personality test

Accessibility: Keyboard Navigation


 


46. In item analysis, the term item endorsement refers to the percent of testtakers who 
A. responded correctly to a particular item.
B. indicate that they agree with a particular item.
C. passed the item on a pass/fail test of ability.
D. consented to answer an optional item.

Accessibility: Keyboard Navigation


 

47. The item-validity index is key in determining 


A. construct validity.
B. criterion-related validity.
C. content validity.
D. All of these

Accessibility: Keyboard Navigation


 

48. It is needed to calculate the item-validity index. It is 


A. the point-biserial correlation between the item score and the criterion score.
B. the mean of the item-score distribution.
C. the item-score standard deviation.
D. All of these

Accessibility: Keyboard Navigation


 

49. An item-reliability index provides a measure of a test's 


A. test-retest reliability.
B. internal consistency.
C. stability.
D. All of these

Accessibility: Keyboard Navigation


 


50. To calculate an item-reliability index, one must have previously calculated 


A. the correlation between the item score and the criterion.
B. the correlation between the item score and the total score.
C. the item-score standard deviation.
D. All of these

Accessibility: Keyboard Navigation
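
Illustrative note for items 48-50: both indices combine an item-score standard deviation with a correlation (item score with an external criterion for the item-validity index, item score with the total test score for the item-reliability index). The sketch below uses hypothetical scores for six testtakers; statistics.correlation requires Python 3.10 or later.

    from statistics import pstdev, correlation   # correlation() needs Python 3.10+

    # Hypothetical data: dichotomous scores on one item, total test scores,
    # and scores on an external criterion measure for six testtakers.
    item      = [1, 1, 0, 1, 0, 0]
    total     = [52, 47, 33, 45, 30, 28]
    criterion = [61, 58, 40, 50, 37, 35]

    s_item = pstdev(item)                                            # item-score standard deviation
    item_reliability_index = s_item * correlation(item, total)       # uses the item-total correlation
    item_validity_index    = s_item * correlation(item, criterion)   # uses the item-criterion (point-biserial) correlation
    print(round(item_reliability_index, 3), round(item_validity_index, 3))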


 

51. What is the optimal item-difficulty level for a true-false item? 


A. .500
B. .625
C. .750
D. 1.000

Accessibility: Keyboard Navigation
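
Illustrative note for item 51: the usual guessing-adjusted rule for optimal item difficulty, worked out for the true-false case as a small arithmetic sketch.

    # Optimal difficulty corrected for chance: the midpoint between 1.00 and the
    # probability of answering correctly by random guessing.
    chance_true_false = 1 / 2                 # two response options
    optimal = (1.00 + chance_true_false) / 2
    print(optimal)                            # 0.75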


 

52. An item-discrimination index typically compares 


A. high scorers' performances with low scorers' performances on a particular item.
B. medium scorers' performances with low and high scorers' performances on a particular
item.
C. low scorers' performances with lower scorers' performances on a particular item.
D. one group of scorers' performances on the item with any other groups of scorers'
performances on the same item.

Accessibility: Keyboard Navigation


 

53. Which statement is TRUE regarding an item-discrimination index? 


A. It has been used by e-Harmony.com and other dating sites for matchmaking.
B. There is more than one formula for calculating an item-discrimination index.
C. Tetrachoric correlation is most frequently used in any formula for an item-discrimination
index.
D. All of these.

Accessibility: Keyboard Navigation


 


54. As a distribution of scores gets flatter, what happens to the optimal boundary line for
determining higher- and lower-scoring groups for item-discrimination indices? 
A. the optimal boundary line gets smaller
B. the optimal boundary line gets larger
C. the optimal boundary line does not change
D. the optimal boundary line ceases to be optimal

Accessibility: Keyboard Navigation


 

55. The greater the magnitude of the item-discrimination index, the more testtakers in the
higher-scoring group answered the item correctly, as compared to testtakers 
A. who served as the non-test-taking control group.
B. in the lower-scoring group.
C. who participated in the test standardization.
D. None of these

Accessibility: Keyboard Navigation


 

56. Item-discrimination indexes can range from 


A. .001 to 1.00.
B. -1 to +1.
C. 0% to 100%.
D. 1 to 100.

Accessibility: Keyboard Navigation


 

57. A negative item-discrimination index results for a particular item when 


A. more high scorers than low scorers on a test get the item correct.
B. more low scorers than high scorers on a test get the item correct.
C. an item is found to be biased and unfair.
D. most testtakers do not enter the response keyed correct for the particular item.

Accessibility: Keyboard Navigation


 


58. What is the value of the item-discrimination index for an item that all the students in the
higher-scoring group answered correctly but that no one in the lower-scoring group answered
correctly? 
A. -1
B. +1
C. .50
D. .25

Accessibility: Keyboard Navigation


 

59. What is the value of the item-discrimination index for an item answered correctly by an
equal number of students in the higher- and lower-scoring groups? 
A. -1
B. +1
C. .50
D. 0

Accessibility: Keyboard Navigation
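
Illustrative note for items 58 and 59: a small Python sketch of the item-discrimination index d. The group size of 13 testtakers per group is hypothetical.

    # d = (U - L) / n: U and L are the numbers of correct responses to the item in the
    # upper- and lower-scoring groups, and n is the number of testtakers in each group.
    def item_discrimination(u_correct, l_correct, n_per_group):
        return (u_correct - l_correct) / n_per_group

    print(item_discrimination(13, 0, 13))   #  1.0  (all of the upper group, none of the lower group)
    print(item_discrimination(7, 7, 13))    #  0.0  (equal numbers correct in both groups)
    print(item_discrimination(3, 10, 13))   # about -0.54 (negative d: low scorers outperform high scorers)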


 

60. An item-characteristic curve includes all of the following EXCEPT 


A. information that can be used to judge item bias.
B. information that can be used to judge item fairness.
C. item-discrimination information.
D. item-difficulty information.

Accessibility: Keyboard Navigation


 

61. The best type of item yields an item-characteristic curve that 


A. has a positive slope.
B. has a negative slope.
C. is leptokurtic.
D. has few, if any, outliers.

Accessibility: Keyboard Navigation


 


62. Which is TRUE with regard to latent-trait models? 


A. The latent trait is multidimensional.
B. The latent trait is unidimensional.
C. The latent trait cannot be measured by traditional models.
D. The latent trait surfaces before age 12.

Accessibility: Keyboard Navigation


 

63. Which statement is TRUE of guessing? 


A. It occurs more often on achievement than personality tests.
B. It poses methodological problems for the testtaker.
C. Most testtakers guess based on little knowledge of the subject matter.
D. It poses methodological problems for the test developer.

Accessibility: Keyboard Navigation


 

64. Which is TRUE of item-characteristic curves? 


A. They determine which items are fair.
B. They may be used as an aid in assessing whether or not items are biased.
C. They determine which items are most reliable under specified conditions.
D. They may be used as an aid in determining the kurtosis of a distribution of test scores.

Accessibility: Keyboard Navigation


 

65. All of the following are methods of evaluating item bias EXCEPT 


A. noting differences between the item-characteristic curves.
B. noting differences in the item-difficulty levels.
C. noting differences in item-discrimination indexes.
D. noting differences in validity shrinkage.

Accessibility: Keyboard Navigation


 


66. In general, what can be said about an item analysis of a speeded test? 
A. Results are often misleading and difficult to interpret.
B. Item-difficulty levels are higher toward the end of the test.
C. Item-discrimination levels are higher for later items.
D. All of these

Accessibility: Keyboard Navigation


 

67. Generous time limits are typically associated with 


A. speeded conditions.
B. power conditions.
C. untimed conditions.
D. hazardous conditions.

Accessibility: Keyboard Navigation


 

68. Ability tests are typically standardized on a sample that is representative of the general
population and selected on the basis of variables such as 
A. age.
B. gender.
C. geographic region.
D. All of these

Accessibility: Keyboard Navigation


 

69. Ideally, psychological or educational tests are revised 


A. every decade.
B. when the test is no longer useful.
C. as a function of annual test sales.
D. None of these

Accessibility: Keyboard Navigation


 


70. Which of the following conditions may lead to the decision to revise a psychological or
educational test? 
A. item content, including the vocabulary used in instructions and pictures, has become dated
B. test norms no longer represent the population for which the test is designed
C. reliability and validity of a test can be improved by a revision
D. All of these

Accessibility: Keyboard Navigation


 

71. As part of the test development process, a test revision may entail 
A. re-wording, deletion, or development of new items.
B. development of a new edition of a test.
C. the reprinting of a test.
D. Both re-wording, deletion, or development of new items and development of a new edition
of a test.

Accessibility: Keyboard Navigation


 

72. With regard to the test revision process, it typically 


A. takes about one year to complete.
B. includes all of the steps that the initial test development included.
C. is much less expensive than the original development of a test.
D. All of these

Accessibility: Keyboard Navigation


 

73. Co-validation is: 
A. highly recommended and encouraged by test professionals.
B. also referred to as co-norming.
C. a strategy that can save time and money for the test publisher.
D. Both also referred to as co-norming and a strategy that can save time and money for the
test publisher.

Accessibility: Keyboard Navigation


 


74. During the norming of a new intelligence test, a test publisher administers to all of the
testtakers not only the new intelligence test, but a vision test using an eye chart. The publisher
has engaged in 
A. test conceptualization.
B. cross-validation.
C. shared validation.
D. None of these

Accessibility: Keyboard Navigation


 

75. Which is TRUE of cross-validation of a test after standardization has occurred? 


A. Cross-validation creates confusion regarding the meaning of the original standardization
data.
B. The cross-validation sample is composed of the same testtakers that participated in the
original test standardization.
C. Cross-validation often results in validity shrinkage.
D. All of these

Accessibility: Keyboard Navigation


 

76. The term used to describe the decrease in item validities that typically occurs during
cross-validation is 
A. validity detriment.
B. validity decrement.
C. validity shrinkage.
D. cross-validation devaluation.

Accessibility: Keyboard Navigation


 


77. A test manual for a commercially prepared test should ideally include 
A. a description of the test development procedures used.
B. test-retest reliability data.
C. internal-consistency reliability data.
D. All of these

Accessibility: Keyboard Navigation


 


78. A student raises concern that a professor has given different grades to two essay answers
that are very similar. From a psychometric perspective, the student is expressing concerns
about 
A. criterion-related validity.
B. rater error.
C. test-retest reliability.
D. parallel forms reliability.

Accessibility: Keyboard Navigation


 

79. A student complains that a midterm examination did not include items from a particular
in-class lecture. From a psychometric perspective, the student is expressing concern about
the midterm's 
A. test-retest reliability.
B. internal consistency reliability.
C. content validity.
D. cross-validation.

Accessibility: Keyboard Navigation


 

80. A student makes the following complaint after taking an exam: "I spent all night studying
Chapter 7 and there wasn't even one test question from that chapter!" From a psychometric
perspective, this student is concerned about the exam's 
A. error variance.
B. test-retest reliability.
C. rater error.
D. None of these

Accessibility: Keyboard Navigation


 


81. A professor who asks a colleague to re-grade a set of essay questions is most likely trying
to address or prevent concerns about: 
A. rater error.
B. validity shrinkage.
C. criterion-related validity.
D. test-retest reliability.

Accessibility: Keyboard Navigation


 

82. Most classroom tests developed by instructors for use in their own classroom are 
A. subjected to formal procedures of psychometric evaluation.
B. only evaluated formally for content validity.
C. evaluated informally for their psychometric properties.
D. used without modification, year after year, until retirement or death.

Accessibility: Keyboard Navigation


 

83. Who is best associated with the development of the scaling methodology? 


A. Galton
B. Cohen
C. Spearman
D. Thurstone

Accessibility: Keyboard Navigation


 

84. Which scaling method entails a process by which measures of item difficulty are obtained
from samples of testtakers who vary in ability? 
A. difficulty scaling
B. absolute scaling
C. content scaling
D. sample-contingent scaling

Accessibility: Keyboard Navigation


 


85. The Likert scale is an example of which type of rating scale? 


A. categorical
B. paired methods
C. summative
D. content

Accessibility: Keyboard Navigation
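
Illustrative note for item 85: a minimal sketch of why Likert scaling is classed as summative; the scale score is simply the sum of the individual item ratings. The ratings below are hypothetical.

    # Summative (Likert) scoring: each item is rated on, say, a 1-5 agree-disagree
    # continuum and the scale score is the sum (or mean) of those ratings.
    ratings = [5, 4, 4, 2, 5]      # hypothetical responses to five attitude items
    print(sum(ratings))            # 20 -- higher totals indicate stronger endorsement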


 

86. Which is NOT a typical question that is raised and answered during the test
conceptualization stage of test development? 
A. What is the objective of the test?
B. Is there a need for the test?
C. How valid are the items on the test?
D. What types of responses will be required of the testtaker?

Accessibility: Keyboard Navigation


 

87. Which is a major difference between comparative scaling and categorical scaling? 


A. Comparative scaling involves sorting stimuli; categorical scaling does not.
B. Comparative scaling involves making quantitative judgments; categorical scaling does not.
C. Comparative scaling involves putting stimulus cards in a set number of different piles
assigned a certain meaning; categorical scaling does not.
D. Comparative scaling involves rank-ordering each stimulus individually against every other
stimulus; categorical scaling does not.

Accessibility: Keyboard Navigation


 


88. In Guttman scaling 


A. testtakers are presented with a forced-choice format.
B. each item is completely independent of every other item and nothing can be concluded as
the result of the endorsement of an item.
C. when one item is endorsed by a testtaker, the less extreme aspects of that item are also
endorsed.
D. when more than one item tapping a particular content area is endorsed, the less extreme
aspects of those items are also endorsed.

Accessibility: Keyboard Navigation
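
Illustrative note for item 88: a small sketch of the cumulative response pattern a Guttman scale assumes, with items ordered from least to most extreme and endorsement coded 1. The response patterns below are hypothetical.

    # In a perfect Guttman scale, endorsing an item implies endorsing every less
    # extreme item, so valid patterns look like 1,1,1,0,0 rather than 1,0,1,0,0.
    def is_cumulative(pattern):
        first_zero = pattern.index(0) if 0 in pattern else len(pattern)
        return all(response == 0 for response in pattern[first_zero:])

    print(is_cumulative([1, 1, 1, 0, 0]))   # True  -- consistent with a Guttman scale
    print(is_cumulative([1, 0, 1, 0, 0]))   # False -- a scaling "error" under the model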


 

89. Which is TRUE of Thurstone's equal-appearing intervals method of scaling? 


A. It is relatively simple to construct.
B. It demands that the testtaker sort item responses into stacks of similar content.
C. It uses judges' ratings to assign values to items.
D. It is typically devised using proprietary software developed by Louis Thurstone's
grandchildren.

Accessibility: Keyboard Navigation


 

90. When writing items for a test, a test developer would be well advised to incorporate 
A. knowledge acquired from Cohen & Swerdlik (2017).
B. knowledge from information supplied in scholarly journals.
C. interviews with experts.
D. All of these

Accessibility: Keyboard Navigation


 

91. Which is an example of the use of a completion format on a test? 


A. true-false items
B. matching items
C. short-answer items
D. multiple-choice items

Accessibility: Keyboard Navigation


 


92. Which is a major difference between multiple-choice questions and essay questions? 


A. Essay questions involve primarily recognition, while multiple-choice questions involve
logical reasoning.
B. Essay questions are scored more objectively because the examiner is provided with more
information from the examinee.
C. Essay questions can test a wider range of material.
D. Essay questions allow for more creativity to be expressed by the examinee.

Accessibility: Keyboard Navigation


 

93. An advantage of using a true-false item format over a multiple-choice item format in a
teacher-made test designed for classroom use is 
A. true-false items are applicable to a wider range of subject areas.
B. true-false items are easier to write.
C. true-false items reduce the odds of a correct answer as the result of guessing.
D. true-false items will never become dated.

Accessibility: Keyboard Navigation


 

94. In a cumulative model of scoring applied to an ability test 


A. the higher the total score, the higher the testtaker is on the ability measured by the test.
B. the pattern of responses is critically important when judging the ability of the testtaker.
C. comparisons of the testtaker's performance on tests tapping similar abilities may easily be
made.
D. All of these

Accessibility: Keyboard Navigation


 


95. In ipsative scoring, a testtaker's scores are compared to 


A. the scores of other testtakers from the same geographic area who are similar with regard to
key demographic variables.
B. his or her other scores on the same test.
C. the scores of other testtakers from past years who have taken the same test under the same
or similar conditions.
D. his or her other scores on a parallel form of the same test.

Accessibility: Keyboard Navigation
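
Illustrative note for item 95: in an ipsative frame of reference the comparison is within one testtaker's own profile, not against other testtakers. The scale names and raw scores below are hypothetical.

    # Ipsative interpretation: rank a testtaker's scores on the scales of one test
    # against one another rather than against a norm group.
    profile = {"exhibitionism": 24, "autonomy": 18, "stability": 11}   # hypothetical raw scores
    ranked = sorted(profile, key=profile.get, reverse=True)
    print(ranked)   # ['exhibitionism', 'autonomy', 'stability'] -- relative strengths within the person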


 

96. An individually administered test designed for use with elementary-school-age students is in
the test tryout stage of test development. For the purposes of the tryout, this test should be
administered 
A. as a group test to as many classes as possible in an elementary school.
B. individually to high school students for exploratory purposes.
C. individually to elementary-school-age students in an environment that simulates the way
that the final version of the test will be administered.
D. to experts in elementary school education to ensure that the items are appropriate for
elementary school-aged children.

Accessibility: Keyboard Navigation


 

97. A decision is made to use only a few subjects per item during the test tryout phase of a
test's construction. This decision is MOST LIKELY to lead to 
A. "phantom factors" during test construction.
B. "phantom factors" during the test administration.
C. "phantom factors" during factor analysis.
D. "phantom deposits" in the test author's royalty account.

Accessibility: Keyboard Navigation


 


98. A DISADVANTAGE of applying classical test theory (CTT) in test development is that 
A. the number of testtakers in the sample must be very large.
B. all CTT-based statistics are sample-dependent.
C. assumptions underlying CTT use are weak.
D. All of these

Accessibility: Keyboard Navigation


 

99. A "good" test item on an ability test is one 


A. to which almost all testtakers respond correctly.
B. that distinguishes high scorers from low scorers.
C. to which almost all testtakers respond incorrectly.
D. in which it is absolutely impossible to guess the correct answer.

Accessibility: Keyboard Navigation


 

100. The optimal level of item difficulty is MOST typically 


A. .5.
B. the midpoint between 1.0 and the chance of success by random guessing.
C. .25.
D. the midpoint between 0 and the chance of success by random guessing.

Accessibility: Keyboard Navigation
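
Illustrative note for item 100: the same midpoint rule worked out for a few chance levels. The item formats chosen below are just examples.

    # Optimal p level = midpoint between 1.00 and the chance-success level.
    for n_options in (2, 4, 5):                 # true-false, 4-option MC, 5-option MC
        chance = 1 / n_options
        print(n_options, (1.00 + chance) / 2)   # 0.75, 0.625, 0.6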


 

101. Test developers calculate an item-validity index to: 


A. understand why an item is difficult or easy.
B. reduce the likelihood of an examinee's guessing.
C. maximize the criterion-related validity.
D. determine the internal consistency of the test.

Accessibility: Keyboard Navigation


 


102. The higher an item-validity index, the greater the __________ validity. 


A. construct
B. content
C. criterion
D. face

Accessibility: Keyboard Navigation


 

103. The higher the item-reliability index, 


A. the higher the internal consistency of the test.
B. the lower the internal consistency of the test.
C. the more likely the testtaker is to miss the item.
D. the more likely the test developer is to eliminate the item.

Accessibility: Keyboard Navigation


 

104. Factor analysis can help the test developer 


A. to eliminate or revise items that do not load on the predicted factor.
B. to identify whether test items appear to be measuring the same construct.
C. Both to eliminate or revise items that do not load on the predicted factor and to identify
whether test items appear to be measuring the same construct.
D. None of these

Accessibility: Keyboard Navigation


 

105. An item-discrimination index is used on an ability test 


A. to determine whether items are measuring what they are designed to measure.
B. to measure the difference between how many high scorers and how many low scorers
answered the item correctly.
C. to estimate how predictive the item is of the testtaker's future performance.
D. to measure the difference between how many median scorers and how many low scorers
answered the item correctly.

Accessibility: Keyboard Navigation


 


106. If an item-discrimination index is negative 


A. high scorers are more likely to have answered the item correctly than low scorers.
B. low scorers are more likely to have answered the item correctly than high scorers.
C. the alternate form of the test is probably not equivalent.
D. the computer scoring is in error because this index is not supposed to be negative.

Accessibility: Keyboard Navigation


 

107. An analysis of item alternatives for a multiple-choice test can yield information about 
A. the effectiveness of distracter choices.
B. which items are in need of revision.
C. testtaker response patterns.
D. All of these

Accessibility: Keyboard Navigation


 

108. An item-characteristic curve 


A. is the single best index of guessing a test user has.
B. plots the reliability and the validity of the item.
C. Both is the single best index of guessing a test user has and plots the reliability and the
validity of the item.
D. None of these

Accessibility: Keyboard Navigation


 

109. When an item-characteristic curve of an ability test has an inverted U shape, it usually
indicates that 
A. testtakers of moderate ability have the highest probability of answering the item correctly.
B. testtakers of low ability have the highest probability of answering the item correctly.
C. testtakers of high ability have the highest probability of answering the item correctly.
D. the item is working as well as any item on this test could be expected to work.

Accessibility: Keyboard Navigation


 


110. Which is TRUE of the latent-trait model of measurement? 


A. Most research conducted on this model has been applied to achievement tests.
B. The variables being measured by a test are not directly observable and are assumed to be
unidimensional.
C. This model is applicable to all tests.
D. The variables being measured by a test are not directly observable and are assumed to be
multidimensional.

Accessibility: Keyboard Navigation


 

111. On a particular test, men and women tend to have the same total score. Men and women
do, however, tend to exhibit different response patterns to specific items. A reasonable
conclusion is that the test is: 
A. unreliable.
B. invalid.
C. biased.
D. patently unfair.

Accessibility: Keyboard Navigation


 

112. A sensitivity review typically focuses on which of the following? 


A. individual test items
B. the standardization sample
C. statistics used as part of validity and reliability studies
D. the extent to which latent traits are latent

Accessibility: Keyboard Navigation


 


113. As the result of a sensitivity review, items containing __________ may be eliminated
from a test. 
A. offensive language
B. stereotypes
C. unfair reference to situations
D. All of these

Accessibility: Keyboard Navigation


 


114. A sensitivity review panel would most likely be made up of 


A. only psychologists from the majority group.
B. only psychologists from a particular minority group.
C. psychologists representing both minority and majority groups.
D. measurement specialists from all continents known for their sensitivity.

Accessibility: Keyboard Navigation


 

115. As part of the process of test development, the term test revision BEST refers to the 
A. rewording, deletion, or development of new items.
B. development of a completely new test.
C. reprinting of a test after a previous edition has sold out.
D. Both rewording, deletion, or development of new items and development of a completely
new test.

Accessibility: Keyboard Navigation


 

116. The think aloud test administration format 


A. has examinees literally thinking aloud as they respond to each item on a test.
B. is a qualitative technique.
C. can help test developers understand how an examinee interprets particular items.
D. All of these

Accessibility: Keyboard Navigation


 

117. Expert panels may be used in the process of test development to 


A. provide judgments concerning each item's reliability.
B. serve as expert witnesses in any future litigation.
C. screen test items for possible bias.
D. All of these

Accessibility: Keyboard Navigation


 


118. Having a large item pool available during test revision is 


A. a disadvantage due to the great expense of item development.
B. often a waste of time because many of the items are eventually deleted.
C. an advantage because poor items can be deleted in favor of the good items.
D. a great perk for test developers who are swimming enthusiasts.

Accessibility: Keyboard Navigation


 

119. A test developer designs a test for the sole purpose of identifying the most highly skilled
individuals among those tested. During the test revision stage of test development, the test
developer will be particularly interested in 
A. item bias.
B. item discrimination.
C. item reliability.
D. item validity.

Accessibility: Keyboard Navigation


 

120. In creating a test designed to measure personality constructs, the test developer's first
step would BEST be to 
A. determine which items would lead to socially desirable responses.
B. create a large pool of potential items.
C. define the construct or constructs being measured.
D. select a representative sample of testtakers for test tryout.

Accessibility: Keyboard Navigation


 


121. The following item appears on an end-of-semester course evaluation in a test and
measurements course: The most interesting class I am taking this semester is "Tests and
Measurements." The possible responses are:

1. strongly agree.
2. agree.
3. unsure.
4. disagree.
5. strongly disagree.

This item illustrates what approach to scaling? 


A. nomothetic
B. Likert
C. Guttman
D. ipsative

Accessibility: Keyboard Navigation


 

122. Which is TRUE of item analysis on speeded tests? 


A. Results of the item analysis are relatively easy to interpret and are clear.
B. Item-difficulty levels are lower toward the end of the test.
C. Item-discrimination levels are higher toward the end of the test.
D. Later items tend to have low item-total correlations.

Accessibility: Keyboard Navigation


 

123. If 50 students were administered a classroom test, how many would be included in each
group for the purpose of calculating d, the item-discrimination index? 
A. 25
B. 10
C. 13
D. 17

Accessibility: Keyboard Navigation
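
Illustrative note for item 123: the arithmetic behind the two commonly used cutoffs for forming the upper and lower groups (27% of the distribution when scores are roughly normal, approaching 33% as the distribution flattens); which cutoff applies depends on the shape of the class's score distribution.

    # Upper- and lower-group sizes for computing d with 50 testtakers.
    n = 50
    print(round(n * 0.27, 1))   # 13.5 -> roughly 13 or 14 testtakers per group
    print(round(n * 0.33, 1))   # 16.5 -> roughly 16 or 17 testtakers per group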


 


124. The Rokeach values measure involves presenting the subject with index cards, on each
of which a single value is listed. Testtakers are asked to place the cards in order of their own
concern about each of the values. This procedure BEST exemplifies 
A. multidimensional scaling.
B. Likert scaling.
C. comparative scaling.
D. Murray scaling.

Accessibility: Keyboard Navigation


 

125. When analyzing a particular item's discriminative abilities for an ability test, the test
developer typically compares the responses to the item to 
A. the highest and lowest scorers on the test.
B. the highest and middle scorers on the test.
C. the performance on the test of minority groups to rule out any possible bias.
D. testtakers from predefined age groups to rule out any possible age discrimination.

Accessibility: Keyboard Navigation


 

126. In measurement employing latent-trait models 


A. the underlying trait is assumed to be unidimensional.
B. all the test items are thought to measure a single trait.
C. the characteristic being measured is not measured directly.
D. All of these

Accessibility: Keyboard Navigation


 

127. Criterion-referenced testing and assessment is most typically employed in 


A. licensing for occupations and professions.
B. the diagnosis of reading difficulties.
C. competition for scholarships.
D. situations where the criteria required for success are vague.

Accessibility: Keyboard Navigation


 


128. Items for an item bank 


A. may be taken from existing tests.
B. are always written especially for the item bank.
C. have never before been administered.
D. earn interest at prime minus one percent.

Accessibility: Keyboard Navigation


 

129. You are interested in developing a test for social adjustment in a college fraternity or
sorority. You begin by interviewing persons who had graduated from college after having
been members of a fraternity or sorority for at least 2 years. Which stage of the test
development process BEST describes the stage that you are in? 
A. the test-tryout stage
B. the pilot work stage
C. the test construction stage
D. None of these

Accessibility: Keyboard Navigation


 

130. These tests are often used for the purpose of licensing persons in professions. The tests
referred to here are 
A. pilot tests.
B. norm-referenced tests.
C. criterion-referenced tests.
D. Guttman scales.

Accessibility: Keyboard Navigation


 


131. It is a term that is used to refer to the preliminary research surrounding the creation of a
prototype of a test. Which of the following BEST describes that term? 
A. pilot work.
B. pilot study.
C. pilot research.
D. All of these

Accessibility: Keyboard Navigation


 


132. In his article entitled "A Method of Scaling Psychological and Educational Tests," L. L.
Thurstone introduced absolute scaling, which was a 
A. procedure for obtaining a measure of item validity.
B. procedure for obtaining a measure of item difficulty.
C. procedure for deriving equal-appearing intervals.
D. procedure for divining item reliability.

Accessibility: Keyboard Navigation


 

133. As with the use of other rating scales, the use of Likert scales typically yields _______-
level data. 
A. nominal
B. ordinal
C. interval
D. ratio

Accessibility: Keyboard Navigation


 

134. The method of paired comparisons is used to 


A. minimize the opportunity of selecting a socially desirable response.
B. maximize the opportunity of selecting a socially desirable response.
C. provide testtakers with a sufficient number of pairs of choices to express their "true"
opinions.
D. provide testtakers with a limited number of pairs of choices in order to minimize testing
time.

Accessibility: Keyboard Navigation


 


135. Likert scales measure attitudes using continuums. A continuum of items measuring
___________ could be used for a Likert scale. 
A. like it to do not like it
B. agree to disagree
C. approve to do not approve
D. All of these

Accessibility: Keyboard Navigation


 


136. In contrast to scaling methods that employ indirect estimation, scaling methods that
employ direct estimation do not require: 
A. writing two sets of items for parallel forms.
B. the use of the method of equal-appearing intervals.
C. transforming testtaker responses into some other scale.
D. indirect methods to interpret testtaker responses.

Accessibility: Keyboard Navigation


 

137. All of the following are components of a multiple-choice item EXCEPT 


A. a foil.
B. a correct alternative.
C. a stem.
D. a branch.

Accessibility: Keyboard Navigation


 

138. As described in the text, all of the following are elements of a matching item EXCEPT: 
A. a column listing propositions.
B. a column listing responses.
C. a column listing premises.
D. a place to insert the correct number or letter choice.

Accessibility: Keyboard Navigation


 

139. The two columns of a matching item may contain different numbers of items because this
makes 
A. the odds of cheating successfully on this type of item significantly less.
B. it more difficult to achieve a perfect score by guessing.
C. the role of chance a much greater factor than it would be otherwise.
D. it possible for testtakers to decline to respond to certain items.

Accessibility: Keyboard Navigation


 


140. Computer-adaptive testing has been found to 


A. reduce by as much as half the number of test items administered.
B. increase the number of test items administered by as much as double.
C. increase measurement error but within tolerable limits.
D. increase inter-item consistency by as much as 50%.

Accessibility: Keyboard Navigation


 

141. A strategy for cheating on an examination entails one testtaker memorizing items and
later recalling and reciting them for the benefit of a future testtaker. This cheating strategy
may be countered by 
A. a computer-tailored test administration to each testtaker.
B. a computer-randomized presentation of test items.
C. Both a computer-tailored test administration to each testtaker and a computer-randomized
presentation of test items.
D. None of these

Accessibility: Keyboard Navigation


 

142. On a true/false inventory, a respondent selects true for an item that reads, "I summer in
Tehran." The individual scoring the test would BEST interpret this response as indicative of
the fact that this respondent 
A. is extremely eccentric with respect to choice of time shares.
B. requires more sensation-seeking than Cape Cod has to offer.
C. is responding randomly to test items.
D. None of these

Accessibility: Keyboard Navigation


 


143. Jana takes a personality test administered by the "True Compatibility Dating Service."
According to the personalized, computerized personality profile that results, Jana learns that
her need for exhibitionism is much greater than her need for stability. Since the test analyzes
data only with regard to Jana, and no other client of the dating service, it may be assumed that
the test was scored using 
A. a diagnostic model.
B. a cumulative model.
C. an ipsative model of scoring.
D. truly compatible models.

Accessibility: Keyboard Navigation


 

144. A math test developer is interested in deriving an index of the difficulty of the average
item for his math test. As his consultant on test development, you advise him that this index
could be obtained by: 
A. identifying the item deemed to be average in difficulty and then deriving an item-difficulty
index for that item.
B. summing the item-difficulty indices for all test items and then dividing by the total number
of items on the test.
C. dividing the total number of items on the test by the average item-difficulty index.
D. raising that very same question to a more knowledgeable test development consultant.

Accessibility: Keyboard Navigation
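
To make the arithmetic behind item 144 concrete, here is a minimal Python sketch of computing the difficulty of the average item; the p-values shown are hypothetical.

    # Hypothetical item-difficulty indices (p-values): the proportion of
    # testtakers who answered each item correctly.
    p_values = [0.92, 0.75, 0.58, 0.40, 0.85]

    # The difficulty of the average item is the sum of the item-difficulty
    # indices divided by the number of items on the test.
    average_item_difficulty = sum(p_values) / len(p_values)

    print(round(average_item_difficulty, 2))  # 0.70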


 


145. A test developer of multiple-choice ability tests reviews data from a recent test
administration. She discovers that testtakers who scored very high on the test as a whole all
responded to item 13 with the same incorrect choice. Accordingly, the test developer 
A. assumes that members of the high-scoring group are making some sort of unintended
interpretation of item 13.
B. plans to interview members of the high-scoring group to understand the basis for their
choice.
C. Both assumes that members of the high-scoring group are making some sort of unintended
interpretation of item 13 and plans to interview members of the high-scoring group to
understand the basis for their choice.
D. should remove item 13 from the test and place in its stead a note that reads: "Go to Item
14."

Accessibility: Keyboard Navigation


 


146. With regard to item-discrimination indices, a d equal to -1 is 
A. a test developer's dream.
B. a test developer's nightmare.
C. a testtaker's dream.
D. an insomniac's nightmare.

Accessibility: Keyboard Navigation
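
As a worked illustration of item 146, the sketch below computes the item-discrimination index d as the difference between the numbers of upper- and lower-group testtakers passing the item, divided by the size of one group; the counts are hypothetical, and a d of -1 means the item works in exactly the wrong direction.

    # Hypothetical tryout results for a single item, with upper- and
    # lower-scoring groups of 25 testtakers each.
    n_per_group = 25
    upper_passing = 0    # no high scorer answered the item correctly
    lower_passing = 25   # every low scorer answered it correctly

    # Item-discrimination index: d = (U - L) / n
    d = (upper_passing - lower_passing) / n_per_group
    print(d)  # -1.0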


 

147. On an item characteristic curve, the steeper the curve, 
A. the more latent the trait is presumed to be.
B. the greater the item reliability.
C. the less the item discrimination.
D. the greater the item discrimination.

Accessibility: Keyboard Navigation


 

148. The Rasch model offers a way to model the probability that 
A. a person on the border of passing and failing a test will actually succeed at a particular
criterion measure.
B. a person with X ability will be able to perform at a level of Y.
C. a test with a standard deviation of X will have a mean score of Y.
D. a test with X reliability will have Y validity.

Accessibility: Keyboard Navigation
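
Items 147 and 148 both rest on the logistic form of an item characteristic curve. The sketch below uses the Rasch (one-parameter) model to give the probability that a person of ability theta answers an item of difficulty b correctly; the ability and difficulty values are hypothetical, and in the two-parameter extension a larger slope parameter makes the curve steeper and the item more discriminating.

    import math

    def rasch_probability(theta, b):
        """Probability of a correct response under the Rasch (1PL) model."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    # A person of average ability (theta = 0) facing items of increasing
    # difficulty (hypothetical values).
    for b in (-1.0, 0.0, 1.0):
        print(b, round(rasch_probability(0.0, b), 2))  # prints roughly 0.73, 0.5, 0.27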


 

149. The reason latent-trait theory is so-named has to do with the presumption that 
A. latent traits exist in males and females to the same degree.
B. whatever the test is measuring is multidimensional in nature.
C. the variable being measured is never directly measurable itself.
D. None of these

Accessibility: Keyboard Navigation


 


150. Test scores measuring latent traits can, in theory at least, take on values ranging from 
A. 0 to infinity.
B. negative infinity to positive infinity.
C. 0 to one million.
D. negative one million to positive one million.

Accessibility: Keyboard Navigation


 

151. On the item characteristic curves for a test of ability, a large number of items biased in
favor of male testtakers is found to coexist with the exact same number of items biased in
favor of female testtakers. Based on these findings, it would be reasonable for the test
developer to claim that the test 
A. measures the same ability in the two groups.
B. is a fair test as any observed bias balances out.
C. demonstrates gender equality for the ability measured.
D. None of these

Accessibility: Keyboard Navigation


 

152. To ensure consistency in scoring, test developers have employed 
A. anchor protocols.
B. resolvers.
C. revolvers.
D. Both anchor protocols and resolvers.

Accessibility: Keyboard Navigation


 

153. Possible applications of IRT were discussed in your textbook. Which of the following is
NOT one of those possible applications? 
A. determining measurement equivalence across testtaker populations
B. identifying a common metric among several tests measuring the same construct
C. evaluating existing tests for the purpose of mapping test revisions
D. developing item banks

Accessibility: Keyboard Navigation


 


154. To increase the precision of a test, test developers may have to 
A. increase the number of items.
B. increase the number of response options.
C. Both increase the number of items and increase the number of response options.
D. None of these

Accessibility: Keyboard Navigation


 

155. When a test is translated from one language in one culture to another language in another
culture, ______ can help ensure that the original test and the translated test are reasonably
equivalent and tapping the same construct. 
A. a translator
B. IRT
C. bilingual people who are experts on the two cultures
D. All of these

Accessibility: Keyboard Navigation


 

156. A test item functions differently in one group of testtakers as compared to another group
of testtakers known to have the same level of an underlying trait. This phenomenon is known
as: 
A. dysfunctional item syndrome.
B. DIF.
C. DIF item difference.
D. DIF item incongruity.

Accessibility: Keyboard Navigation


 

157. Instruments that contain items that function differentially 
A. may have reduced validity.
B. may have inflated reliability.
C. are last to be banked in an item bank.
D. are informally referred to as "DIFFED."

Accessibility: Keyboard Navigation


 


158. The process of DIF analysis entails 
A. scrutinizing item response curves for DIF items.
B. interviewing people from different cultures.
C. administering tests in different ways.
D. Both interviewing people from different cultures and administering tests in different ways.

Accessibility: Keyboard Navigation
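
As a rough illustration of the idea behind DIF screening in items 156 through 158, the sketch below compares pass rates for two groups of testtakers matched on overall score; all data are invented, and operational DIF analyses typically rely on IRT-based or Mantel-Haenszel procedures rather than this simplified tally.

    # Hypothetical records: (group, total-score band, passed the item?)
    responses = [
        ("reference", "high", True), ("reference", "high", True),
        ("focal", "high", True), ("focal", "high", False),
        ("reference", "low", False), ("reference", "low", True),
        ("focal", "low", False), ("focal", "low", False),
    ]

    def pass_rate(group, band):
        matched = [r for r in responses if r[0] == group and r[1] == band]
        return sum(r[2] for r in matched) / len(matched)

    # Within each ability band, a sizeable gap in pass rates flags the item
    # for possible differential functioning.
    for band in ("high", "low"):
        gap = pass_rate("reference", band) - pass_rate("focal", band)
        print(band, round(gap, 2))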


 

159. When testing is conducted by means of a computer within a CAT context, it means that 
A. a testtaker's response to one item may automatically trigger what item will be presented
next.
B. testing may be terminated based on some pre-set number of consecutive item failures.
C. testing may be terminated based on some pre-set, maximum number of items being
administered.
D. All of these

Accessibility: Keyboard Navigation
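
The administration logic described in item 159 can be sketched in a few lines of Python; the item bank, the crude ability updates, and the stopping thresholds below are hypothetical stand-ins for what a real CAT engine would use.

    # Hypothetical item bank: (item_id, difficulty) pairs.
    item_bank = [("i1", -1.0), ("i2", -0.5), ("i3", 0.0), ("i4", 0.5), ("i5", 1.0)]

    MAX_ITEMS = 4              # pre-set maximum number of items administered
    MAX_CONSECUTIVE_FAILS = 2  # pre-set consecutive-failure stopping rule

    def run_cat(answer_item):
        """answer_item(item_id) -> bool; returns the items given and a rough theta."""
        theta, consecutive_fails, administered = 0.0, 0, []
        remaining = list(item_bank)
        while remaining and len(administered) < MAX_ITEMS:
            # Adaptive step: pick the unused item closest to the current estimate.
            item_id, b = min(remaining, key=lambda it: abs(it[1] - theta))
            remaining.remove((item_id, b))
            administered.append(item_id)
            if answer_item(item_id):
                theta += 0.5           # crude upward adjustment after a pass
                consecutive_fails = 0
            else:
                theta -= 0.5           # crude downward adjustment after a failure
                consecutive_fails += 1
                if consecutive_fails >= MAX_CONSECUTIVE_FAILS:
                    break              # terminate on consecutive item failures
        return administered, theta

    # Example: a simulated testtaker who can pass only the two middle items.
    print(run_cat(lambda item_id: item_id in {"i3", "i4"}))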


 

160. As mentioned in the text, CAT is available on a wide array of platforms including 
A. the Internet.
B. X-box.
C. Playstation.
D. All of these

Accessibility: Keyboard Navigation


 


161. Looking at the item-characteristic curve (below), a reasonable conclusion about the performance of the item illustrated would be that

[Figure: item-characteristic curve not reproduced]
A. as theta increases, the probability of a response scored correct increases.
B. as theta decreases, the probability of a response scored correct increases.
C. as theta increases, the probability of a response scored correct decreases.
D. None of these

162. The inspiration to create a new test may come from many varied sources. Thinking of the
illustrative descriptions of inspiration cited in your text, which of the following is NOT a
possible source of inspiration for the creation of a new test? 
A. an emerging social phenomenon suggests the need for a psychological test
B. legislation has been passed ordering the creation of a new psychological test
C. a review of the literature suggests a need for a new psychological test
D. a test developer thinks "there is a need for this sort of test"

Accessibility: Keyboard Navigation


 


163. One of the questions that the developer of a new test must answer is, "How will the test
be administered?" The answer to this question may be 
A. the test will be individually administered.
B. the test will be group administered.
C. the test will be individually or group administered.
D. None of these

Accessibility: Keyboard Navigation


 

164. One of the questions that the developer of a new test must answer is, "Should more than
one form of the test be developed?" In answering this question, a primary consideration is 
A. development costs.
B. test content.
C. test reliability.
D. item discrimination.

Accessibility: Keyboard Navigation


 

165. A good item on a norm-referenced achievement test is an item that 
A. demonstrates that the testtaker has met certain pre-specified criteria.
B. high scorers respond to correctly while low scorers respond to incorrectly.
C. both high and low scorers respond to correctly.
D. low scorers seek clarification regarding the meaning of the question.

Accessibility: Keyboard Navigation


 

166. The development of a criterion-referenced test usually entails 
A. exploratory work with a group of testtakers who have mastered the material.
B. exploratory work with a group of testtakers who have not mastered the material.
C. both exploratory work with a group of testtakers who have mastered the material and
exploratory work with a group of testtakers who have not mastered the material.
D. None of these

Accessibility: Keyboard Navigation


 


167. In the field of psychometrics, pilot work refers to the 
A. job of someone whose responsibility it is to fly an airplane, jet, or space vehicle.
B. preliminary research entailed in finalizing the form of a test.
C. efforts of the lead researcher on a test development team.
D. preliminary research conducted prior to the stage of test construction.

Accessibility: Keyboard Navigation


 

168. A close friend, who is now a beauty school dropout, is heard to complain: "I spent all
night studying ‘Shampoo' for the final examination and there was not a single question on that
subject!" As a budding expert in testing and assessment, you hear that complaint as: 
A. "I have a problem with that test's content validity!"
B. "There was excessive error variance in the test administration procedures!"
C. "The instructor should have paid more attention to the test's construct validity!"
D. "Now I am going to have to reconsider a career as a tanning technician!"

Accessibility: Keyboard Navigation


 

169. If all raw scores on a test are to be converted to scores that range only from 1 to 9, the
resulting scale is referred to as 
A. a unidimensional scale.
B. a stanine scale.
C. a multidimensional scale.
D. None of these.

Accessibility: Keyboard Navigation
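
As a quick numeric illustration of the 1-to-9 conversion in item 169, the sketch below maps hypothetical z-scores onto stanines using the conventional half-standard-deviation bands.

    def stanine(z):
        """Convert a z-score to a stanine (1-9) using 0.5-SD-wide bands
        centered on the mean; stanines 1 and 9 are open-ended."""
        boundaries = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]
        s = 1
        for cut in boundaries:
            if z > cut:
                s += 1
        return s

    for z in (-2.0, -0.3, 0.0, 0.6, 2.3):   # hypothetical z-scores
        print(z, stanine(z))                # 1, 4, 5, 6, 9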


 

170. So-called "smiley face" scales may be used with 
A. young children.
B. adolescents who have limited language skills.
C. adults who have limited language skills.
D. All of these

Accessibility: Keyboard Navigation


 


171. Using the method of paired comparisons yields 
A. nominal level data.
B. ordinal level data.
C. interval level data.
D. ratio level data.

Accessibility: Keyboard Navigation
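
To see why the method of paired comparisons in item 171 yields ordinal-level data, the sketch below tallies invented pairwise preference judgments (only the preferred stimulus of each pair is recorded) into a rank order.

    from collections import Counter

    # Hypothetical judgments: the stimulus preferred in each pairwise comparison.
    judgments = ["B", "B", "A", "C", "B", "A", "C", "B", "A", "C", "B", "C"]
    wins = Counter(judgments)

    # Sorting by the number of wins gives a rank order (ordinal data): we know
    # B ranks above C and C above A, but not by how much on an interval scale.
    ranking = [stimulus for stimulus, _ in wins.most_common()]
    print(ranking)  # ['B', 'C', 'A']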


 

172. Test item writers must keep many considerations in mind. Which of the following is
NOT typically one of those considerations? 
A. Will the test be administered by an instructor or a teaching assistant?
B. Which item format or formats should be employed?
C. How many items should be written in total?
D. What range of content should the items cover?

Accessibility: Keyboard Navigation


 

173. A test developer is designing a standardized test using a multiple-choice format. The
final form of the test will contain 50 items. It would be advisable for the first draft of this test
to contain at least how many items? 
A. 50
B. 100
C. 150
D. 25

Accessibility: Keyboard Navigation


 

174. A test item written in a multiple-choice format has three elements. Which of the
following is NOT one of those elements? 
A. foil
B. stem
C. leaf
D. correct option

Accessibility: Keyboard Navigation


 


175. Consider the following sample True/False item:

"I am going to ace this course in psychological testing and assessment." Circle TRUE or
FALSE according to your own belief.

This item is an example of an item that 
A. is referred to in psychometric parlance as trinitarian in nature.
B. can only be used when a dichotomous choice can be made without qualification.
C. both is referred to in psychometric parlance as trinitarian in nature and can only be used
when a dichotomous choice can be made without qualification.
D. None of these

Accessibility: Keyboard Navigation


 

176. One of the advantages of computerized adaptive testing (CAT) is that 
A. all test items are administered to all testtakers.
B. floor effects are reduced.
C. the ceiling has been removed.
D. the basement has been finished.

Accessibility: Keyboard Navigation


 

177. A test developer has created a pool of 30 items and is ready for a test tryout. At a
minimum, how many subjects should the test be administered to? 
A. 60
B. 120
C. 150
D. 180

Accessibility: Keyboard Navigation


 


178. Test developers have at their disposal a number of statistical tools that may be applied
when selecting items for use on a test. In Chapter 8's Meet an Assessment Professional, Dr.
Scott Birkeland made reference to two such techniques. One was a measure of item
discrimination, and the other was a measure of item 
A. reliability.
B. utility.
C. difficulty.
D. variance.

Accessibility: Keyboard Navigation


 
