Of Small Beauties and Large Beasts
To cite this article: Martin Papenberg & Jochen Musch (2017): Of small beauties and large beasts:
The quality of distractors on multiple-choice tests is more important than their quantity, Applied
Measurement in Education, DOI: 10.1080/08957347.2017.1353987
CONTACT Martin Papenberg, martin.papenberg@uni-duesseldorf.de, 40225 Düsseldorf, Germany.
Abstract
In multiple-choice tests, the quality of distractors may be more important than their number. We therefore examined the joint influence of distractor quality and quantity on test functioning by providing a sample of 5,793 participants with five parallel test sets consisting of items that differed in the number and quality of distractors. Surprisingly, we found that items in which only the one best distractor was presented together with the solution provided the strongest criterion-related evidence of the validity of test scores and thus allowed for the most valid conclusions on the general knowledge level of test takers. Items that included the best distractor produced more reliable test scores irrespective of option number. Increasing the number of options increased item difficulty, but did not increase internal consistency.
In its most common form, an MC item consists of a stem
posing a question along with a set of answer options. One of these answers is correct and
needs to be identified by the test-taker. Incorrect answer options are called distractors. The
MC format allows many examinees to be tested on a wide range of contents in a short amount
of time, and it enables the construction of objectively scored tests that warrant reliable and valid conclusions on the knowledge and ability levels of test-takers (Haladyna, 2004). However, the creation of plausible and functional distractors can be a challenging task (Haladyna & Downing, 1993; Lee & Winke, 2013). Given that a test's ultimate purpose is to discriminate
between test-takers of high versus low ability, distractors should appear plausible to test-
takers with low ability but unattractive to those with better skills (Haladyna, 2004). Given
that writing answer options with sufficient discriminatory power is a difficult and time-consuming task, it is of high practical relevance to know how many answer options are needed.
Psychometrically, it is not mandatory for all items on a test to have the same number
of options (Zoanetti, Beaves, Griffin, & Wallace, 2013). Measurement textbooks have often
suggested that test writers develop at least four or five options for each item (e.g., Owen &
Froman, 1987). This recommendation is based on the expectation that a larger number of
options reduces the influence of guessing and thereby increases the reliability of test scores.
However, empirical studies have suggested that guessing does not affect test scores much
(Ebel, 1968) and that most items do not have more than two or three options that are
frequently chosen (Tarrant & Ware, 2010). Moreover, Ebel (1969) argued that an appreciable
increase in the precision of test scores can be expected only when the number of options is
changed from two to three. Accordingly, several researchers have come to the conclusion
that three options may be optimal for MC testing (e.g., Baghaei & Amrahi, 2011; Edwards,
Arthur, & Bruce, 2012; Haladyna & Downing, 1993; Owen & Froman, 1987; Rodriguez,
2005; Tversky, 1964). Three options are comparably easy to devise.
When writing test items, however, the quality of the distractors may be more important than their number. Moreover, the recommendation to write as many distractors as
feasible rests on the validity of the assumption that any added distractors are functional
(Haladyna & Downing, 1989). However, little is known about the joint influence of the
quality and quantity of distractors on test functioning, and distractor quality may be more
important than distractor quantity. The present study therefore experimentally investigated
the simultaneous influence of distractor quality and quantity on the reliability and validity of
test scores in a test of general knowledge.
We start with a summary of the theoretical and empirical research that has addressed the question of the optimal number of answer options for MC items.
Several theoretical contributions have arrived at the conclusion that three answer options may be optimal for MC testing. For example, Tversky (1964) showed that given a fixed total number of choice alternatives, the use of three alternatives at each of several choice points maximizes the expected information obtained from a sequential test. Ebel (1969) estimated the Kuder-Richardson 21 reliability coefficients (Kuder
& Richardson, 1937) for tests that varied in the number of answer options. He found that two
options yielded the lowest internal consistencies, which were, however, strongly increased by adding a third option. Adding further options did not increase internal consistency by much in his theoretical analysis, which rested on the unrealistic assumption that endorsement rate and discriminability were equal for all options. Lord (1977) employed an
item response theory approach and showed that three options maximized information
collection efficiency for the medium ability range. More options were found to be better in
the low-ability range, but two options were optimal in the high-ability range. Lord (1977)
also assumed that all distractors were chosen with equal probability. However, this is not
necessarily the case (Haladyna & Downing, 1993) and may strongly depend on the quality of
the distractors.
An experimental procedure has sometimes been employed to address the question of
the optimal number of options (Rodriguez, 2005). At least two versions of the same test, only
differing in the number of options, can be provided to different groups of test-takers. The
psychometric properties of test sets are then compared. Most studies of this type have found
that 3-option items tend to be easier than items containing more options but may perform just
as well in terms of item discrimination and the precision of test scores (e.g., Costin, 1970;
Owen & Froman, 1987; cf. Rodriguez, 2005). When a reduction in item discrimination or reliability was found for a reduced number of options, the effect was usually negligible.
Previous studies have focused on the optimal number of options and have usually not
systematically investigated the impact of their quality. To create tests with fewer
options, distractors were either discarded randomly (Baghaei & Amrahi, 2011) or only the
options with the lowest functioning were removed (Edwards et al., 2012; Lee & Winke,
2013; Tarrant & Ware, 2010; Zoanetti et al., 2013). Arguably, deleting dysfunctional
distractors should be less detrimental to an item’s psychometric quality than deleting better
functioning distractors. The only study that systematically varied both the quality and the
quantity of distractors was Budescu and Nevo (1985). Starting with 5-option items, 4-, 3-,
and 2-option item sets that contained either the most attractive or the least attractive
distractors were created. However, this was done to test the proportionality assumption; that
is, the authors investigated whether the total testing time was proportional to the number of
items and the number of options per item. A strong negative relation between the rate of
performance and the number of options was observed: With an increasing number of
response alternatives, less time was spent on each option. Reliability coefficients increased
with an increasing number of options, but this effect was not tested for significance. The joint
effect of option quality and quantity on the validity of test scores was not examined because
no external measure was available to obtain criterion-related evidence of validity.
To measure the functionality of distractors, two data-based criteria can be used. These
are (a) endorsement rate and (b) distractor discrimination. Endorsement rate simply reflects
the proportion of test-takers who choose a distractor so that less frequently endorsed options
can be considered less functional. Distractor discrimination refers to how well a distractor
distinguishes between high- and low-achievers and can be computed as the point-biserial
correlation between distractor choice and total test score. Distractors yielding a high negative correlation can be considered to have good functioning. Distractors with a zero or even positive correlation with the total test score have to be considered dysfunctional because their choice is not indicative of low ability (DiBattista & Kurzawa, 2011).
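As a rough illustration, both criteria can be computed directly from response data. The following Python sketch (function and variable names are ours, not from the study) computes a distractor's endorsement rate and its point-biserial discrimination:

```python
from statistics import mean, pstdev

def distractor_stats(chose_distractor, total_scores):
    """Endorsement rate and point-biserial discrimination of one distractor.

    chose_distractor: 0/1 flags (1 = test-taker picked this distractor)
    total_scores: total test scores, in the same order
    """
    n = len(chose_distractor)
    endorsement = sum(chose_distractor) / n
    y1 = [y for c, y in zip(chose_distractor, total_scores) if c == 1]
    y0 = [y for c, y in zip(chose_distractor, total_scores) if c == 0]
    p, q = len(y1) / n, len(y0) / n
    # point-biserial correlation between distractor choice and total score;
    # a well-functioning distractor yields a clearly negative value
    r_pb = (mean(y1) - mean(y0)) / pstdev(total_scores) * (p * q) ** 0.5
    return endorsement, r_pb

# hypothetical data: the distractor attracts the three weakest test-takers
endorsement, r_pb = distractor_stats([1, 1, 1, 0, 0, 0], [1, 2, 3, 7, 8, 9])
```

With the population standard deviation, this formula is algebraically identical to the Pearson correlation between the 0/1 choice indicator and the total score.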
Following Haladyna and Downing (1993), most researchers have adopted the criterion that a functional distractor should be endorsed by at least 5% of test-takers, in addition to considering distractor discrimination (e.g., Tarrant, Ware, & Mohammed, 2009). Tarrant et al. (2009) investigated
distractor functionality in seven 4-option MC tests that had been administered to nursing
students at an English-language university in Hong Kong. Using the above two criteria, they
found that the average number of distractors with good functioning per item ranged from only
1.4 to 1.7. Similarly, DiBattista and Kurzawa (2011) investigated the functionality of MC
items that had been used in 16 tests at a Canadian university. They found that the number of
functional distractors per item ranged from only 1.1 to 2.6 on their 4- and 5-option tests. In all
studies investigating distractor functionality, items with a full set of functional distractors
have rarely been found, and Haladyna, Downing, and Rodriguez (2002) even surmised that
three might be a natural limit for plausible options in MC items. Test-writers trying hard to
think of additional distractors seem to be in danger of creating large beasts, i.e., MC items that contain many, but not very useful, distractors. We argue that they should instead strive for the creation of small beauties – MC items that contain few but well-chosen distractors.
Is it possible that two options, and thus a single distractor, are already sufficient to create a small beauty? There has been relatively little research on 2-option items even though these items are even easier to construct than 3-option items. In a review of tests based on 2-option
items (i.e., alternate choice items), Downing (1992) considered this item type to be viable for
testing and called for further investigation of 2-option testing. More than 20 years later,
however, research addressing this issue is still scarce. Researchers and practitioners seem to
have been deterred by some early studies that reported that 2-option items performed poorly
in comparison with items containing more options. In particular, Straton and Catts (1980)
reported a much lower reliability coefficient for a 2-option test compared with a
corresponding 4-option test (.47 vs. .68, respectively). However, answer options were deleted randomly in this study, a procedure that was likely to result in a poorly functioning distractor in 2-option items. If the only distractor is implausible and not discriminating, 2-option items cannot be expected to perform well. The discrimination of a distractor can be computed as the point-biserial correlation between distractor choice and total test score (Haladyna & Downing, 1993):
$$ r_{dis} = \frac{y_0 - y_1}{s_y} \sqrt{\frac{n_0 n_1}{n^2}} \qquad (1) $$
where y1 is the mean test score of test-takers correctly identifying the solution, y0 is the mean
score of test-takers choosing the distractor (or one of the distractors in the case of an item
with more than 2 options), and sy is the standard deviation of all test scores. N1 is the number
of test-takers choosing the solution, n0 is the number of test-takers choosing the distractor,
and n is the total number of test-takers. For 2-option items, item discrimination can be
computed by the same formula if y1 and y0 are switched. Hence, item discrimination can be
computed by multiplying the distractor discrimination by (−1):
$$ r_{item} = \frac{y_1 - y_0}{s_y} \sqrt{\frac{n_0 n_1}{n^2}} = -r_{dis} \qquad (2) $$
The discrimination of a 2-option item is hence inversely related to the discrimination of the only distractor, and the quality of this distractor is central to the performance of a 2-option item. We therefore presumed that small beauties – 2-option items that contain a single well-functioning distractor – can perform better than has been suggested previously. Well-functioning distractors are endorsed frequently enough to produce the necessary variance and show a sufficiently high negative correlation with the total test score.
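To make the algebra concrete, a small numeric check (hypothetical responses, not data from the study) confirms that Equation (2) simply flips the sign of Equation (1) for a 2-option item:

```python
from statistics import mean, pstdev

def r_dis(correct, scores):
    """Distractor discrimination for a 2-option item, Equation (1)."""
    n = len(correct)
    y1 = [y for c, y in zip(correct, scores) if c == 1]  # chose the solution
    y0 = [y for c, y in zip(correct, scores) if c == 0]  # chose the distractor
    n1, n0 = len(y1), len(y0)
    return (mean(y0) - mean(y1)) / pstdev(scores) * (n0 * n1 / n**2) ** 0.5

# hypothetical 2-option item: mostly low scorers pick the distractor
correct = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [9, 8, 7, 3, 2, 6, 4, 8]
r_item = -r_dis(correct, scores)  # Equation (2): item discrimination
```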
Six previous studies investigated whether the number of answer options affected
criterion-related evidence of the validity of MC test scores. To this end, validity coefficients
were obtained by correlating test scores with some external criteria (Edwards et al., 2012;
Farhady & Shakery, 2000; Green, Sax, & Michael, 1982; Owen & Froman, 1987; Thanyapa
& Currie, 2014; Trevisan, Sax, & Michael, 1991). None of these studies reported any
meaningful or systematic relation between option number and validity coefficients. However, consider a 4-option test with a validity coefficient of r = .4 and a 3-option test with a validity coefficient of r = .3. To achieve sufficient statistical power and a .8 probability of detecting such a small difference between two Pearson correlation coefficients with an alpha error probability of .05, the two tests would each have to be administered to 953 test-takers (as shown by a power analysis using the software G*Power 3.1; Faul, Erdfelder, Buchner, & Lang, 2009). None of
the previous six studies had a sample size nearly that large. Green et al. (1982) investigated 3-, 4-, and 5-option MC tests. Correlations with course grades – which were determined independently of the performance in the MC tests – were not significantly different for the three tests, but the power to detect potential differences was low because of the small number of participants per condition. Trevisan, Sax, and Michael (1991) analyzed 3-, 4-, and 5-option tests separately for three ability levels. The mean number of participants per condition was 145.
The authors also investigated course grades to obtain independent criterion-related evidence of validity. In the 4-option condition, a correlation between test and course grade of r = .05 was found, and this was significantly better than the validity coefficients in the 3-option condition (r = -.13) and the 5-option condition (r = -.14). However, this result can hardly be interpreted because none of these correlations had any predictive value. Only when all ability levels were combined was a positive predictive value observed.
Owen and Froman (1987) compared the validity coefficients of 3-option and 5-option
tests. Scores from 3-option items correlated slightly higher with a posttest (r = .75) than scores from 5-option items (r = .73). However, the difference between these two validity coefficients was not
tested for significance, and the mean number of participants per condition was only 57.
Farhady and Shakery (2000) investigated the validity coefficients of 3-, 4-, and 5-
option versions of the TOEFL but did not find significant differences between validity
coefficients for any TOEFL subtest. The mean number of participants per condition was 144. Edwards et al. (2012) obtained criterion-related evidence of the validity of 3- and 5-option MC items in two studies. The mean sample sizes per condition were 107 and 205, respectively. Altogether, 10 comparisons were computed between 3- and
5-option validity coefficients, but only one of these comparisons yielded a significant
difference, favoring the 5-option format. Finally, Thanyapa and Currie (2014) investigated
validity coefficients for 3-, 4-, and 5-option MC tests. No significant differences were found,
but the mean sample size per condition was only 51. Although the lack of an effect of option
number on test validity in this and the other five previous studies can and has been regarded
as evidence for the viability of items containing fewer options, it can probably be better
explained by a lack of statistical power. In the six studies mentioned above, the total sample sizes ranged from 114 to 435 and averaged only 110 participants per testing condition. This is equivalent to a statistical power of only .13 to detect a .1 increase of the validity coefficient from .3 to .4.
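The power figure above can be approximated with the standard Fisher z method. The following Python sketch (a textbook normal approximation, not the exact G*Power routine) reproduces the roughly .13 two-sided power at the average sample size of the reviewed studies:

```python
from math import atanh, sqrt, erf

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_correlations(r1, r2, n_per_group):
    """Approximate two-sided power (alpha = .05) for comparing two
    independent Pearson correlations via the Fisher z transformation."""
    q = atanh(r1) - atanh(r2)              # Cohen's q, the effect size
    se = sqrt(2.0 / (n_per_group - 3))     # SE of z1 - z2, equal group sizes
    z_crit = 1.959964                      # two-sided critical value
    shift = abs(q) / se
    return (1 - phi(z_crit - shift)) + phi(-z_crit - shift)

power = power_two_correlations(0.4, 0.3, 110)  # roughly .13
```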
Using a general knowledge test, the present study aimed to determine the joint effects
of distractor quality and quantity on item difficulty and item discrimination. Moreover, we
investigated how distractor quality and quantity affect criterion-related validity coefficients and internal consistency (according to Cronbach, 1951), using a very large sample to ensure sufficient statistical power. We incrementally deleted the one or two best or worst distractors, respectively, to create five parallel test sets that differed in the quality and quantity of distractors. Each test set consisted of the same 30 general knowledge items that varied systematically in the number and quality of distractors. This was achieved by deleting either one or two of the original four answer
options. By deleting (a) the worst distractors or (b) the best distractors from 4-option items,
we obtained two parallel 2-option sets and two parallel 3-option sets that varied in distractor quality. Admittedly, deliberately presenting poor distractors may seem remote from real-life or high-stakes applications of MC tests. However, it is not always easy to think of good distractors, and there is evidence that the quality of distractors written for exams varies widely (e.g., Brozo, Schmelzer, & Spires, 1984; Tarrant & Ware, 2008). Accordingly, it is not
only of theoretical but also of practical relevance to analyze the extent to which poor distractors affect item functioning. For this reason, we decided to investigate the effects of both well-constructed and poorly constructed distractors in our controlled experiment.
Distractor quality was assessed in a pilot study and was based on a composite measure
that combined discrimination and endorsement rate. We expected items containing fewer
options to be easier, if only due to improved chances of guessing the correct solution; and we expected item difficulty to depend on distractor quality. Items containing poor distractors were expected to be easier than items containing better distractors. We also hypothesized that
an item’s psychometric properties would not change much when dysfunctional distractors
were deleted, but we expected that the deletion of good distractors would impair test quality
considerably. The impairment was expected to be particularly large for the 2-option test
because this test was expected to be most sensitive to the quality of the only distractor.
We did not have a clear hypothesis regarding the extent to which validity coefficients
would be affected by a different number of options, but we wanted to address this question
exploratively and with high statistical power. With regard to the quality of the distractors, we
expected that items containing better distractors (i.e., frequently endorsed distractors with high discriminability) would allow for more valid conclusions on the general knowledge level of test-takers. In the following, we describe the experimental procedure, the five experimental conditions, and the three external criteria that were used to obtain criterion-related evidence of validity.
Participants
Participants were recruited from the SoSci Panel, a German online panel that supports
scientific, noncommercial research (Leiner, 2012). Individuals registering on the SoSci Panel
give their consent to receive up to four invitations to scientific online investigations per year.
Anyone can become a registered member by submitting his or her email address. Our survey
invitation was sent to native German speakers who varied in educational background.
Because only the distractors differed between testing conditions, we did not expect that
validity coefficients of different test sets would differ much; and on the basis of a pretest, we
expected validity coefficients of about .3 or .4. To obtain a power of at least .8 to detect
differences between validity coefficients of .3 and .4 at an alpha error level of .05, we aimed
to obtain a group size of at least 953 participants per condition. A total of 6,500 panelists
began and 6,013 finished the survey. The number of dropouts did not differ significantly
between the five testing conditions, χ2(4) = 7.44, p = .11. We had to discard the data of 201
participants who indicated they were not native German speakers. In addition, a preliminary analysis showed that some extreme outliers were present in the processing times of the survey, presumably because some participants took a break or were interrupted while answering the questionnaire. We followed an established procedure for dealing with outliers and excluded the data of 19 participants whose testing times were more than two standard deviations below or above the mean of the respective condition (Ratcliff, 1993). These 19 participants were also excluded from the other analyses, but this did not affect the results. This left a total sample of 5,793
participants (54.6% female) across the five experimental conditions. The mean age in the
sample was 31.77 years (SD = 11.19). Participants were incentivized by a lottery of three
Amazon gift cards (100, 50, and 30 Euro) and by the offer to inform them of their test score
and their performance in comparison with other test-takers after they had finished the test.
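The exclusion rule can be sketched as follows (a generic implementation of the described ±2 SD criterion; function and variable names are ours):

```python
from statistics import mean, stdev

def trim_outliers(times, k=2.0):
    """Flag testing times more than k standard deviations from the
    condition mean (cf. Ratcliff, 1993); returns the retained indices."""
    m, s = mean(times), stdev(times)
    return [i for i, t in enumerate(times) if abs(t - m) <= k * s]

# hypothetical testing times in minutes; the last one reflects a long break
kept = trim_outliers([10, 11, 12, 13, 10, 11, 12, 13, 10, 11, 300])
```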
Materials
Test items from the BOWIT (Bochumer Wissenstest = Bochum Test of General
Knowledge) were used in the present investigation. The BOWIT is a validated and published
general knowledge test in the German language (Hossiep & Schulte, 2008). It covers 11
domains of knowledge and consists of 14 items per domain in two parallel Forms A and B.
We decided to extract items from Form B and selected items from those five domains that we
felt were most typical of the German higher education curriculum. The domains were (a)
Biology/Chemistry, (b) Math/Physics, (c) Language/Literature, (d) Society/Politics, and (e)
History/Archeology. The original BOWIT items contain five options such that Option 5 is
always “none of the above is true.” Haladyna (2004, p. 117) recommended that a “none of the
above” option should not be used when cognitive load for an MC item is low, as is the case
for general knowledge items. We therefore selected 63 items for which Option 5 was not the
solution and deleted the fifth answer option from the selected BOWIT items to obtain the 4-
option items that formed the basis of the present investigation. To determine distractor
functionality, these 63 items were first pretested in an online pilot study (n = 499). This
allowed us to rank order all answer options according to the two criteria used in the main
study. To this end, as measures of the distractors’ performance, we computed (a) their
endorsement rate (p) and (b) their point-biserial correlation with the total test score (r).
Because both of these measures have frequently been used as an index of distractor functionality, we combined them into a composite measure of distractor quality. To this end, we averaged the z-scores of the distractors' endorsement rates and the z-scores of their point-biserial correlations. This allowed us to rank order all distractors and to identify the one or two worst or best distractors for each item. To avoid dropouts by reducing participant burden, only a random subset of 30 BOWIT items was selected from the pretest items and presented in the main study.
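A minimal sketch of such a composite ranking follows. It assumes, as the deletion conditions suggest, that a more negative point-biserial indicates a better distractor, so the discrimination's sign is flipped before z-scoring; the exact scoring direction is our assumption, and the function names are ours:

```python
from statistics import mean, pstdev

def zscores(xs):
    """Standardize a list of values to mean 0 and (population) SD 1."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def rank_distractors(endorsement_rates, discriminations):
    """Order distractor indices from best to worst by the average of the
    z-scored endorsement rate and z-scored (sign-flipped) discrimination."""
    comp = [(ze + zd) / 2 for ze, zd in
            zip(zscores(endorsement_rates),
                zscores([-r for r in discriminations]))]
    return sorted(range(len(comp)), key=lambda i: comp[i], reverse=True)

# hypothetical pilot results for the three distractors of one item
order = rank_distractors([0.30, 0.10, 0.05], [-0.35, -0.10, 0.05])
```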
Design
Participants were randomly assigned to one of five testing conditions. In the baseline condition, participants answered a 4-option test. The deletion of distractors for participants in the other four
conditions was based on the composite measure, which indicated the options that
discriminated best and were endorsed most frequently. The 3-option-worst-deleted test was
created by discarding the worst distractor from each item. The 3-option-best-deleted test was
created by discarding the best distractor from each item. The 2-option-worst-deleted test was
created by removing the two worst distractors from each item; and the 2-option-best-deleted
test was created by removing the two best distractors from each item.
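Given such a ranking, the five versions of each item can be assembled mechanically. A sketch with a hypothetical item (the item content is invented for illustration):

```python
def build_test_sets(solution, distractors_best_to_worst):
    """Five versions of one item; `distractors_best_to_worst` holds the
    three distractors of the 4-option item, best quality first."""
    d = distractors_best_to_worst
    return {
        "4-option": [solution] + d,
        "3-option-worst-deleted": [solution] + d[:2],
        "3-option-best-deleted": [solution] + d[1:],
        "2-option-worst-deleted": [solution, d[0]],
        "2-option-best-deleted": [solution, d[2]],
    }

# hypothetical item; distractor quality order taken from the pilot ranking
sets = build_test_sets("Danube", ["Rhine", "Elbe", "Main"])
```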
Procedure
Items were presented as an online quiz using the software EFS Survey (Version 9.0,
QuestBack GmbH, Germany). At the beginning of the questionnaire, participants were asked
to indicate their age, sex, and educational background, including the total number of years
they had spent in school and in higher education. The total number of years spent in
education was used as a first measure to obtain criterion-related evidence of the validity of
test scores. To obtain another criterion measure, participants were asked to provide a self-
rating of their general knowledge relative to other persons. Thus, respondents were asked to
provide an estimate of the percentage of the population that presumably had a higher degree
of general knowledge than the respondent. Correlations with the BOWIT test score were
negative for this variable as a higher value indicated a lower self-rated level of general
knowledge. Next, participants were asked to answer 10 additional items from the Spiegel
Student Pisa Test (Trepte & Verbeet, 2010) that were not manipulated with regard to the
number and quality of the distractors. These additional items served as the third measure to
obtain criterion-related evidence of validity. Two items from each of the five domains
covered by the Spiegel Pisa Test were presented: (a) Politics, (b) History, (c) Economics, (d)
Culture, and (e) Science. After working on the 10 Spiegel Pisa items, participants were
randomly assigned to one of the five testing conditions that differed in the number and
quality of the distractors for the 30 BOWIT items that were presented in the final phase of the
study. For all MC items presented throughout the study, the item order and position of answer
options were varied randomly. After working on all items, participants were thanked and
debriefed. To provide the respondents with additional feedback, they were informed about
their test score and their performance in comparison with the other test-takers.
Results
For each testing condition, item difficulties, item discriminations, Cronbach’s α, and
correlations with external validation criteria were computed. Total years of education, self-
reported general knowledge, and the Spiegel Pisa test score served as external criteria for general knowledge. Correlations with the BOWIT test score were computed for all BOWIT items. An alpha error level of .05 was applied for all significance tests. For p-values greater than .05, exact values are reported.
ANOVA effect sizes were computed using eta-squared (η2), where η2 ≥ 0.01 can be interpreted as a small effect, η2 ≥ 0.06 a moderate effect, and η2 ≥ 0.14 a large effect (Cohen, 1988). According to Cohen (1988, p. 110), q = z(r1) – z(r2), that is, the difference between the two Fisher z-transformed correlation coefficients, is an appropriate measure of the effect size for the comparison of two correlation coefficients. Therefore, we report Cohen's q as an index of the effect size for the comparison of correlation coefficients. According to Cohen (1988, p. 115), q = .1 constitutes a small, q = .3 a moderate, and q = .5 a large effect.
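Cohen's q is straightforward to compute; a one-line sketch:

```python
from math import atanh

def cohens_q(r1, r2):
    """Cohen's q: difference between the Fisher z-transforms of two rs."""
    return atanh(r1) - atanh(r2)

q = cohens_q(0.4, 0.3)  # about 0.11, a small effect by Cohen's benchmarks
```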
Item Difficulty
Item difficulties (pitem) were determined for all 30 items separately for each testing condition; pitem was computed as the proportion of test-takers solving an item correctly.
Table 2 displays the mean item difficulties for the five testing conditions. Items were most
difficult in the 4-option condition and were generally easier when the number of options was
reduced, particularly when good distractors were removed. The difficulty of the 30 items
varied as a function of the condition, F(4, 116) = 87.72, p < .01, η2 = .75. Bonferroni post hoc
tests showed that there was no significant difference between the difficulties of the 4-option
(.55) and the 3-option-worst-deleted (.57) tests, p = .24, nor was there a significant difference
between the difficulties of the 3-option-best-deleted (.66) and the 2-option-worst-deleted
(.65) tests, p > .99. In these two cases, the better quality of the distractor options compensated
for their lower quantity. All other pairwise comparisons were significant at the p < .01 level.
Thus, items with more options were generally solved less frequently than items with fewer options (4 < 3 < 2), and items with low-quality distractors were solved more frequently than items with high-quality distractors (best-deleted easier than worst-deleted).
Item Discrimination
Point-biserial correlations between item responses and the total test score were computed as measures of item discrimination (ritem). Table 2 displays the mean
item discriminations for the five testing conditions. The highest item discriminations were
found for the 4-option condition (.33), with a slight decrease when the one (.30) or two (.29) worst distractors had been deleted. Discriminations were impaired more when the best distractor had been removed (.24) and were particularly low when the two best distractors had been deleted (.17). To test for differences between conditions, item discriminations were first z-transformed (Fisher, 1925) to account for the skewed distribution of r. Consistent with the descriptive results, item discriminations varied strongly as a function of the condition, F(4, 116) = 60.02, p < .01, η2 = .67. The effect size of this variation was large (η2 = .77) when 4-option items were compared with 3-option-best-deleted and 2-option-best-deleted items, arguably because the best distractor was no longer available in these items. The effect size
was smaller (η2 = .29) when 4-option items were compared with 3-option-worst-deleted and 2-option-worst-deleted items in which the best distractor was still available.
In pairwise comparisons of all five testing conditions, Bonferroni post hoc tests allowed us to identify the conditions that differed significantly from each other.
A nonsignificant difference was found between the 3-option-worst-deleted and the 2-option-
worst-deleted tests (.30 vs. .29, p > .99, q = 0.01). Thus, item discrimination did not improve
if an additional distractor was added after the best distractor had already been included. All
other pairwise comparisons were significant at the p < .01 level. Thus, 4-option items
discriminated significantly better than the items in all other testing conditions. Removing the
worst distractors led only to very minor impairments in item discrimination, and again,
distractor quality was able to compensate for distractor quantity. This was shown by the
superiority of the 2-option-worst-deleted set over the 3-option-best-deleted set (.29 vs. .24, p
< .01, q = 0.06). Thus, a small beauty (a 2-option-item with a high-quality distractor)
discriminated better than a large beast (a 3-option item that included two poor distractors).
Cronbach’s Alpha
Cronbach’s α was computed for each testing condition as an index of the internal
consistency of test scores (see Table 2). Descriptively, the 4-option items offered the highest internal consistency (.82), and α decreased only slightly when the one (.80) or two (.79) worst distractors had been removed. The reliability coefficient was impaired more severely when a good distractor had been removed (.73) and was particularly low when the two best distractors had been removed and only the worst distractor was still present (.61). To test for significance, the procedure was the same as for the item discriminations. With one exception, all pairwise
comparisons were significant at the p < .05 level. A significant difference was found between
the 4-option and the 3-option-worst-deleted tests (.82 vs .80), χ2(1) = 5.48, p < .05.
Significant differences were also found between the 4-option and the 3-option-best-deleted
tests (.82 vs. .73), χ2(1) = 49.61, p < .01; the 4-option and the 2-option-worst-deleted tests
(.82 vs .79), χ2(1) = 9.73, p < .01; the 4-option and the 2-option-best-deleted tests (.82 vs.
.61), χ2(1) = 164.59, p < .01; the 3-option-worst-deleted and the 3-option-best deleted tests
(.80 vs .73), χ2(1) = 22.65, p < .01; the 3-option-worst deleted and the 2-option-best deleted
tests (.80 vs. .61), χ2(1) = 113.28, p < .01; the 3-option-best-deleted and the 2-option worst-
deleted tests (.73 vs .79), χ2(1) = 15.79, p < .01; the 3-option-best-deleted and the 2-option-
best deleted tests (.73 vs. .61), χ2(1) = 34.97, p < .01; and the 2-option-worst-deleted and the
2-option-best-deleted tests (.79 vs. .61), χ2(1) = 97.29, p < .01. The only nonsignificant difference was found between the 3-option-worst-deleted and the 2-option-worst-deleted tests (.80 vs. .79).
ce
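The index of internal consistency itself is straightforward to compute from an item-score matrix. The following is a minimal Python sketch (the function and variable names are illustrative, not from the original analysis); for dichotomously scored MC items, this formula reduces to KR-20 (Kuder & Richardson, 1937).

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_persons, n_items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```

With 0/1-scored responses to the 30 BOWIT items as input, this yields the coefficients reported above for each condition.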
Validity Coefficients
To obtain criterion-related evidence of validity, the test scores were related to several
measures that served as external criteria for general knowledge.
Validity coefficients were obtained by correlating the test score for the 30 BOWIT items with
(a) the Spiegel Pisa test score (rPisa test), (b) self-rated general knowledge (rself-rated knowledge),
and (c) the total number of years a participant had spent in school and in continuing academic
education (reducation) (see Table 2 for all correlations). Descriptively, the 2-option-worst-
deleted test had the highest correlations with all external criteria, whereas the 2-option-best-
deleted test had the lowest correlations for two of the three criterion measures. Correlations in
the other conditions varied nonsystematically and were rather similar to each other.
Pairwise differences between correlations were tested for significance using the procedures
implemented in the R package cocor (Diedenhofen & Musch, 2015). Significant differences
in validity coefficients as measured by correlations with the Spiegel Pisa test score as the
criterion were found between the 4-option and the 2-option-worst-deleted tests (.59 vs. .67, z
= -3.16, p < .01, q = 0.13); the 4-option and the 2-option-best-deleted tests (.59 vs .53, z =
2.00, p < .05, q = 0.08); the 3-option-worst-deleted and the 2-option-worst-deleted tests (.58
vs. .67, z = -3.53, p < .01, q = 0.15); the 3-option-best-deleted and the 2-option-worst-deleted
tests (.60 vs. .67, z = -2.80, p < .01, q = 0.12); the 3-option-best-deleted and the 2-option-
best-deleted tests (.60 vs. .53, z = 2.37, p < .05, q = 0.10); and the 2-option-worst-deleted and
the 2-option-best-deleted tests (.67 vs. .53, z = 5.19, p < .01, q = 0.21). Thus, the 2-option-
worst-deleted set employing only a single but well-chosen distractor achieved a higher
validity coefficient with regard to the Spiegel Pisa test than all sets employing 3-option or 4-
option items, whereas the 2-option-best-deleted set employing a single poor distractor
achieved the lowest validity coefficient. Significant differences in validity coefficients with
regard to self-rated general knowledge were found between the 4-option and the 2-option-best-deleted tests
(-.32 vs. -.23, z = -2.46, p < .05, q = 0.10); the 3-option-worst-deleted and the 2-option-best-
deleted tests (-.31 vs. -.23, z = -2.31, p < .05, q = 0.10); and the 2-option-worst-deleted and
the 2-option best-deleted tests (-.36 vs. -.23, z = -3.49, p < .01, q = 0.14).
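Because test takers were randomly assigned to conditions, these validity coefficients stem from independent samples. The comparisons can be sketched as follows; this is an illustrative Python implementation of Fisher's r-to-z test with Cohen's q as the effect size, not necessarily the exact procedure used by cocor, and the per-condition sample size of roughly 5,793/5 ≈ 1,159 is an assumption.

```python
import math
from statistics import NormalDist

def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided z test for the difference between two correlations from
    independent samples; Cohen's q is the difference of the Fisher-z values."""
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher r-to-z transform
    q = z1 - z2                                  # effect size (Cohen's q)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of z1 - z2
    z = q / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided p value
    return z, p, q
```

For example, comparing .67 with .53 at about 1,159 participants per condition yields z ≈ 5.3 and q ≈ 0.22, close to the reported z = 5.19 and q = 0.21.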
Only with regard to the time spent in school and in higher education as the criterion were no
significant differences in validity coefficients found. The results across all three
measures that were used to obtain criterion-related evidence of validity can thus be
summarized as follows: The quality of distractors was more important than their quantity;
and small beauties (2-option items with a well-chosen distractor) generally performed better
than large beasts (items with a larger number of poorer distractors).
Distractor Performance
Table 2 shows the performance of the distractors in the different testing conditions.
Endorsement rate (pdistractor) and point-biserial
correlations of distractor choice and total test score (rdistractor) are shown separately for each
condition for the best, the second best, and the worst distractors. The total average distractor
discriminations and endorsement rates are also displayed for each test set. The results show
that functional distractors were endorsed more frequently and showed higher discrimination
than distractors that had been classified as less functional. Distractors received more attention
when there were fewer of them: The average distractor endorsement rate increased as the
number of options decreased. This is, of course, to be expected: responses are likely
to be more spread out across options when there are more options. Furthermore, the average
distractor discrimination was more negative (i.e., better) as the number of options decreased.
Table 2 also shows the absolute number of functional distractors per item, as well as their
relative proportion, computed separately for each condition as the ratio of the number of
functional distractors to the total number of distractors per item. For this analysis, a
distractor was considered functional if its
discrimination—computed as its correlation with the total test score—was negative, and if it
was selected by more than 5% of all test-takers. However, no distractor yielded a zero or
positive discrimination index in any testing condition. For each distractor, the choice-total
correlation was negatively signed, indicating that all distractors performed rather well.
Therefore, a distractor could be classified as dysfunctional only if it was selected by fewer
than 5% of the participants. As can be seen in Table 2, the absolute number of functional
distractors was lower but their relative proportion was higher when the number of options
was reduced. Moreover, both the absolute number and the relative proportion of functional
distractors were always higher when the worst rather than the best distractors had been
deleted. This pattern provides a successful manipulation check for the procedure that we
employed after the pretest to identify good versus poor distractors.
Testing Time
The time the participants needed to answer all 30 BOWIT items was recorded. Testing
times differed between conditions, F(4, 5788) = 29.22, p < .01, η2 = .02. Bonferroni post hoc
tests showed that testing time in the 4-option condition was higher than in all other
conditions, p < .01 for all pairwise comparisons. The testing times in the two 3-option
conditions were not different from each other (p > .99), but they were significantly lower than
in the 4-option condition and significantly higher than in the 2-option conditions (p < .05).
The testing times in the two 2-option conditions were not different from each other (p = .13),
but they were significantly lower than in the 3- and 4-option conditions (p < .05). Thus,
testing time decreased as the number of options decreased. Table 2 displays the mean testing
time per item for all five conditions and a correction factor indicating how many more items
could be processed in the respective condition relative to the 4-option full test set condition.
For example, if a factor of 1.34 was computed for the 2-option-worst-deleted condition, this
meant that 134 items of this item type could be answered in the same time that was needed to
answer 100 items of the 4-option type. This correction factor was then entered into the
Spearman-Brown formula to predict the reliability coefficient under the assumption of a fixed
testing time in all conditions, which would have resulted in the presentation of a different
number of items per condition. When the testing time was accounted for, a slightly higher
reliability coefficient (.83) was predicted for small beauties, that is, 2-option-worst-deleted
items, than for full 4-option items (.82) or 3-option-worst-deleted items (.82).
Discussion
The goal of the present study was to investigate the interaction between the quality
and quantity of distractors in MC testing. To this end, five test sets consisting of 30 items
with four, three, or two answer options were compared on item difficulty, discrimination,
internal consistency, and criterion-related evidence of the validity of test scores. In addition
to varying the number of distractors, we also varied distractor quality by deleting either the
best or the worst distractors from the 4-option items in a stepwise fashion.
All previous investigations of the relation between the number of options and the
validity of MC test scores had a combined total sample size of 1,948. The present study had a
sample size of 5,793 and was thus about three times larger than all previous studies
combined. Owing to this very large sample size, we found a clear pattern of results that
allowed us to draw conclusions with high statistical power. Items with fewer options were
generally easier, and deleting good distractors made items easier than deleting poor
distractors. The nonsignificant differences in difficulty between test sets with a different
number of options showed that item difficulty is not solely determined by the number of
options but also strongly depends on the quality of the distractors. Whereas average part-
whole-corrected item discriminations and Cronbach’s alphas were highest for 4-option tests
(.82), the deletion of poor distractors impaired internal consistency only slightly (to .80 and
.79 for the 3-option and 2-option sets, respectively). The internal consistency dropped much
more markedly after the best distractors were deleted (to .73 and .61 for the 3-option and 2-
option sets, respectively). Items for which the best distractor was available performed better
than items for which it was removed, and varying distractor quality was more influential than
varying distractor quantity.
It has repeatedly been reported that items with fewer options do not necessarily
perform worse than items containing more options. Using our large sample, we found that
more options actually do increase item quality if the added distractors are functional.
Importantly, however, all distractors on our test were discriminating; that is, no distractor was
positively related to the total test score. For the 4-option items, 2.4 distractors out of the
maximum of three were functional on average. This is a large number in comparison with
what has been found in other studies, and hence, it indicates that the BOWIT distractors were
rather well constructed.
Although adding even the least functional distractors improved test functioning, this
improvement was only very minor: Adding the second-best distractor to the best one increased
internal consistency by only .01, and adding another distractor led to a further increase of
only .02. A much more important determinant of the precision of test scores and item
discrimination was whether the single best distractor was present or not.
The validity analyses revealed a surprising pattern. Most remarkable was the finding that
distractor quality actually
overcompensated for distractor quantity. Small beauties (2-option items with only a single
but well-chosen distractor) showed higher validity coefficients than large beasts (3- or 4-
option items with poor distractors). Correlations with the three validity criteria were
consistently found to be highest for the 2-option-worst-deleted items, and the correlation with
the Spiegel Pisa general knowledge items was significantly higher for these items than for all
other test sets. Thus, the best test consisted of simple 2-option items that combined the
solution with a single, well-chosen distractor. When the time needed to answer the questions
was also taken into account by computing time-corrected reliability coefficients based on a
Spearman-Brown correction factor, the surprisingly well-performing 2-option-worst-
distractor-deleted items showed not only the highest validity coefficients but also a slightly
higher, but not significantly higher, reliability coefficient.
Some limitations should be considered when interpreting the present results. First, our
experimental investigation of test
functioning was limited to a single test domain. The same manipulation may lead to different
results with a different test. For the present test, we found that item functioning was largely
dependent on the availability of one good distractor. This result may depend on the relative
plausibility of the other distractors. If all distractors are discriminating well, the very best
distractor may be relatively less important for item functioning. However, constructing many
well-functioning distractors has proven to be a difficult task (cf. Haladyna et al. 2002).
Second, criterion-related evidence of validity was strongest for the test in which only the best
distractor was presented together with the
solution. While this finding may be the result of our use of a fully randomized design and a
much larger sample than in all previous studies combined, the result should be replicated to
establish its robustness.
The strongest motivation for investigating the optimal number of answer options in
MC items has always been the attempt to facilitate the work of item writers. Perhaps for this
reason, researchers have long questioned whether the use of as many options as four or five is
really necessary to ensure good test functioning. Our results strongly suggest that three and
even two options can be sufficient to obtain reliable and meaningful test scores as long as the
distractors are functional. Our findings also show that considering only the number of options
is not enough and that distractor quality affects test functioning much more than distractor
quantity. On the basis of our findings, we recommend that test-writers put their effort toward
creating one or two good distractors rather than a larger number of relatively poor distractors.
Even when only a single well-performing distractor was offered, items in the present study
yielded stronger criterion-related evidence of validity and were preferable to items with more
distractors that did not function as well. Hence, taken together, our findings suggest that a
small number of well-functioning distractors is preferable to a larger number of poor ones.
References
Baghaei, F., & Amrahi, N. (2011). The effects of the number of options on the psychometric
Budescu, D. V., & Nevo, B. (1985). Optimal number of options: an investigation of the
Brozo, W. G., Schmelzer, R. V., & Spires, H. A. (1984). A study of testwiseness clues in
college and university teacher-made tests with implications for academic assistance
centers (Technical Report 84-01). Georgia State University: College Reading and
Learning Assistance.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale,
NJ: Erlbaum.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika,
16(3), 297-334.
DiBattista, D., & Kurzawa, L. (2011). Examination of the quality of multiple-choice items on
classroom tests. The Canadian Journal for the Scholarship of Teaching and Learning,
2(2), Article 4.
Diedenhofen, B., & Musch, J. (2016). cocron: a web interface and R package for the statistical
comparison of Cronbach alphas. International Journal of Internet Science, 11(1), 51-60.
Diedenhofen, B., & Musch, J. (2015). cocor: a comprehensive solution for the statistical
comparison of correlations. PLoS ONE, 10(4), e0121945.
Ebel, R. L. (1969). Expected reliability as a function of choices per item. Educational and
Edwards, B. D., Arthur, W., & Bruce, L. L. (2012). The 3-option format for knowledge and
ability multiple-choice tests: a case for why it should be more commonly used in
personnel testing. International Journal of Selection and Assessment, 20(1), 65-81.
Farhady, H., & Shakery, S. (2000). Number of options and economy of multiple-choice tests.
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using
G*Power 3.1: tests for correlation and regression analyses. Behavior Research Methods,
41(4), 1149-1160.
Feldt, L. S., Woodruff, D. J., & Salih, F. A. (1987). Statistical inference for coefficient alpha.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver
and Boyd.
Green, K., Sax, G., & Michael, W. B. (1982). Validity and reliability of tests having differing
Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). New
Hossiep, R., & Schulte, M. (2008). Bochumer Wissenstest (BOWIT). Manual. Göttingen:
Hogrefe.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability.
Psychometrika, 2(3), 151-160.
Lee, H., & Winke, P. (2013). The differences among three-, four-, and 5-option-item formats
Leiner, D. J. (2012). SoSci Panel: the noncommercial online access panel. Poster presented at
the GOR 2012, 6th March, Mannheim. Retrieved
from https://www.soscisurvey.de/panel/download/SoSciPanel.GOR2012.pdf.
Owen, S. V., & Froman, R. D. (1987). What's wrong with 3-option multiple choice items?
Ratcliff, R. (1993). Methods for dealing with reaction time outliers. Psychological bulletin,
114(3), 510-532.
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: a meta-analysis
of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3-13.
Straton, R. G., & Catts, R. M. (1980). A comparison of 2-choice, 3-choice and 4-choice item
Tarrant, M., & Ware, J. (2008). Impact of item-writing flaws in multiple-choice questions on
student achievement in high-stakes nursing assessments. Medical Education, 42(2),
198-206.
Tarrant, M., & Ware, J. (2010). A comparison of the psychometric properties of three- and
four-option multiple-choice questions in nursing assessments. Nurse Education Today,
30(6), 539-543.
Tarrant, M., Ware, J., & Mohammed, A. M. (2009). An assessment of functioning and non-
functioning distractors in multiple-choice questions: a descriptive analysis. BMC
Medical Education, 9, 40.
Thanyapa, I., & Currie, M. (2014). The number of options in multiple choice items in
language tests: does it make any difference? Evidence from Thailand. Language
Trepte, S., & Verbeet, M. (Eds.). (2010). Allgemeinbildung in Deutschland. Erkenntnisse aus
Trevisan, M. S., Sax, G., & Michael, W. B. (1991). The effects of the number of options per
item and student ability on test validity and reliability. Educational and Psychological
Zoanetti, N., Beaves, M., Griffin, P., & Wallace, E. M. (2013). Fixed or mixed: a comparison
Table 1. Overview of the five experimental conditions and the three external criteria for validation.

4-option: No distractor was deleted. The solution and all answer options were presented.
3-option-worst-deleted: For each item, the worst distractor in terms of discrimination rate was deleted.
2-option-worst-deleted: For each item, the two worst distractors in terms of discrimination rate were deleted.
3-option-best-deleted: For each item, the best distractor in terms of discrimination rate was deleted.
2-option-best-deleted: For each item, the two best distractors in terms of discrimination rate were deleted.

Note. To obtain criterion-related evidence of validity, we collected the following external criteria for 5,793 test
takers:
1. Self-rated general knowledge
2. Score in a general knowledge test (Spiegel Student Pisa, Trepte & Verbeet, 2010)
3. Total number of years spent in school and in continuing academic education
All items were taken from the BOWIT (Bochumer Wissenstest = Bochum Test of General Knowledge). Test
takers were assigned randomly to the five conditions.
Table 2. Psychometric Properties of the Different Test Sets

                 4-option   3-option-       2-option-       3-option-      2-option-
                            worst-deleted   worst-deleted   best-deleted   best-deleted
Cronbach's α     .823       .796            .786            .727           .609
rPisa test       .586       .576            .666            .596           .523
reducation       .230       .237            .275            .262           .250
%functional-d    78         93              100             83             87
Note. pitem = average item difficulty for the 30 BOWIT items. ritem = average item discrimination. rPisa test =
correlation of BOWIT test score and Spiegel Pisa test score. rself-rated knowledge = correlation of BOWIT test score
and self-rated general knowledge. reducation = correlation of BOWIT test score and years spent obtaining an
education. pdis1 = average endorsement rate for the best distractor, pdis2 = average endorsement rate for the
second best distractor, pdis3 = average endorsement rate for the worst distractor, pdisAvg = average endorsement
rate for all distractors available for the respective test set. rdis1 = average discrimination of the best distractor,
rdis2 = average discrimination of the second best distractor, rdis3 = average discrimination of the worst distractor,
rdisAvg = average discrimination of all distractors available for the respective test set. nfunctional-d = average number
of functional distractors per item, %functional-d = average proportion of functional distractors per item. Testing
time was measured in seconds. Mtesttime is the average testing time needed to finish all 30 BOWIT items, and
SDtesttime is its standard deviation. MtesttimeItem is the average time that was needed to work on one item. The time
correction factor indicates how many items of the respective item type could be answered in the same time as a
4-option reference item. Time-corrected alpha is a Spearman-Brown-corrected reliability based on the time
correction factor, assuming a fixed testing time for all conditions.