
Applied Measurement in Education

ISSN: 0895-7347 (Print) 1532-4818 (Online) Journal homepage: http://www.tandfonline.com/loi/hame20

Of small beauties and large beasts: The quality of distractors on multiple-choice tests is more important than their quantity

Martin Papenberg & Jochen Musch

To cite this article: Martin Papenberg & Jochen Musch (2017): Of small beauties and large beasts: The quality of distractors on multiple-choice tests is more important than their quantity, Applied Measurement in Education, DOI: 10.1080/08957347.2017.1353987

To link to this article: http://dx.doi.org/10.1080/08957347.2017.1353987

Accepted author version posted online: 25 Jul 2017.



Of Small Beauties and Large Beasts: The Quality of Distractors on Multiple-Choice Tests Is More Important Than Their Quantity

Martin Papenberg and Jochen Musch

Department of Experimental Psychology, University of Düsseldorf

CONTACT Martin Papenberg martin.papenberg@uni-duesseldorf.de Department of Experimental Psychology, University of Düsseldorf, Universitätsstraße 1, Building 23.03, 40225 Düsseldorf, Germany.

us
Abstract

In multiple-choice tests, the quality of distractors may be more important than their number. We therefore examined the joint influence of distractor quality and quantity on test functioning by providing a sample of 5,793 participants with five parallel test sets consisting of items that differed in the number and quality of distractors. Surprisingly, we found that items in which only the one best distractor was presented together with the solution provided the strongest criterion-related evidence of the validity of test scores and thus allowed for the most valid conclusions on the general knowledge level of test takers. Items that included the best distractor produced more reliable test scores irrespective of option number. Increasing the number of options increased item difficulty, but did not increase internal consistency when testing time was controlled for.


The most widely used method for assessing knowledge and cognitive ability is multiple-choice (MC) testing. In its most common form, an MC item consists of a stem posing a question along with a set of answer options. One of these answers is correct and needs to be identified by the test-taker. Incorrect answer options are called distractors. The MC format allows many examinees to be tested on a wide range of contents in a short amount of time, and it enables the construction of objectively scored tests that warrant reliable and valid conclusions on the knowledge and ability levels of test-takers (Haladyna, 2004).

Considerable effort is necessary, however, to devise high-quality MC items. In particular, the creation of plausible and functional distractors can be a challenging task (Haladyna & Downing, 1993; Lee & Winke, 2013). Given that a test's ultimate purpose is to discriminate between test-takers of high versus low ability, distractors should appear plausible to test-takers with low ability but unattractive to those with better skills (Haladyna, 2004). Given that writing answer options with sufficient discriminatory power is a difficult and time-consuming task, it is of high practical relevance to know how many answer options are necessary for producing a high quality multiple-choice test.

Psychometrically, it is not mandatory for all items on a test to have the same number of options (Zoanetti, Beaves, Griffin, & Wallace, 2013). Measurement textbooks have often suggested that test writers develop at least four or five options for each item (e.g., Owen & Froman, 1987). This recommendation is based on the expectation that a larger number of options reduces the influence of guessing and thereby increases the reliability of test scores. However, empirical studies have suggested that guessing does not affect test scores much (Ebel, 1968) and that most items do not have more than two or three options that are frequently chosen (Tarrant & Ware, 2010). Moreover, Ebel (1969) argued that an appreciable increase in the precision of test scores can be expected only when the number of options is changed from two to three. Accordingly, several researchers have come to the conclusion that three options may be optimal for MC testing (e.g., Baghaei & Amrahi, 2011; Edwards, Arthur, & Bruce, 2012; Haladyna & Downing, 1993; Owen & Froman, 1987; Rodriguez, 2005; Tversky, 1964). Three options are comparably easy to devise, and empirically, this number has offered a sufficiently high test quality (Rodriguez, 2005).

When writing test items, however, the quality of the distractors may be more important than their number. Moreover, the recommendation to write as many distractors as feasible rests on the validity of the assumption that any added distractors are functional (Haladyna & Downing, 1989). However, little is known about the joint influence of the quality and quantity of distractors on test functioning. The present study therefore experimentally investigated the simultaneous influence of distractor quality and quantity on the reliability and validity of test scores in a test of general knowledge.

We start with a summary of the theoretical and empirical research that has addressed the question of the optimal number of answer options for MC items. Subsequently, we discuss the influence of the quality of distractors on test functioning.

Several theoretical contributions have arrived at the conclusion that three answer options may be optimal for MC testing. For example, Tversky (1964) showed that given a fixed total number of choice alternatives, the use of three alternatives at each of several successive choice points maximized the discriminability, power, and information of a sequential test. Ebel (1969) estimated the Kuder-Richardson 21 reliability coefficients (Kuder & Richardson, 1937) for tests that varied in the number of answer options. He found that two options yielded the lowest internal consistencies, which were however strongly increased by adding a third option. Adding further options did not increase internal consistency by much in his theoretical analysis, which was however based on the unrealistic assumptions that endorsement rate and discriminability were equal for all options. Lord (1977) employed an item response theory approach and showed that three options maximized information collection efficiency for the medium ability range. More options were found to be better in the low-ability range, but two options were optimal in the high-ability range. Lord (1977) also assumed that all distractors were chosen with equal probability. However, this is not necessarily the case (Haladyna & Downing, 1993) and may strongly depend on the quality of the distractors.
An experimental procedure has sometimes been employed to address the question of the optimal number of options (Rodriguez, 2005). At least two versions of the same test, differing only in the number of options, can be provided to different groups of test-takers. The psychometric properties of the test sets are then compared. Most studies of this type have found that 3-option items tend to be easier than items containing more options but may perform just as well in terms of item discrimination and the precision of test scores (e.g., Costin, 1970; Owen & Froman, 1987; cf. Rodriguez, 2005). When a reduction in item discrimination or reliability was found for a reduced number of options, the effect was usually negligible (Haladyna & Downing, 1989).

Previous studies have focused on the optimal number of options but have usually not systematically investigated the impact of their quality. To create tests with fewer options, distractors were either discarded randomly (Baghaei & Amrahi, 2011), or only the options with the lowest functioning were removed (Edwards et al., 2012; Lee & Winke, 2013; Tarrant & Ware, 2010; Zoanetti et al., 2013). Arguably, deleting dysfunctional distractors should be less detrimental to an item's psychometric quality than deleting better functioning distractors. The only study that systematically varied both the quality and the quantity of distractors was Budescu and Nevo (1985). Starting with 5-option items, 4-, 3-, and 2-option item sets that contained either the most attractive or the least attractive distractors were created. However, this was done to test the proportionality assumption; that is, the authors investigated whether the total testing time was proportional to the number of items and the number of options per item. A strong negative relation between the rate of performance and the number of options was observed: With an increasing number of response alternatives, less time was spent on each option. Reliability coefficients increased with an increasing number of options, but this effect was not tested for significance. The joint effect of option quality and quantity on the validity of test scores was not examined because no external measure was available to obtain criterion-related evidence of validity.

To measure the functionality of distractors, two data-based criteria can be used. These are (a) endorsement rate and (b) distractor discrimination. Endorsement rate simply reflects the proportion of test-takers who choose a distractor, so that less frequently endorsed options can be considered less functional. Distractor discrimination refers to how well a distractor distinguishes between high- and low-achievers and can be computed as the point-biserial correlation between distractor choice and total test score. Distractors yielding a high negative correlation can be considered to have good functioning. Distractors with a zero or even positive correlation with the total test score have to be considered dysfunctional because their choice does not indicate endorsement by a test-taker with low ability (DiBattista & Kurzawa, 2011).

Following Haladyna and Downing (1993), most researchers have adopted the criterion that a distractor is dysfunctional if it is selected by less than 5% of all test-takers. In addition, an option was sometimes considered to be dysfunctional if it had a zero or positive discrimination (e.g., Tarrant, Ware, & Mohammed, 2009). Tarrant et al. (2009) investigated distractor functionality in seven 4-option MC tests that had been administered to nursing students at an English-language university in Hong Kong. Using the above two criteria, they found that the average number of distractors with good functioning per item ranged from only 1.4 to 1.7. Similarly, DiBattista and Kurzawa (2011) investigated the functionality of MC items that had been used in 16 tests at a Canadian university. They found that the number of functional distractors per item ranged from only 1.1 to 2.6 on their 4- and 5-option tests. In all studies investigating distractor functionality, items with a full set of functional distractors have rarely been found, and Haladyna, Downing, and Rodriguez (2002) even surmised that three might be a natural limit for plausible options in MC items. Test-writers trying hard to think of additional distractors seem to be in danger of creating large beasts, i.e., MC items that contain many, but not very useful, distractors. We argue that they should instead strive for the creation of small beauties – MC items that contain few but well-chosen distractors.
Is it possible that 2 options, and thus a single distractor, are already sufficient to create a small beauty? There has been relatively little research on 2-option items even though these items are even easier to construct than 3-option items. In a review of tests based on 2-option items (i.e., alternate-choice items), Downing (1992) considered this item type to be viable for testing and called for further investigation of 2-option testing. More than 20 years later, however, research addressing this issue is still scarce. Researchers and practitioners seem to have been deterred by some early studies that reported that 2-option items performed poorly in comparison with items containing more options. In particular, Straton and Catts (1980) reported a much lower reliability coefficient for a 2-option test compared with a corresponding 4-option test (.47 vs. .68, respectively). However, answer options were deleted randomly in this study, a procedure that was likely to result in a poorly functioning distractor in 2-option items. If the only distractor is implausible and not discriminating, 2-option items cannot perform well. The following calculation illustrates this point.

The discrimination of a distractor may be computed as the point-biserial correlation of distractor choice and total test score (Haladyna & Downing, 1993):

$$ r_{dis} = \frac{\bar{y}_0 - \bar{y}_1}{s_y} \sqrt{\frac{n_0 n_1}{n^2}} \qquad (1) $$

where y1 is the mean test score of test-takers correctly identifying the solution, y0 is the mean score of test-takers choosing the distractor (or one of the distractors in the case of an item with more than 2 options), and sy is the standard deviation of all test scores. n1 is the number of test-takers choosing the solution, n0 is the number of test-takers choosing the distractor, and n is the total number of test-takers. For 2-option items, item discrimination can be computed by the same formula if y1 and y0 are switched. Hence, item discrimination can be computed by multiplying the distractor discrimination by (-1):

$$ r_{item} = \frac{\bar{y}_1 - \bar{y}_0}{s_y} \sqrt{\frac{n_0 n_1}{n^2}} = -r_{dis} \qquad (2) $$

The discrimination of a 2-option item is hence inversely related to the discrimination of the only distractor, and the quality of this distractor is central to the performance of a 2-option item. We therefore presumed that small beauties – 2-option items that contain a single well-functioning distractor – can perform better than has been suggested previously. Well-functioning distractors would however require a sufficiently high endorsement rate to produce the necessary variance, and a sufficiently high negative correlation with the total test score.
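To make Equations (1) and (2) concrete, the following Python sketch (added here for illustration; the function name and the 0/1 coding of distractor choice are assumptions, not part of the original article) computes the discrimination of a distractor from raw response data:

import numpy as np

def distractor_discrimination(chose_distractor, total_score):
    # chose_distractor: 1 if a test-taker chose the distractor, 0 if the solution
    # total_score: total test score of each test-taker
    d = np.asarray(chose_distractor, dtype=float)
    y = np.asarray(total_score, dtype=float)
    n = len(y)
    n0 = d.sum()                      # test-takers choosing the distractor
    n1 = n - n0                       # test-takers choosing the solution
    y0 = y[d == 1].mean()             # mean score of distractor choosers
    y1 = y[d == 0].mean()             # mean score of solution choosers
    s_y = y.std()                     # standard deviation of all test scores
    return (y0 - y1) / s_y * np.sqrt(n0 * n1 / n ** 2)   # Equation (1)

# For a 2-option item, Equation (2) gives the item discrimination as the
# negative of its only distractor's discrimination:
# r_item = -distractor_discrimination(chose_distractor, total_score)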
Six previous studies investigated whether the number of answer options affected criterion-related evidence of the validity of MC test scores. To this end, validity coefficients were obtained by correlating test scores with some external criteria (Edwards et al., 2012; Farhady & Shakery, 2000; Green, Sax, & Michael, 1982; Owen & Froman, 1987; Thanyapa & Currie, 2014; Trevisan, Sax, & Michael, 1991). None of these studies reported any meaningful or systematic relation between option number and validity coefficients. However, to detect differences in criterion-related evidence of validity, it is necessary to compare correlation coefficients, and this requires very large sample sizes. Take, for example, a 4-option test with a validity coefficient of r = .4 and a 3-option test with a validity coefficient of r = .3. To achieve sufficient statistical power and a .8 probability of detecting such a small difference between two Pearson correlation coefficients with an alpha error probability of .05, the two tests would each have to be administered to 953 test-takers (as shown by a power analysis using the software G*Power 3.1; Faul, Erdfelder, Buchner, & Lang, 2009). None of the previous six studies had a sample size nearly that large. Green et al. (1982) investigated 3-, 4-, and 5-option MC tests. Correlations with course grades – which were determined independently of the performance in the MC tests – were not significantly different for the three tests, but the power to detect potential differences was low because the mean number of participants per condition was only 64.
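The required group size can also be approximated without G*Power using the standard Fisher-z approach. The sketch below is an illustration rather than the exact procedure of Faul et al. (2009); it assumes a one-tailed test, under which it reproduces the reported value of 953:

from math import atanh, ceil
from scipy.stats import norm

def n_per_group(r1, r2, alpha=0.05, power=0.80, one_tailed=True):
    # Per-group sample size for detecting a difference between two
    # independent Pearson correlations via Fisher's z transformation.
    q = abs(atanh(r1) - atanh(r2))                  # Cohen's q effect size
    z_alpha = norm.ppf(1 - alpha) if one_tailed else norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_beta) / q) ** 2 + 3)

print(n_per_group(0.4, 0.3))   # 953 test-takers per test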
Trevisan et al. (1991) reported validity coefficients for 3-, 4-, and 5-option tests separately for three ability levels. The mean number of participants per condition was 145. The authors also investigated course grades to obtain independent criterion-related evidence of validity. For low-ability participants, the 4-option test yielded a validity coefficient – computed as the correlation between test score and course grade – of r = .05, and this was significantly better than the validity coefficients in the 3-option condition (r = -.13) and the 5-option condition (r = -.14). However, this result can hardly be interpreted because none of these correlations had any predictive value. Only when looking at all ability levels combined was a positive predictive correlation found. However, differences in validity coefficients as a function of the number of options were not tested for significance in this combined analysis.

Owen and Froman (1987) compared the validity coefficients of 3-option and 5-option tests. Scores from 3-option items correlated slightly higher with a posttest (r = .75) than scores from 5-option items did (r = .73). However, the difference between these two validity coefficients was not tested for significance, and the mean number of participants per condition was only 57.
Farhady and Shakery (2000) investigated the validity coefficients of 3-, 4-, and 5-option versions of the TOEFL but did not find significant differences between validity coefficients for any TOEFL subtest. The mean number of participants per condition was 144.

Edwards et al. (2012) conducted two experiments to investigate criterion-related evidence of the validity of 3- and 5-option MC items. The mean sample sizes per condition were 107 and 205, respectively. Altogether, 10 comparisons were computed between 3- and 5-option validity coefficients, but only one of these comparisons yielded a significant difference, favoring the 5-option format. Finally, Thanyapa and Currie (2014) investigated validity coefficients for 3-, 4-, and 5-option MC tests. No significant differences were found, but the mean sample size per condition was only 51. Although the lack of an effect of option number on test validity in this and the other five previous studies can be and has been regarded as evidence for the viability of items containing fewer options, it can probably be better explained by a lack of statistical power. In the six studies mentioned above, the total sample sizes ranged from 114 to 435 and averaged only 110 participants per testing condition. This is equivalent to a statistical power of only .13 to detect a .1 increase of the validity coefficient from .3 to .4.
Using a general knowledge test, the present study aimed to determine the joint effects of distractor quality and quantity on item difficulty and item discrimination. Moreover, we investigated how distractor quality and quantity affect criterion-related validity coefficients and internal consistency (according to Cronbach, 1951), using a very large sample to ensure sufficient power to detect potential effects. Starting with 4-option MC items, we incrementally deleted the one or two best or worst distractors, respectively, to create five parallel test sets that differed in the quality and quantity of distractors. Each test set consisted of the same 30 general knowledge items, which varied systematically in the number and quality of distractors. This was achieved by deleting either one or two of the original four answer options. By deleting (a) the worst distractors or (b) the best distractors from 4-option items, we obtained two parallel 2-option sets and two parallel 3-option sets that varied in distractor quality. The deletion of well-functioning distractors is of course highly inappropriate in real-life or high-stakes applications of MC tests. However, it is not always easy to think of good distractors, and there is evidence that the quality of distractors written for exams varies widely (e.g., Brozo, Schmelzer, & Spires, 1984; Tarrant & Ware, 2008). Accordingly, it is not only of theoretical but also of practical relevance to analyze the extent to which poor distractors affect item functioning. For this reason, we decided to investigate the effects of both well and poorly constructed distractors in our controlled experiment.
Distractor quality was assessed in a pilot study and was based on a composite measure that combined discrimination and endorsement rate. We expected items containing fewer options to be easier, if only due to improved chances of guessing the correct solution; and we expected item difficulty to depend on distractor quality. Items containing poor distractors were expected to be easier than items containing better distractors. We also hypothesized that an item's psychometric properties would not change much when dysfunctional distractors were deleted, but we expected that the deletion of good distractors would impair test quality considerably. The impairment was expected to be particularly large for the 2-option test because this test was expected to be most sensitive to the quality of the only distractor.

We did not have a clear hypothesis regarding the extent to which validity coefficients would be affected by a different number of options, but we wanted to address this question in an exploratory fashion and with high statistical power. With regard to the quality of the distractors, we expected that items containing better distractors (i.e., frequently endorsed distractors with high discriminability) would allow for more valid conclusions on the general knowledge level of participants than items containing only distractors of low quality.


Method

Table 1 illustrates the methods of the present investigation by describing the experimental procedure, the five experimental conditions, and the three external criteria that were assessed to obtain criterion-related evidence of validity.

Participants

Participants were recruited from the SoSci Panel, a German online panel that supports scientific, noncommercial research (Leiner, 2012). Individuals registering on the SoSci Panel give their consent to receive up to four invitations to scientific online investigations per year. Anyone can become a registered member by submitting his or her email address. Our survey invitation was sent to native German speakers who varied in educational background. Because only the distractors differed between testing conditions, we did not expect that validity coefficients of different test sets would differ much; and on the basis of a pretest, we expected validity coefficients of about .3 or .4. To obtain a power of at least .8 to detect differences between validity coefficients of .3 and .4 at an alpha error level of .05, we aimed to obtain a group size of at least 953 participants per condition. A total of 6,500 panelists began and 6,013 finished the survey. The number of dropouts did not differ significantly between the five testing conditions, χ²(4) = 7.44, p = .11. We had to discard the data of 201 participants who indicated they were not native German speakers. In addition, a preliminary analysis showed that some extreme outliers were present in the processing times of the survey, presumably because some participants took a break or were interrupted while answering the questions. To allow for a meaningful comparison of testing times across conditions, we followed an established procedure for dealing with outliers and excluded the data of 19 participants whose testing times were more than two standard deviations below or above the mean of the respective condition (Ratcliff, 1993). These 19 participants were also excluded from the other analyses, but this did not affect the results. This left a total sample of 5,793 participants (54.6% female) across the five experimental conditions. The mean age in the sample was 31.77 years (SD = 11.19). Participants were incentivized by a lottery of three Amazon gift cards (100, 50, and 30 Euro) and by the offer to inform them of their test score and their performance in comparison with other test-takers after they had finished the test.
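A minimal sketch of the outlier rule described above (illustrative only; the function and variable names are assumptions):

import numpy as np

def keep_after_time_trimming(testing_time, condition):
    # Retain participants whose total testing time lies within two standard
    # deviations of the mean of their own condition (cf. Ratcliff, 1993).
    t = np.asarray(testing_time, dtype=float)
    c = np.asarray(condition)
    keep = np.ones(len(t), dtype=bool)
    for cond in np.unique(c):
        mask = c == cond
        m, sd = t[mask].mean(), t[mask].std()
        keep[mask] = np.abs(t[mask] - m) <= 2 * sd
    return keep   # boolean mask of participants to retain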

Materials

Test items from the BOWIT (Bochumer Wissenstest = Bochum Test of General Knowledge) were used in the present investigation. The BOWIT is a validated and published general knowledge test in the German language (Hossiep & Schulte, 2008). It covers 11 domains of knowledge and consists of 14 items per domain in two parallel Forms A and B. We decided to extract items from Form B and selected items from those five domains that we felt were most typical of the German higher education curriculum. The domains were (a) Biology/Chemistry, (b) Math/Physics, (c) Language/Literature, (d) Society/Politics, and (e) History/Archeology. The original BOWIT items contain five options such that Option 5 is always "none of the above is true." Haladyna (2004, p. 117) recommended that a "none of the above" option should not be used when cognitive load for an MC item is low, as is the case for general knowledge items. We therefore selected 63 items for which Option 5 was not the solution and deleted the fifth answer option from the selected BOWIT items to obtain the 4-option items that formed the basis of the present investigation. To determine distractor functionality, these 63 items were first pretested in an online pilot study (n = 499). This allowed us to rank order all answer options according to the two criteria used in the main study. To this end, as measures of the distractors' performance, we computed (a) their endorsement rate (p) and (b) their point-biserial correlation with the total test score (r). Because both of these measures have frequently been used as an index of distractor functionality, we combined them into a composite measure of distractor performance. To this end, we averaged the z-scores of the distractors' endorsement rates and the z-scores of their point-biserial correlations. This allowed us to rank order all distractors and to identify the one or two worst or best distractors for each item. To reduce participant burden and avoid dropouts, only a random subset of 30 BOWIT items was selected from the pretest items, and these formed the final test for the main study.
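As an illustration of the composite measure, a sketch is given below. The sign flip of the discrimination z-scores is an assumption made here so that larger composite values indicate better distractors, because better distractors have more negative choice-total correlations:

import numpy as np

def composite_distractor_quality(endorsement_rate, discrimination):
    # Average of z-standardized endorsement rates and z-standardized,
    # sign-flipped point-biserial discriminations across all distractors.
    p = np.asarray(endorsement_rate, dtype=float)
    r = np.asarray(discrimination, dtype=float)

    def z(x):
        return (x - x.mean()) / x.std()

    return (z(p) + z(-r)) / 2   # larger values = better distractors

# Sorting an item's pretest distractors by this composite identifies its
# best and worst distractors.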

Design

Five testing conditions were implemented. Participants in the first condition answered a 4-option test. The deletion of distractors for participants in the other four conditions was based on the composite measure, which indicated the options that discriminated best and were endorsed most frequently. The 3-option-worst-deleted test was created by discarding the worst distractor from each item. The 3-option-best-deleted test was created by discarding the best distractor from each item. The 2-option-worst-deleted test was created by removing the two worst distractors from each item; and the 2-option-best-deleted test was created by removing the two best distractors from each item.

Procedure

Items were presented as an online quiz using the software EFS Survey (Version 9.0, QuestBack GmbH, Germany). At the beginning of the questionnaire, participants were asked to indicate their age, sex, and educational background, including the total number of years they had spent in school and in higher education. The total number of years spent in education was used as a first measure to obtain criterion-related evidence of the validity of test scores. To obtain another criterion measure, participants were asked to provide a self-rating of their general knowledge relative to other persons. Thus, respondents were asked to provide an estimate of the percentage of the population that presumably had a higher degree of general knowledge than the respondent. Correlations with the BOWIT test score were negative for this variable, as a higher value indicated a lower self-rated level of general knowledge. Next, participants were asked to answer 10 additional items from the Spiegel Student Pisa Test (Trepte & Verbeet, 2010) that were not manipulated with regard to the number and quality of the distractors. These additional items served as the third measure to obtain criterion-related evidence of validity. Two items from each of the five domains covered by the Spiegel Pisa Test were presented: (a) Politics, (b) History, (c) Economics, (d) Culture, and (e) Science. After working on the 10 Spiegel Pisa items, participants were randomly assigned to one of the five testing conditions that differed in the number and quality of the distractors for the 30 BOWIT items that were presented in the final phase of the study. For all MC items presented throughout the study, the item order and position of answer options were varied randomly. After working on all items, participants were thanked and debriefed. To provide the respondents with additional feedback, they were informed about their test score and their performance in comparison with the other test-takers.
Results

For each testing condition, item difficulties, item discriminations, Cronbach's α, and correlations with external validation criteria were computed. Total years of education, self-reported general knowledge, and the Spiegel Pisa test score served as external criteria for determining validity coefficients. As measures of distractor performance and as a check of the experimental manipulation, distractor endorsement rates and point-biserial correlations with the BOWIT test score were computed for all BOWIT items. An alpha error level of .05 was applied for all significance tests. For p-values greater than .05, exact values are reported.

ANOVA effect sizes were computed using eta-squared (η²), which can be interpreted as the proportion of the variance explained by an independent variable. η² ≥ 0.01 implies a small effect, η² ≥ 0.06 a moderate effect, and η² ≥ 0.14 a large effect (Cohen, 1988). According to Cohen (1988, p. 110), q = z(r1) – z(r2), that is, the difference between the two Fisher-z-transformed correlations, is the appropriate effect size to display the difference between two correlation coefficients. Therefore, we report Cohen's q as an index of the effect size for the comparison of correlation coefficients. According to Cohen (1988, p. 115), q = .1 implies a small effect, q = .3 a moderate effect, and q = .5 a large effect.
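Cohen's q is simply the difference between Fisher-z-transformed correlations; a one-line sketch added here for illustration:

from math import atanh

def cohens_q(r1, r2):
    # Difference between two Fisher-z-transformed correlations (Cohen, 1988)
    return atanh(r1) - atanh(r2)

print(round(cohens_q(0.4, 0.3), 2))   # 0.11, a small effect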

Item Difficulty

Item difficulties (pitem) were determined for all 30 items separately for each testing condition. pitem was computed as the proportion of test-takers solving an item correctly. Table 2 displays the mean item difficulties for the five testing conditions. Items were most difficult in the 4-option condition and were generally easier when the number of options was reduced, particularly when good distractors were removed. The difficulty of the 30 items varied as a function of the condition, F(4, 116) = 87.72, p < .01, η² = .75. Bonferroni post hoc tests showed that there was no significant difference between the difficulties of the 4-option (.55) and the 3-option-worst-deleted (.57) tests, p = .24, nor was there a significant difference between the difficulties of the 3-option-best-deleted (.66) and the 2-option-worst-deleted (.65) tests, p > .99. In these two cases, the better quality of the distractor options compensated for their lower quantity. All other pairwise comparisons were significant at the p < .01 level. Thus, items with more options were generally solved less frequently than items with fewer options (4 < 3 < 2), and items with low-quality distractors were solved more frequently than items with high-quality distractors (3-best-deleted > 3-worst-deleted; 2-best-deleted > 2-worst-deleted).
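A one-line sketch of the difficulty computation (illustrative; it assumes a 0/1 response matrix with test-takers in rows and items in columns):

import numpy as np

def item_difficulties(responses):
    # p_item: proportion of test-takers solving each item correctly
    return np.asarray(responses, dtype=float).mean(axis=0)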

Item Discrimination

Part-whole-corrected point-biserial correlations between solution choice and total test score were computed as measures of item discrimination (ritem). Table 2 displays the mean item discriminations for the five testing conditions. The highest item discriminations were found for the 4-option condition (.33), with a slight decrease when the one (.30) or two (.29) worst distractors had been deleted. Discriminations were impaired more when the best distractor had been removed (.24) and were particularly poor when the two best distractors had been deleted (.17). To test for differences between conditions, item discriminations were first z-transformed (Fisher, 1925) to account for the skewed distribution of r. Consistent with the descriptive results, item discriminations varied strongly as a function of the condition, F(4, 116) = 60.02, p < .01, η² = .67. The effect size of this variation was large (η² = .77) when 4-option items were compared with 3-option-best-deleted and 2-option-best-deleted items, arguably because the best distractor was no longer available in these items. The effect size was smaller (η² = .29) when 4-option items were compared with 3-option-worst-deleted and 2-option-worst-deleted items, in which the best distractor was still available.

In pairwise comparisons of all five testing conditions, Bonferroni post hoc tests allowed us to identify the conditions that differed significantly from each other. A nonsignificant difference was found between the 3-option-worst-deleted and the 2-option-worst-deleted tests (.30 vs. .29, p > .99, q = 0.01). Thus, item discrimination did not improve if an additional distractor was added after the best distractor had already been included. All other pairwise comparisons were significant at the p < .01 level. Thus, 4-option items discriminated significantly better than the items in all other testing conditions. Removing the worst distractors led only to very minor impairments in item discrimination, and again, distractor quality was able to compensate for distractor quantity. This was shown by the superiority of the 2-option-worst-deleted set over the 3-option-best-deleted set (.29 vs. .24, p < .01, q = 0.06). Thus, a small beauty (a 2-option item with a high-quality distractor) discriminated better than a large beast (a 3-option item that included two poor distractors).
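The part-whole-corrected discriminations can be computed as follows (an illustrative sketch, assuming the same 0/1 response matrix as above):

import numpy as np

def corrected_item_total(responses):
    # Point-biserial correlation of each item with the total score of the
    # remaining items (part-whole correction).
    X = np.asarray(responses, dtype=float)
    total = X.sum(axis=1)
    r = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        rest = total - X[:, j]                   # total score without item j
        r[j] = np.corrcoef(X[:, j], rest)[0, 1]
    return r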

Cronbach's Alpha

Cronbach's α was computed for each testing condition as an index of the internal consistency of test scores (see Table 2). Descriptively, the 4-option items offered the highest internal consistency (.82), and α decreased only slightly when the one (.80) or two (.79) worst distractors had been removed. The reliability coefficient was impaired more severely when a good distractor had been removed (.73) and was particularly low when the two best distractors had been removed and only the worst distractor was still present (.61). To test for differences in Cronbach's α, pairwise comparisons were conducted by employing Feldt, Woodruff, and Salih's (1987) procedure as implemented on http://comparingcronbachalphas.org/ (Diedenhofen & Musch, 2016). The pattern of significance was the same as for the item discriminations. With one exception, all pairwise comparisons were significant at the p < .05 level. A significant difference was found between the 4-option and the 3-option-worst-deleted tests (.82 vs. .80), χ²(1) = 5.48, p < .05. Significant differences were also found between the 4-option and the 3-option-best-deleted tests (.82 vs. .73), χ²(1) = 49.61, p < .01; the 4-option and the 2-option-worst-deleted tests (.82 vs. .79), χ²(1) = 9.73, p < .01; the 4-option and the 2-option-best-deleted tests (.82 vs. .61), χ²(1) = 164.59, p < .01; the 3-option-worst-deleted and the 3-option-best-deleted tests (.80 vs. .73), χ²(1) = 22.65, p < .01; the 3-option-worst-deleted and the 2-option-best-deleted tests (.80 vs. .61), χ²(1) = 113.28, p < .01; the 3-option-best-deleted and the 2-option-worst-deleted tests (.73 vs. .79), χ²(1) = 15.79, p < .01; the 3-option-best-deleted and the 2-option-best-deleted tests (.73 vs. .61), χ²(1) = 34.97, p < .01; and the 2-option-worst-deleted and the 2-option-best-deleted tests (.79 vs. .61), χ²(1) = 97.29, p < .01. The only nonsignificant difference was found between the 3-option-worst-deleted and the 2-option-worst-deleted tests (.80 vs. .79), χ²(1) = 0.62, p = .43.
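For reference, Cronbach's α for a test set can be computed as follows (a sketch of the coefficient itself, not of the Feldt-based comparison procedure):

import numpy as np

def cronbach_alpha(responses):
    # Cronbach's alpha (Cronbach, 1951) for an (n_testtakers x n_items)
    # matrix of 0/1 item scores.
    X = np.asarray(responses, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1)
    total_variance = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)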

Validity Coefficients

To obtain criterion-related evidence of validity, Pearson product-moment correlations were computed with several measures that served as external criteria for general knowledge. Validity coefficients were obtained by correlating the test score for the 30 BOWIT items with (a) the Spiegel Pisa test score (rPisa test), (b) self-rated general knowledge (rself-rated knowledge), and (c) the total number of years a participant had spent in school and in continuing academic education (reducation) (see Table 2 for all correlations). Descriptively, the 2-option-worst-deleted test had the highest correlations with all external criteria, whereas the 2-option-best-deleted test had the lowest correlations for two of the three criterion measures. Correlations in the other conditions varied nonsystematically and were rather similar to each other.

Differences between correlations were investigated using Fisher's z-test (1925) as implemented in the R package cocor (Diedenhofen & Musch, 2015). Significant differences in validity coefficients as measured by correlations with the Spiegel Pisa test score as the criterion were found between the 4-option and the 2-option-worst-deleted tests (.59 vs. .67, z = -3.16, p < .01, q = 0.13); the 4-option and the 2-option-best-deleted tests (.59 vs. .53, z = 2.00, p < .05, q = 0.08); the 3-option-worst-deleted and the 2-option-worst-deleted tests (.58 vs. .67, z = -3.53, p < .01, q = 0.15); the 3-option-best-deleted and the 2-option-worst-deleted tests (.60 vs. .67, z = -2.80, p < .01, q = 0.12); the 3-option-best-deleted and the 2-option-best-deleted tests (.60 vs. .53, z = 2.37, p < .05, q = 0.10); and the 2-option-worst-deleted and the 2-option-best-deleted tests (.67 vs. .53, z = 5.19, p < .01, q = 0.21). Thus, the 2-option-worst-deleted set employing only a single but well-chosen distractor achieved a higher validity coefficient with regard to the Spiegel Pisa test than all sets employing 3-option or 4-option items, whereas the 2-option-best-deleted set employing a single poor distractor showed the least evidence of validity.

Significant differences in validity coefficients as measured by correlations with self-rated general knowledge were found between the 4-option and the 2-option-best-deleted tests (-.32 vs. -.23, z = -2.46, p < .05, q = 0.10); the 3-option-worst-deleted and the 2-option-best-deleted tests (-.31 vs. -.23, z = -2.31, p < .05, q = 0.10); and the 2-option-worst-deleted and the 2-option-best-deleted tests (-.36 vs. -.23, z = -3.49, p < .01, q = 0.14).

With regard to the time spent in school and in higher education as the criterion, no significant differences in validity coefficients were found. The results across all three measures that were used to obtain criterion-related evidence of validity can thus be summarized as follows: The quality of distractors was more important than their quantity; and small beauties (2-option items with a well-chosen distractor) generally performed better than large beasts (3- or 4-option items with poor distractors).
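The correlation comparisons reported above follow Fisher's z-test for independent samples; a minimal sketch is given below (illustrative, using the rounded values from Table 2, so the result only approximates the reported statistics):

from math import atanh, sqrt
from scipy.stats import norm

def fisher_z_test(r1, n1, r2, n2):
    # Fisher's z-test for the difference between two correlations obtained
    # in independent samples.
    z = (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value
    return z, p

# 4-option vs. 2-option-worst-deleted correlations with the Spiegel Pisa score:
print(fisher_z_test(0.586, 1142, 0.666, 1162))   # z close to the reported -3.16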

Distractor Performance

As a check of the success of our manipulations, Table 2 displays the performance of distractors in the different testing conditions. Endorsement rates (pdistractor) and point-biserial correlations of distractor choice and total test score (rdistractor) are shown separately for each condition for the best, the second best, and the worst distractors. The average distractor discriminations and endorsement rates are also displayed for each test set. The results show that functional distractors were endorsed more frequently and showed higher discrimination than distractors that had been classified as less functional. Distractors received more attention when there were fewer of them: The average distractor endorsement rate increased as the number of options decreased. This is of course what had to be expected – responses are likely to be more spread out across options when there are more options. Furthermore, the average distractor discrimination was more negative (i.e., better) as the number of options decreased.

Table 2 also shows the absolute number of functional distractors per item as well as their relative proportion. This relative proportion was computed as the ratio of the number of functional distractors to the total number of distractors per item, separately for each condition. For this analysis, a distractor was considered functional if its discrimination – computed as its correlation with the total test score – was negative, and if it was selected by more than 5% of all test-takers. However, no distractor yielded a zero or positive discrimination index in any testing condition. For each distractor, the choice-total correlation was negatively signed, indicating that all distractors performed rather well. Therefore, distractors only had to be selected by less than 5% of the participants in order to be classified as dysfunctional. As can be seen in Table 2, the absolute number of functional distractors was lower but their relative proportion was higher when the number of options was reduced, and the relative proportion of functional distractors was lower after distractor options that performed well had been deleted.

To summarize, we found that the average endorsement rate, discrimination, and relative proportion of functional distractors were always higher when the worst rather than the best distractors had been deleted. This pattern provides a successful manipulation check for the procedure that we employed after the pretest to identify good versus poor distractors.
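The functionality criterion used here can be expressed as a simple rule (an illustrative sketch, not part of the original article):

import numpy as np

def is_functional(endorsement_rate, discrimination, min_rate=0.05):
    # A distractor is functional if its choice-total correlation is negative
    # and it is endorsed by more than 5% of test-takers
    # (cf. Haladyna & Downing, 1993; Tarrant, Ware, & Mohammed, 2009).
    return (np.asarray(discrimination) < 0) & (np.asarray(endorsement_rate) > min_rate)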
Testing Time

The time the participants needed to answer all 30 BOWIT items was recorded. Testing times differed between conditions, F(4, 5788) = 29.22, p < .01, η² = .02. Bonferroni post hoc tests showed that testing time in the 4-option condition was higher than in all other conditions, p < .01 for all pairwise comparisons. The testing times in the two 3-option conditions did not differ from each other (p > .99), but they were significantly lower than in the 4-option condition and significantly higher than in the 2-option conditions (p < .05). The testing times in the two 2-option conditions did not differ from each other (p = .13), but they were significantly lower than in the 3- and 4-option conditions (p < .05). Thus, testing time decreased as the number of options decreased. Table 2 displays the mean testing time per item for all five conditions and a correction factor indicating how many more items could be processed in the respective condition relative to the 4-option full test set condition. For example, the factor of 1.34 computed for the 2-option-worst-deleted condition means that 134 items of this item type could be answered in the same time that was needed to answer 100 items of the 4-option type. This correction factor was then entered into the Spearman-Brown formula to predict the reliability coefficient assuming a fixed testing time in all conditions, which, however, would have resulted in the presentation of a different number of items per condition. When the testing time was accounted for, a slightly higher reliability coefficient (.83) was predicted for small beauties, that is, 2-option-worst-deleted items, than for full 4-option items (.82) or 3-option-worst-deleted items (.82).
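The time correction amounts to a Spearman-Brown prophecy with the correction factor as the lengthening factor; a minimal sketch using the values from Table 2:

def spearman_brown(alpha, k):
    # Predicted reliability when a test is lengthened by the factor k
    # (here, the time correction factor from Table 2).
    return k * alpha / (1 + (k - 1) * alpha)

# Time-corrected alpha for the 2-option-worst-deleted set:
print(round(spearman_brown(0.786, 1.344), 3))   # 0.832, as reported in Table 2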

Discussion

The goal of the present study was to investigate the interaction between the quality and quantity of distractors in MC testing. To this end, five test sets consisting of 30 items with four, three, or two answer options were compared on item difficulty, discrimination, internal consistency, and criterion-related evidence of the validity of test scores. In addition to varying the number of distractors, we also varied distractor quality by deleting either the best or the worst distractors from the 4-option items in a stepwise fashion.

All previous investigations of the relation between the number of options and the validity of MC test scores had a combined total sample size of 1,948. The present study had a sample size of 5,793 and was thus about three times larger than all previous studies combined. Owing to this very large sample size, we found a clear pattern of results that allowed us to draw conclusions with high statistical power. Items with fewer options were generally easier, and deleting good distractors made items easier than deleting poor distractors. The nonsignificant differences in difficulty between test sets with a different number of options showed that item difficulty is not solely determined by the number of options but also strongly depends on the quality of the distractors. Whereas average part-whole-corrected item discriminations and Cronbach's alphas were highest for 4-option tests (.82), the deletion of poor distractors impaired internal consistency only slightly (to .80 and .79 for the 3-option and 2-option sets, respectively). The internal consistency dropped much more markedly after the best distractors were deleted (to .73 and .61 for the 3-option and 2-option sets, respectively). Items for which the best distractor was available performed better than items for which it was removed, and varying distractor quality was more influential than varying distractor quantity.

It has repeatedly been reported that items with fewer options do not necessarily perform worse than items containing more options. Using our large sample, we found that more options actually do increase item quality if the added distractors are functional. Importantly, however, all distractors on our test were discriminating; that is, no distractor was positively related to the total test score. For the 4-option items, 2.4 distractors out of the maximum of three were functional on average. This is a large number in comparison with what has been found in other studies, and hence, it indicates that the BOWIT distractors were rather well constructed.

Although adding even the least functional distractors improved test functioning, this improvement was only very minor: Adding the second-best distractor to the best distractor increased internal consistency by only .01, and adding another distractor led to a further increase of only .02. A much more important determinant of the precision of test scores and item discrimination was whether the single best distractor was present or not.

Our investigation of criterion-related evidence of validity yielded an interesting and surprising pattern. Most remarkable was the finding that distractor quality actually overcompensated for distractor quantity. Small beauties (2-option items with only a single but well-chosen distractor) showed higher validity coefficients than large beasts (3- or 4-option items with poor distractors). Correlations with the three validity criteria were consistently found to be highest for the 2-option-worst-deleted items, and the correlation with the Spiegel Pisa general knowledge items was significantly higher for these items than for all other test sets. Thus, the best test consisted of simple 2-option items that combined the solution with a single, well-chosen distractor. When the time needed to answer the questions was also taken into account by computing time-corrected reliability coefficients based on a Spearman-Brown correction factor, the surprisingly well-performing 2-option-worst-distractor-deleted items showed not only the highest validity coefficients but also a slightly, though not significantly, higher reliability coefficient.

Some limitations should be considered when interpreting the present results. First, our experimental investigation of test functioning was limited to a single test domain. The same manipulation may lead to different results with a different test. For the present test, we found that item functioning was largely dependent on the availability of one good distractor. This result may depend on the relative plausibility of the other distractors. If all distractors are discriminating well, the very best distractor may be relatively less important for item functioning. However, constructing many well-functioning distractors has proven to be a difficult task (cf. Haladyna et al., 2002).

An additional limitation of the present study is that our investigation of criterion-related evidence of validity disclosed an unexpected result, namely that evidence was strongest for the test in which only the best distractor was presented together with the solution. While this finding may be the result of our use of a fully randomized design and a much larger sample than in all previous studies combined, the result should be replicated to assess its generalizability.
The strongest motivation for investigating the optimal number of answer options in MC items has always been the attempt to facilitate the work of item writers. Perhaps for this reason, researchers have long questioned whether the use of as many options as four or five is really necessary to ensure good test functioning. Our results strongly suggest that three and even two options can be sufficient to obtain reliable and meaningful test scores as long as the distractors are functional. Our findings also show that considering only the number of options is not enough and that distractor quality affects test functioning much more than distractor quantity. On the basis of our findings, we recommend that test-writers put their effort toward creating one or two good distractors rather than a larger number of relatively poor distractors. Even when only a single well-performing distractor was offered, items in the present study yielded stronger criterion-related evidence of validity and were preferable to items with more distractors that did not function as well. Hence, taken together, our findings suggest that a small beauty is better than a large beast.

References

Baghaei, F., & Amrahi, N. (2011). The effects of the number of options on the psychometric characteristics of multiple choice items. Psychological Test and Assessment Modeling, 53(2), 192-211.

Brozo, W. G., Schmelzer, R. V., & Spires, H. A. (1984). A study of testwiseness clues in college and university teacher-made tests with implications for academic assistance centers (Technical Report 84-01). Georgia State University: College Reading and Learning Assistance.

Budescu, D. V., & Nevo, B. (1985). Optimal number of options: An investigation of the assumption of proportionality. Journal of Educational Measurement, 22(3), 183-196.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Costin, F. (1970). The optimal number of alternatives in multiple-choice achievement tests: Some empirical evidence for a mathematical proof. Educational and Psychological Measurement, 30(2), 353-358.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.

DiBattista, D., & Kurzawa, L. (2011). Examination of the quality of multiple-choice items on classroom tests. The Canadian Journal for the Scholarship of Teaching and Learning, 2(2), Article 4.

Diedenhofen, B., & Musch, J. (2015). cocor: A comprehensive solution for the statistical comparison of correlations. PLoS ONE, 10(4), e0121945.

Diedenhofen, B., & Musch, J. (2016). cocron: A web interface and R package for the statistical comparison of Cronbach's alpha coefficients. International Journal of Internet Science, 11, 51-60.

Downing, S. M. (1992). True-false, alternate-choice, and multiple-choice items. Educational Measurement: Issues and Practice, 11(3), 27-30.

Ebel, R. L. (1968). Blind guessing on objective achievement tests. Journal of Educational Measurement, 5(4), 321-325.

Ebel, R. L. (1969). Expected reliability as a function of choices per item. Educational and Psychological Measurement, 29(3), 565-570.

Edwards, B. D., Arthur, W., & Bruce, L. L. (2012). The 3-option format for knowledge and ability multiple-choice tests: A case for why it should be more commonly used in personnel testing. International Journal of Selection and Assessment, 20(1), 65-81.

Farhady, H., & Shakery, S. (2000). Number of options and economy of multiple-choice tests. Roshd Foreign Language Teaching Journal, 14, 132-140.

Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41(4), 1149-1160.

Feldt, L. S., Woodruff, D. J., & Salih, F. A. (1987). Statistical inference for coefficient alpha. Applied Psychological Measurement, 11(1), 93-103.

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver and Boyd.

Green, K., Sax, G., & Michael, W. B. (1982). Validity and reliability of tests having differing numbers of options for students of differing levels of ability. Educational and Psychological Measurement, 42(1), 239-245.

Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). New York, NY: Routledge.

Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multiple-choice test item? Educational and Psychological Measurement, 53(4), 999-1010.

Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

Hossiep, R., & Schulte, M. (2008). Bochumer Wissenstest (BOWIT). Manual. Göttingen: Hogrefe.

Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151-160.

Lee, H., & Winke, P. (2013). The differences among three-, four-, and 5-option-item formats in the context of a high-stakes English-language listening test. Language Testing, 30(1), 99-123.

Leiner, D. J. (2012). SoSci Panel: The noncommercial online access panel. Poster presented at the GOR 2012, March 6, Mannheim. Retrieved from https://www.soscisurvey.de/panel/download/SoSciPanel.GOR2012.pdf

Lord, F. M. (1977). Optimal number of choices per item: Comparison of 4 approaches. Journal of Educational Measurement, 14(1), 33-38.

Owen, S. V., & Froman, R. D. (1987). What's wrong with 3-option multiple choice items? Educational and Psychological Measurement, 47(2), 513-522.

Ratcliff, R. (1993). Methods for dealing with reaction time outliers. Psychological Bulletin, 114(3), 510-532.

Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3-13.

Straton, R. G., & Catts, R. M. (1980). A comparison of 2-choice, 3-choice and 4-choice item tests given a fixed total number of choices. Educational and Psychological Measurement, 40(2), 357-365.

Tarrant, M., & Ware, J. (2008). Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Medical Education, 42(2), 198-206.

Tarrant, M., & Ware, J. (2010). A comparison of the psychometric properties of three- and 4-option multiple-choice questions in nursing assessments. Nurse Education Today, 30(6), 539-543.

Tarrant, M., Ware, J., & Mohammed, A. M. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: A descriptive analysis. BMC Medical Education, 9(1), 1-8.

Thanyapa, I., & Currie, M. (2014). The number of options in multiple choice items in language tests: Does it make any difference? Evidence from Thailand. Language Testing in Asia, 4(1), 1-21.

Trepte, S., & Verbeet, M. (Eds.). (2010). Allgemeinbildung in Deutschland. Erkenntnisse aus dem SPIEGEL-Studentenpisa-Test. Wiesbaden: VS Verlag für Sozialwissenschaften.

Trevisan, M. S., Sax, G., & Michael, W. B. (1991). The effects of the number of options per item and student ability on test validity and reliability. Educational and Psychological Measurement, 51(4), 829-837.

Tversky, A. (1964). On the optimal number of alternatives at a choice point. Journal of Mathematical Psychology, 1(2), 386-391.

Zoanetti, N., Beaves, M., Griffin, P., & Wallace, E. M. (2013). Fixed or mixed: A comparison of three, four and mixed-option multiple-choice tests in a Fetal Surveillance Education Program. BMC Medical Education, 13(1), 1-11.


Table 1.

Overview of the five experimental conditions and the three external criteria for validation.

4 options (n = 1,142): No distractor was deleted. The full set of four answer options (one solution and three distractors) was presented.

Worst deleted, 3 options (n = 1,167): For each item, the worst distractor in terms of discrimination and endorsement rate was deleted. The two remaining distractors and the solution were presented as 3-option items.

Worst deleted, 2 options (n = 1,162): For each item, the two worst distractors in terms of discrimination and endorsement rate were deleted. The remaining distractor and the solution were presented as 2-option items.

Best deleted, 3 options (n = 1,149): For each item, the best distractor in terms of discrimination and endorsement rate was deleted. The two remaining distractors and the solution were presented as 3-option items.

Best deleted, 2 options (n = 1,173): For each item, the two best distractors in terms of discrimination and endorsement rate were deleted. The remaining distractor and the solution were presented as 2-option items.

Note. To obtain criterion-related evidence of validity, we collected the following external criteria for 5,793 test takers:
1. Self-rating of general knowledge
2. Score in a general knowledge test (Spiegel Student Pisa, Trepte & Verbeet, 2010)
3. Education (total number of years spent in school and at university)

Next, test takers worked on 30 multiple-choice general knowledge items from the BOWIT (Bochumer Wissenstest = Bochum Test of General Knowledge). Test takers were assigned randomly to the five experimental conditions differing in the quality and quantity of the distractors.

Table 2. Psychometric Properties of the Different Test Sets

                                        Worst deleted              Best deleted
                        4 options    3 options    2 options    3 options    2 options
                        (n = 1,142)  (n = 1,167)  (n = 1,162)  (n = 1,149)  (n = 1,173)
pitem                      .551         .566         .652         .665         .841
ritem                      .331         .301         .293         .241         .169
Cronbach's α               .823         .796         .786         .727         .609
rPisa test                 .586         .576         .666         .596         .523
rself-rated knowledge     -.318        -.312        -.356        -.291        -.224
reducation                 .230         .237         .275         .262         .250
pdis1                      .251         .273         .348         ----         ----
pdis2                      .138         .161         ----         .231         ----
pdis3                      .061         ----         ----         .105         .186
pdisAvg                    .150         .217         .348         .168         .186
rdis1                     -.263        -.284        -.373         ----         ----
rdis2                     -.164        -.166         ----        -.243         ----
rdis3                     -.115         ----         ----        -.170        -.276
rdisAvg                   -.181        -.225        -.373        -.207        -.276
nfunctional-d             2.333        1.867        1            1.667        0.867
%functional-d             78           93           100          83           87
Mtesttime                 534.354 s    460.863 s    397.615 s    457.012 s    352.952 s
SDtesttime                492.776 s    394.793 s    556.518 s    308.134 s    370.940 s
MtesttimeItem              17.812 s     15.362 s     13.254 s     15.234 s     11.765 s
Time correction factor    1            1.159        1.344        1.169        1.514
Time-corrected alpha       .823         .819         .832         .757         .702

Note. pitem = average item difficulty for the 30 BOWIT items. ritem = average item discrimination. rPisa test = correlation of BOWIT test score and Spiegel Pisa test score. rself-rated knowledge = correlation of BOWIT test score and self-rated general knowledge. reducation = correlation of BOWIT test score and years spent obtaining an education. pdis1 = average endorsement rate for the best distractor, pdis2 = average endorsement rate for the second best distractor, pdis3 = average endorsement rate for the worst distractor, pdisAvg = average endorsement rate for all distractors available for the respective test set, rdis1 = average discrimination of the best distractor, rdis2 = average discrimination of the second best distractor, rdis3 = average discrimination of the worst distractor, rdisAvg = average discrimination of all distractors available for the respective test set, nfunctional-d = average number of functional distractors per item, %functional-d = average proportion of functional distractors per item. Testing time was measured in seconds. Mtesttime is the average test time needed to finish all 30 BOWIT items, and SDtesttime is its standard deviation. MtesttimeItem is the average time that was needed to work on one item. Time correction factor indicates how many items of the respective item type could be answered in the same time as a 4-option reference item. Time-corrected alpha is a Spearman-Brown-corrected reliability based on the time correction factor that assumes a fixed testing time for all conditions.