Advances in Health Sciences Education: Theory and Practice
ISSN 1382-4996
DOI 10.1007/s10459-012-9434-4

Multiple choice questions can be designed or revised to challenge learners’ critical thinking

Rochelle E. Tractenberg • Matthew M. Gushta • Susan E. Mulroney • Peggy A. Weissinger

Received: 17 July 2012 / Accepted: 21 November 2012


© Springer Science+Business Media Dordrecht 2012

Abstract Multiple choice (MC) questions from a graduate physiology course were evaluated by cognitive-psychology (but not physiology) experts, and analyzed statistically, in order to test the independence of content expertise and cognitive complexity ratings of MC items. Integration of higher order thinking into MC exams is important, but widely known to be challenging, perhaps especially when content experts must think like novices. Expertise in the domain (content) may actually impede the creation of higher-complexity items. Three cognitive psychology experts independently rated cognitive complexity for 252 multiple-choice physiology items using a six-level cognitive complexity matrix that was synthesized from the literature. Rasch modeling estimated item difficulties. The complexity ratings and difficulty estimates were then analyzed together to determine the relative contributions (and independence) of complexity and difficulty to the likelihood of correct answers on each item. Cognitive complexity was found to be statistically independent of difficulty estimates for 88 % of items. Using the complexity matrix, modifications were identified to increase some item complexities by one level, without affecting the item’s difficulty. Cognitive complexity can effectively be rated by non-content experts. The six-level complexity matrix, if applied by faculty peer groups trained in cognitive complexity and without domain-specific expertise, could lead to improvements in the complexity targeted with item writing and revision. Targeting higher order thinking with MC questions can be achieved without changing item difficulties or other test characteristics, but this may be less likely if the content expert is left to assess items within their domain of expertise.

Keywords Cognitive complexity · Higher order thinking · Multiple-choice test items · Assessment

R. E. Tractenberg (corresponding author)
Collaborative for Research on Outcomes and -Metrics and Departments of Neurology, Biostatistics, Bioinformatics & Biomathematics, and Psychiatry, Georgetown University Medical Center, Building D, Suite 207, 4000 Reservoir Rd. NW, Washington, DC 20057, USA
e-mail: rochelle.tractenberg@gmail.com

R. E. Tractenberg
Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, DC 20057, USA

R. E. Tractenberg
Department of Psychiatry, Georgetown University Medical Center, Washington, DC 20057, USA

M. M. Gushta
Wireless Generation, Washington, DC, USA

S. E. Mulroney
Department of Pharmacology & Physiology, Georgetown University Medical Center, Washington, DC, USA

P. A. Weissinger
School of Medicine, Georgetown University, Washington, DC, USA

Introduction

‘‘Higher order thinking’’ reflects cognitive complexity in performance, often referring to test items (see e.g., Moseley et al. 2005; Zheng et al. 2008). Assessments that incorporate this complexity require respondents to exhibit skills ranging from lower order (list, define, tell, describe, interpret, contrast, associate), through middle order (solve, examine, classify, explain, infer, decide), to the highest order (integrate, modify, create, assess, judge, support) cognitive levels to generate the correct answer (Williams and Haladyna 1982; Anderson et al. 2001 after Bloom et al. 1956; Moseley et al. 2005). The best-known taxonomy rank
ordering these types of cognitive skills, Bloom’s Taxonomy, published in 1956, was
developed to support cognitively oriented educational objectives (Bloom et al. 1956);
many different taxonomies have followed (see e.g., Moseley et al. 2005). Higher order
thinking is valued across many curricula in higher and post-graduate education, and yet
obtaining test-based evidence to support the achievement of this objective is difficult.
Tests that do not directly assess higher order thinking do not directly support an
‘‘…argument about learner competencies that one seeks to develop and eventually support
with data from behaviors that activate these competencies.’’ (Rupp and Mislevy 2007 p. 205).
In spite of the widespread use of Bloom’s taxonomy to train faculty within ‘‘teaching
excellence’’ centers across many colleges and universities in the United States (and elsewhere), writing ‘‘good’’ test items that purposely require higher order thinking is widely
recognized to be difficult (e.g., Haladyna 1997; Downing 2002; see also Case and Swanson
2002 for clinical and basic sciences contexts, and Bruff 2009 for questioning in real time).
‘‘A fundamental assumption in writing any test item is that we know what we are
testing’’ (Haladyna 1997, p. 13); and indeed the Standards for Educational and Psychological Testing (AERA/NCME/APA 1999) explicitly articulate the need to consider cognitive operations for valid test design and interpretation. While instructors across the health
sciences know their material intimately, they may not be able to get beyond the
‘‘knowledge’’ level to capture more complex and abstract thinking. Furthermore, it might
be more challenging for domain experts (i.e., faculty in their fields) to write items using
Bloom’s taxonomy because experts think of their domains in qualitatively different ways
than novices (see Ericsson 2004). In their review of university-level students’ and faculty members’ perceptions of assessment, van de Watering and van der Rijt (2006) searched multiple databases of published literature; although they reported that standard setting (see e.g., Cizek and Bunch 2008) proceeds on the assumption that faculty know and can reliably maintain a mental image of a ‘‘minimally competent student’’ or similar construct, they found very little evidence that content area experts can actually estimate the difficulty of test items or the capabilities of students to correctly answer these items. van de Watering and
van der Rijt (2006) concluded that without specific training in item writing, and possibly
collaborations with assessment experts, faculty may be impeded by their very expertise in
their efforts to assess their own students at the cognitive complexity levels that they would
like to target.
This study sought to test the hypothesis that cognitive complexity could be separated
from content in multiple choice test item reviews. We also sought to determine whether
cognitive complexity (in terms of Bloom’s and other cognitive taxonomies) was separable
from item difficulty, so as to develop a framework for training subject-matter expert
instructors to write or revise test items that can provide evidence relating explicitly to
cognitive complexity without changing the ‘‘psychometric properties’’ of their existing
exams. This evidence could in turn be used to promote the creation of test items that vary
in both difficulty and cognitive complexity in order to facilitate the curricular goal of
integrating these skills across a curriculum where multiple choice tests may be common,
such as across the health sciences.

Methods

This study was conducted as an educational research project without human subjects
(under our institutional review board definitions) and was exempt.

Overview

Published cognitive complexity taxonomies were synthesized into a single matrix that
could facilitate the integration of this complexity into multiple choice questions (MCs)
developed by faculty, which form the bulk of assessments across many health sciences,
perhaps especially in the undergraduate medical education assessment paradigm. This synthesis was achieved by one cognitive psychology and assessment expert (RET) aligning the
levels and target cognitive behaviours, and making adjustments to the alignment with
independent input from the other two experts in the study (MMG & PAW). The matrix was
based on existing taxonomies to capitalize on the strengths of these contributions to the
literature; our intention was to ground the complexity ratings in well known, published
work in the field. Modern statistical methods were used for analysis and interpretation of
the matrix and the relationship between difficulty and complexity. Cognitive, and not
content-area, experts provided complexity ratings on MC items using this matrix as a proof
of concept for establishing a cognitive-complexity-training program for faculty.
The analyses were carried out on 252 MC items from three exams used in a semester-long, graduate-level physiology course; the exam items are protected (not released to students) and are reused every year. In 2008 the course enrolled 189 students. The Least Squares Distance Method (LSDM; Dimitrov 2007) was used to validate cognitive processing requirements in the context of item difficulty; the LSDM requires the specification of a matrix representing the cognitive complexity demands of each item, together with item difficulty estimates. The methods for these three study elements (complexity matrix, difficulty estimation, and LSDM) are described below.

Instruments and participants

Three multiple-choice exams, each administered for course credit, were given to students in a single class drawn from three programs: general undergraduate medicine (MED; n = 159), a specialized one-year pre-medical preparatory program, Georgetown Experimental Medical Studies (GEMS; n = 28), and a one-year program leading to a Master of Science degree in Physiology, the Special Master’s Program (SMP; n = 159). All students were enrolled in the semester-long preclinical human physiology course at Georgetown School of Medicine in 2008. The first exam comprised fifty (50) items, the second one hundred (100) items, and the third one hundred and two (102) items. All items were of the MC type, with five (5) response options each, scored dichotomously (right/wrong). Item ordering was varied across two forms (A and B) that were randomly assigned to examinees at each administration. All students taking an exam answered all items, and the items on the two forms were identical apart from their order. All three cohorts took the same exams while completing the same semester-long course during the same semester.

Cognitive complexity and the Q matrix

Historically, the most widely studied and recognized method for describing cognitive
complexity has been the categorization described by Bloom’s Taxonomy of Educational
Objectives (Bloom et al. 1956). The taxonomy was recently revised by Anderson et al.
(2001) to incorporate advances in cognitive psychological theory and has also been
modified by Williams and Haladyna (1982) to make the taxonomic levels more straightforward for implementation in classroom assessment. Table 1 provides the dimensions of
cognitive complexity that were compiled with reference to the original Bloom’s taxonomy
and the revisions and modifications by Anderson et al. (2001) and Williams and Haladyna
(1982).
Three raters (co-authors on this paper), with Master’s (MMG) and doctoral-level (RET, PAW) experience in assessment and cognitive psychology at the time of the study, applied the rubric independently and without content knowledge (i.e., the emphasis was on assessment principles and on the cognitive requirements of the items, not on concepts). All raters were familiar with the classification schemes from which the six-level complexity matrix (the ‘‘cognitive complexity matrix’’ or CCM; Table 1) derived its descriptors and categories (i.e., Anderson et al. 2001; Bloom et al. 1956; Williams and Haladyna 1982). Exam items were all rated independently using the cognitive complexity matrix shown in Table 1. A single complexity rating, agreed upon by all three raters, resulted for each of the 252 items analyzed in this study; any discrepancies in the independent ratings were resolved by discussion among the reviewers. Examples of rated items include the following two questions, which share an identical stem but were not rated at the same complexity level:
1. Select the FALSE statement.
a. Ventilation is increased at the bottom of the lung due to more compliant alveoli.
b*. Compliance increases at higher lung volumes.
c. Airway closure commonly occurs at the bottom of the lung.
d. The VA/Q in physiological dead space is infinity.
e. The alveolar O2 in a normal subject after 10 min of breathing pure oxygen will be
over 550 mmHg.
These options are all recognition, e.g.,
a. Ventilation is increased at the bottom of the lung due to more compliant alveoli.
b*. Compliance increases at higher lung volumes.


Table 1 Cognitive complexity matrix (CCM)


a. Remember/reiterate: answer based on recognition of a previously seen example
Locate or retrieve relevant knowledge from memory
Recognize, reproduce, recall, restate (verbatim), apply labels (simple recall)/fill in the blank
Bloom et al.: ‘Knowledge’: memorization; Anderson et al.: ‘Remember’: recognize/recall
b. Understand/summarize: answer summarizes info already in the question (and/or answers)
Report/summarize, focused recall, discriminate relevant from irrelevant info. Change from one type of
representation into another (e.g., paraphrase). Find a specific example (focused paraphrasing) or
illustration of a concept or principle; matching
Bloom et al.: ‘comprehension’: understanding, interpretation; Anderson et al.: ‘Understand’: interpret,
exemplify, classify, summarize, infer, compare, explain; Haladyna: ‘Understand’: define, demonstrate,
find, exemplify, illustrate, list
c. Apply/illustrate: answer extrapolates from seen examples to (really) new examples
Recognize previously unseen examples/exemplars; give a previously unseen example, identify what
examples represent. Apply a procedure to a familiar or unfamiliar task
Bloom et al.: ‘Application’: use information (e.g., an equation) to solve novel problems (not just 1–2 features changed from essentially the same problem); Anderson et al.: ‘Apply’: execute, implement;
Haladyna classes illustration under ‘understanding’
d. Analyze/predict
Employ a rule to predict changes that are proximal or distal to the site of change or the time of change.
Use given criteria to articulate what comes next; separation of time/space differentiates this from simple
selection and application of correct formula (which is at c level). Determine how parts of a structure
relate to each other and the whole; determine purpose
Bloom et al.: ‘Analysis’: see patterns, organize; Anderson et al.: ‘analyze’: differentiate, organize,
attribute; Haladyna calls computation, figuring/determining, solving, and concluding ‘Problem Solving’, bridging levels c–d
e. Evaluate
Use criteria to make a decision, judgment or selection; determine what criteria were used in making a
judgment; analyze a situation or problem to determine the consequences; discuss both sides of an issue.
Explain how a decision was reached. Detect inconsistencies or fallacies; determine efficiency and
appropriateness of a procedure
Bloom et al.: ‘Synthesis’: draw conclusions, use old ideas to create new ones; Anderson et al.:
‘Evaluate’: check, critique; Haladyna: ‘Critical Thinking’: anticipate, appraise, critique, defend,
analyze, classify, compare, contrast, predict, distinguish, evaluate, hypothesize, infer, judge, relate,
value
f. Create/apply
The opposite of prediction: describe the chain of events leading to an outcome; given an outcome,
articulate the stages leading to the outcome (given the initial state). Problem solving; devise a
procedure to accomplish a task; formulate an alternate hypothesis or course of action (rather than
choose). Think and argue critically
Bloom et al.: ‘Evaluation’: compare, discriminate between ideas. Make rational arguments for course
of action or choice; Anderson et al.: ‘create’: generate, plan, produce. Haladyna: ‘Creativity’: build,
construct, create, design, invent, make, perform, plan, redesign
Modified from Williams and Haladyna (1982, p. 164), Haladyna (1997, p. 32), Bloom’s taxonomy (Bloom et al. 1956), and the revised Bloom’s taxonomy (Anderson et al. 2001)

Expert consensus: level 1.

2. Select the FALSE statement.


a. In the supine position the lung volumes are smaller than in the erect position.
b*. In the supine position PaO2 is higher than in the erect position.
c. In the supine position the FRC is decreased.
d. Presence of an obstruction in the airways will decrease Vmax.
e. During expiration, peak flow is effort-dependent because it takes time to form a choke point.
These options are more focused recall (given something, something else will happen), e.g.,
b*. In the supine position PaO2 is higher than in the erect position.
e. During expiration, peak flow is effort-dependent because it takes time to form a
choke point.

Expert consensus: level 2.


Additionally, although the expert raters did agree among themselves on the item ratings,
they did not always agree with the instructor, as shown in these two examples.
3. A 26 year old man starts to complain of shortness of breath and tightness in the chest.
There is no history of exposure to noxious agents and the patient is a nonsmoker. On
admission we see a patient who is breathing rapidly and occasionally coughs a dry
cough. The X-ray shows diffuse fine opacities compatible with diffuse interstitial
fibrosis. The arterial blood gases are: PaO2: 55 mmHg, PaCO2: 25 mmHg, pH: 7.51,
HCO3: 20 mEq/L.
Select the FALSE statement. Changes in lung mechanics will include:
a. An increase in dead space ventilation.
b. A reduction in pulmonary capillary blood volume.
c*. An increased compliance at low lung volumes.
d. An increase in recoil pressure at all lung volumes.
e. An increase in the work of breathing.

Expert consensus: level 4 (use a rule to predict changes away from the site and/or time of change; not just the application of a formula, but incorporating time/space changes; articulate what comes next; see patterns, organize; differentiate).
Instructor rating: level 2 (focused recall: given something, something else will happen).

4. With increasing frequency of contraction, cardiac muscle:


a. Sequesters intracellular free Ca++ more rapidly.
b. Contracts with greater force due to increased preload.
c. Contracts with greater force at a constant preload.*
d. Has a reduced Vmax.
e. Has reduced maximum dP/dT.

Expert consensus: level 1: Locate or retrieve relevant knowledge from memory.


Instructor: level 3: identify what examples (answers) represent.
Examples of each cognitive complexity level, with additional example questions from
this exam, are shown in Table 2.
Based on the cognitive complexity matrix, six different (but not mutually exclusive) components may be required for a high probability of a correct answer on a given item. Such a model is called a ‘‘Q-matrix’’ (Tatsuoka 1983): it represents the complexity features in matrix form, indicating whether a given feature is present (1) or absent (0) in a given item. With this information, the contribution of that feature to the probability of a correct response can be estimated while controlling for (or simultaneously estimating) the item difficulty (Mislevy and Huang 2007).


Table 2 Cognitive complexity levels with examples


Cognitive complexity level 1: Remember; Recognize; Recall
Example physiology questions:
‘‘Of the substances listed below, which is most abundant in the interstitial fluid space?’’
‘‘The best index of physical fitness is: <select from list>’’
Cognitive complexity level 2: Focused recall
Paraphrase/interpret; Exemplify (give examples); Classify; Infer; Compare (two previously seen cases);
Explain.
Example physiology questions:
Focused recall
‘‘In the development of atherosclerosis, the earliest pathologic change among the following choices
is:’’
‘‘An experiment is performed in which the myosin backbone protein is mutated to prevent movement
of the hinged regions on the cross-bridge. Which of the following is most likely?’’
Exemplify/Classify
‘‘An infant is found to be hypoglycemic immediately after birth. Higher than normal glucose levels
were found in the stool. Increasing the Na+ content of the milk returned glucose levels to normal. A
decrease in which of the following in the intestinal epithelia may explain the infant’s symptoms?’’
Cognitive complexity level 3: illustration
Example physiology questions:
Recognize what previously unseen examples represent
Give examples/identify what examples represent
Use/apply information (e.g., an equation) to solve novel problems
‘‘If 140 mmoles of sucrose were added to the vascular compartment of a 70 kg individual, which of the
following would NOT occur at equilibrium?’’
Cognitive complexity level 4: prediction
Example physiology questions:
Use a rule to predict changes away from the site &/or time of change
Not just application of formula—incorporate time &/or space changes.
Articulate what comes next
See patterns, organize
Differentiate
‘‘The change in heart rate observed between the left and right panels (of a given electrocardiogram) is
probably…’’ <associated with/caused by/accompanied by>

In the present context, this is the estimated contribution of the particular cognitive feature to item difficulty as estimated according to the Rasch model. (See Gierl et al. 2000 for an excellent overview of the model and the Q-matrix.) Separate Q-matrices were created for each of the three physiology exams.
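To make the structure of a Q-matrix concrete, the sketch below is our own illustration (not code or data from the study): it builds a toy Q-matrix for a handful of hypothetical items. Because each item in this study received a single consensus complexity rating, and only the first four levels were observed, each row reduces to a one-hot indicator over four attributes.

```python
# Toy Q-matrix for five hypothetical MC items rated on the four complexity
# levels observed in this study (Reiterate, Summarize, Illustrate, Predict).
# Q[j, k] = 1 if item j requires attribute k, and 0 otherwise.
import numpy as np

attributes = ["Reiterate", "Summarize", "Illustrate", "Predict"]
consensus_ratings = [1, 1, 2, 3, 4]   # hypothetical 1-based ratings, one per item

Q = np.zeros((len(consensus_ratings), len(attributes)), dtype=int)
for j, level in enumerate(consensus_ratings):
    Q[j, level - 1] = 1               # a single consensus rating yields a one-hot row

print(Q)
# [[1 0 0 0]
#  [1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]]
```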

Item difficulty estimation

Under the Rasch model (Bond and Fox 2007), the probability of correctly responding to a test item is modeled as a logistic function of the difference between a single person parameter (ability) and a single item parameter (difficulty). These parameters describe the ability of examinees and the difficulty of items, respectively, and are located on a continuous scale with units
typically referred to as logits. For each item, the relationship between ability and difficulty
can be displayed graphically via the item characteristic curve (ICC), a figure (like Fig. 1)


Fig. 1 Item characteristic curve example

which plots the probability of a correct response across the range of ability, generally
represented as ranging from -5.0 to +5.0. This range can be conceptualized similarly to the Z-score range, so that values further from zero represent more extreme ability or difficulty
levels.
An attractive characteristic of the Rasch model is that it results in a common scale for
both person and item parameters, meaning that ability parameter values can be directly
compared to item difficulty parameter values. This implies that persons with ability less
than an item difficulty value have a lower probability of correctly answering the item,
persons with ability greater than the item difficulty value have a higher probability of
correctly answering the item, and persons with ability equal to the item difficulty have an
equal probability (0.5) of answering the item correctly or incorrectly.
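As a small illustration of the properties just described (our sketch, not the WINSTEPS estimation itself), the following code evaluates the dichotomous Rasch response probability for chosen ability and difficulty values on the logit scale.

```python
# Dichotomous Rasch model: P(correct) depends only on theta - b (in logits).
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(rasch_probability(0.0, 0.0))             # 0.5: ability equals difficulty
print(round(rasch_probability(1.0, 0.0), 3))   # ~0.731: ability one logit above difficulty
print(round(rasch_probability(-1.0, 0.0), 3))  # ~0.269: ability one logit below difficulty
```

Plotting this function across the ability range (-5.0 to +5.0) for a fixed difficulty reproduces an item characteristic curve like the one in Fig. 1.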
According to the Rasch model, specific patterns of responses to each test item are expected; the degree of random variation around these expectations is described by fit statistics, given the item difficulty and examinee ability parameters. The Infit statistic is sensitive to unexpected response patterns (increases or decreases in value relative to the expected amount of random variation) among examinees whose ability is near the estimate of item difficulty. Similarly, the Outfit statistic is sensitive to unexpected response patterns provided by examinees whose ability is at the extremes of the ability range.
Together, the Infit and Outfit statistics are used to describe overall fit for each item; values
between 0.7 and 1.3 for both statistics are said to demonstrate adequate fit (Smith et al.
1998). WINSTEPS (Linacre 2007) was used to fit the items to the Rasch model in each
semester separately.
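The sketch below shows the standard mean-square formulation of these fit statistics for a single item, using simulated responses; it is a hedged illustration of the general computation, not output reproduced from the WINSTEPS analyses reported here.

```python
# Infit and Outfit mean-square statistics for one item, in the usual Rasch
# formulation: Outfit is the unweighted mean of squared standardized residuals,
# Infit is the information-weighted version.
import numpy as np

def infit_outfit(x: np.ndarray, p: np.ndarray):
    resid_sq = (x - p) ** 2                    # squared residuals, observed minus expected
    variance = p * (1 - p)                     # Bernoulli variance of each response
    outfit = float(np.mean(resid_sq / variance))
    infit = float(resid_sq.sum() / variance.sum())
    return infit, outfit

# Simulated example: 189 examinees answering an item of difficulty 0.5 logits.
rng = np.random.default_rng(0)
theta = rng.normal(1.8, 0.8, size=189)         # abilities roughly like those in Table 5
p = 1.0 / (1.0 + np.exp(-(theta - 0.5)))       # Rasch-expected probabilities
x = rng.binomial(1, p).astype(float)           # simulated 0/1 responses
print(infit_outfit(x, p))                      # values near 1.0 (0.7-1.3) indicate adequate fit
```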

Least squares distance method

The LSDM is a method for validation and analysis of cognitive attributes (Dimitrov 2007)
which requires ‘‘a difficulty estimate’’ that can be derived from any specific model or
method, including classical test theory and Rasch and other modern test theory methods
(D. Dimitrov, personal communication July 2008). The LSDM uses existing difficulty
estimates and an appropriate Q-matrix to model attribute probabilities for fixed levels of
ability; estimation details are presented in the ‘‘Appendix’’. The LSDM then estimates the
probability for an individual with a given ability level (based, in our case, on the Rasch model results) to correctly apply the specific cognitive skill that is articulated in the Q-matrix (derived from the expert raters’ consensus).
These LSDM-derived item probabilities are calculated as the product of the attribute
probabilities across ability levels (see ‘‘Appendix’’), and approximate the probabilities that
are derived under the Rasch model. Recovery of the item probabilities by the LSDM is
demonstrated graphically by plotting the probabilities against the Rasch-based ICC and
numerically through calculation of the mean absolute difference (MAD) between the two
curves. Dimitrov (2007) suggested that the recovery of the ICC should be interpreted
according to the values of MAD: namely, as representing ‘‘very good’’ recovery
(MAD < 0.02), ‘‘good’’ (0.02 ≤ MAD < 0.05), ‘‘somewhat good’’ (0.05 ≤ MAD < 0.10), ‘‘somewhat poor’’ (0.10 ≤ MAD < 0.15), ‘‘poor’’ (0.15 ≤ MAD < 0.20), and ‘‘very poor’’ (MAD > 0.20). The MAD values provide a key outcome of the LSDM, indicating the degree to which the original item characteristic curves (estimated via Rasch in our case) can be re-expressed according to the cognitive complexity structure hypothesized by the Q-matrix. Additionally, if the response model (Rasch in our case) fits well, then the outcome of the LSDM analyses can also suggest whether items, possibly matched in terms of content and/or difficulty, might be retained or reworked in order to optimize the correspondence between the assessment overall and the Q-matrix or table of test specifications that might be developed for this exam to incorporate content, difficulty, and complexity. An important note is that modern models (item response theory and/or Rasch) provide the user with estimated fit of the response model to the observed data, while classical test-theory difficulty estimates (which are not response models) do not provide such information. Thus, LSDM can be used with item difficulty estimates derived from classical test theory (i.e., p values from Scantron scoring, for example), but the contributions of cognitive complexity to the item difficulty and response probabilities with such difficulty estimates will be more challenging to interpret. We used Rasch modeling to improve the interpretability of our results and achieve our study aims.
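The following sketch (our own illustration under simple assumptions, not the LSDM software used in the study) shows the MAD comparison described above: a Rasch ICC is compared point by point with a hypothetical LSDM-recovered probability curve over a grid of ability levels, and the result is classified with Dimitrov's (2007) cut-offs.

```python
# Mean absolute difference (MAD) between a Rasch ICC and an LSDM-recovered
# item probability curve, evaluated over fixed ability levels, plus the
# recovery label from Dimitrov's (2007) cut-offs.
import numpy as np

def mad(icc: np.ndarray, recovered: np.ndarray) -> float:
    return float(np.mean(np.abs(icc - recovered)))

def recovery_label(value: float) -> str:
    for cutoff, label in [(0.02, "very good"), (0.05, "good"), (0.10, "somewhat good"),
                          (0.15, "somewhat poor"), (0.20, "poor")]:
        if value < cutoff:
            return label
    return "very poor"

theta = np.linspace(-5.0, 5.0, 41)                  # fixed ability levels
rasch_icc = 1.0 / (1.0 + np.exp(-(theta - 0.3)))    # item with difficulty 0.3
lsdm_curve = np.clip(rasch_icc + 0.03, 0.0, 1.0)    # a hypothetical recovered curve
value = mad(rasch_icc, lsdm_curve)
print(round(value, 3), recovery_label(value))       # ~0.03, which falls in the "good" band
```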

Results

We obtained results from each of the three methods outlined above, in terms of validation
of the complexity matrix (Table 1), Rasch model-based difficulty estimation, and the
incorporation of both into the LSDM procedure. Each is described below.

Cognitive complexity

Expert assignment of cognitive complexity level ratings to each item on the three exams showed that the largest share of the 252 items was classified at the lowest complexity level, Reiteration (Exam 1: 24/50; Exam 2: 41/100; Exam 3: 48/102). The results are shown in Table 3.
Many of the remaining items were similarly distributed between Summarization (12, 26,
and 21) and Illustration (13, 29, and 30). Very few items were classified as Prediction items
(1, 4, and 3) and no items were classified as representing either of the two highest cognitive
complexity categories, Evaluation and Application. Thus, these were excluded from further
discussion and analysis of cognitive complexity in this sample.
The first four levels were represented on each of the three exams given in the course;
since the two highest levels of complexity require that students generate some response,
they tend to be incompatible with multiple choice questions.


Table 3 Results of expert ratings consensus based on the cognitive complexity matrix

                      Reiterate        Summarize        Illustrate       Predict          Mean item complexity
                      (CC level 1) %   (CC level 2) %   (CC level 3) %   (CC level 4) %   rating (SD)
Exam 1 (50 items)     48               24               26               2                1.82 (0.90)
Exam 2 (100 items)    41               26               29               4                1.78 (0.91)
Exam 3 (102 items)    47               21               29               3                1.90 (0.93)

The independent expert ratings falling into all four levels suggest that these four of the six cognitive complexity levels were, in fact, represented within the 252 items rated.

Item difficulty estimation

As noted, item difficulty was estimated in WINSTEPS (Linacre 2007) according to the
Rasch model. The results showed similar characteristics for each of the three exams
(Table 4). Average item difficulty is the same across exams (mean = 0.000, representing the midpoint of the ability range, not zero ability) because the estimation software centers item difficulties at zero for model identification; the standard deviation of item difficulties derived from modern response models is typically about 1.000. Our Rasch model estimated the item difficulties as ranging
from -3.7 to 2.5, with Exam 1 demonstrating a narrower range of difficulty than the two
later exams. Student ability, θ, was also estimated according to the Rasch model using WINSTEPS (Table 5); distributional properties of student ability for Exams 1 and 2 were similar, though Exam 2 demonstrates a lower maximum score. Mean student ability on Exam 3 (θ = 2.331) is much higher than on the other two exams, and the minimum ability score on Exam 3 is a whole logit higher than the minimum values for either Exam 1 or 2.
These results suggest that items on exams later in the semester tend to be more difficult
than those of the first exam and that student performance also increases, as the minimum
student ability is shown to increase dramatically on Exam 3.
The item difficulty values estimated according to the Rasch model were regressed on
the four cognitive complexity categories that were identified by the expert raters as being
present within the 252 items. The multiple regression results indicated that cognitive complexity accounts for very little of the variance in the item difficulty parameters (Exam 1: R² = 0.027; Exam 2: R² = 0.036; Exam 3: R² = 0.014). That is, the difficulty of the items
was essentially independent of the cognitive complexity in the items, with less than 4 % of
variability being shared between these two features for any item on any exam. Thus,
cognitive complexity and item difficulty are features of test items that are separable, and
each can be modified.
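As an illustration of this check (simulated data, not the study's item parameters), the sketch below regresses item difficulties on dummy-coded complexity categories and reports R²; when difficulty is generated independently of complexity, R² stays small, mirroring the values reported above.

```python
# Regress Rasch item difficulties on dummy-coded cognitive complexity categories
# and compute R^2 with ordinary least squares (simulated, illustrative data).
import numpy as np

rng = np.random.default_rng(1)
n_items = 100
complexity = rng.choice([1, 2, 3, 4], size=n_items, p=[0.41, 0.26, 0.29, 0.04])
difficulty = rng.normal(0.0, 1.1, size=n_items)   # generated independently of complexity

# Design matrix: intercept plus dummies for levels 2-4 (level 1 is the baseline).
X = np.column_stack([np.ones(n_items)] +
                    [(complexity == k).astype(float) for k in (2, 3, 4)])
beta, *_ = np.linalg.lstsq(X, difficulty, rcond=None)

fitted = X @ beta
ss_res = np.sum((difficulty - fitted) ** 2)
ss_tot = np.sum((difficulty - difficulty.mean()) ** 2)
print(round(1.0 - ss_res / ss_tot, 3))            # small R^2, as in the results above
```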

LSDM

Adequate item fit is a necessary condition for the application of LSDM (Dimitrov 2007).
Overall, the items in the three exams fit the Rasch model well: all items demonstrated
desirable Infit and just 8 % of all items (n = 20) demonstrated Outfit values beyond the
desired range (Table 4). Therefore, the use of the LSDM is valid for these data.


Table 4 Distribution of item difficulty, infit, and outfit, by exam

                  Exam 1 (50)         Exam 2 (100)        Exam 3 (100a)
Difficulty
  Mean            0.000               0.000               0.000
  SD              1.082               1.111               1.217
  Range           -2.230 to 2.462     -3.639 to 2.502     -3.712 to 2.468
Infit
  Mean            1.000               1.000               1.001
  SD              0.044               0.061               0.049
  Range           0.930 to 1.190      0.850 to 1.180      0.890 to 1.240
  Prop. misfit    0.000               0.000               0.000
Outfit
  Mean            0.971               1.004               0.960
  SD              0.169               0.210               0.187
  Range           0.450 to 1.500      0.480 to 1.910      0.420 to 1.880
  Prop. misfit    0.080               0.080               0.080

a Two items were excluded from the analysis since all students answered them correctly

Similar to the ICC typically employed in item response and Rasch modeling, the Attribute Probability Curve (APC) displays the probability of a correct response when a student possesses the necessary attribute (in this case, cognitive complexity). Figure 2 displays the APCs for
each exam across the range of student ability.
According to Dimitrov’s (2007) criteria for values of the MAD, 76 % of all items
recovered the Rasch ICC with MAD values in the ‘‘somewhat good’’ to ‘‘very good’’ range
(Tables 5, 6).
The correlation between MAD and degree of misfit was found to be near zero for both
Infit (r = 0.106, 0.066, and 0.167) and Outfit (r = - 0.106, -0.180, and -0.003) statistics
when estimated for each exam. However, 8 % of all items had MAD values in the ‘‘poor’’
range in terms of ICC recovery, and 4 % were ‘‘very poor’’. These items with ‘‘poor’’ and
‘‘very poor’’ MAD values had Infit and Outfit values reflecting no problems in their
difficulty estimates from Rasch models. Thus, the LSDM method provided information
about the combination of item difficulty and complexity for all but 12 % of the 252 items
analyzed. Since difficulty information was estimable for these 12 %, and cognitive complexity ratings were assigned to them all, the poor ICC recovery via LSDM suggests either a stronger-than-desired association between complexity and difficulty, or other potential problems with these items that would not have been identified using complexity or difficulty alone.

Discussion

The LSDM recovery of the Rasch ICCs for individual items generally indicates that the Q-
matrix (Table 1) that we derived from cognitive complexity classifications reasonably
captures the hypothesized relations between cognitive complexity and test items. Very few
(4 %, n = 30) items demonstrated misfit and the majority of items (78 %) demonstrated
appropriate ICC recovery (small MAD values).


Table 5 Distribution of student ability (estimated with Rasch models), by exam


Statistic Exam 1 Exam 2 Exam 3

Mean 1.869 1.651 2.331


SD 0.831 0.756 0.800
Range -0.502 to 5.687 -0.521 to 3.641 0.508 to 5.170

Fig. 2 Attribute probability curves for each exam


Table 6 Distribution of mean absolute difference (MAD) between the LSDM-based item probability curves and Rasch-based item characteristic curve (ICC) values, by exam

             Exam 1             Exam 2             Exam 3
n            50                 100                100
Mean         0.067              0.068              0.076
SD           0.056              0.057              0.063
Min.         0.000              0.008              0.010
Max.         0.187              0.292              0.329
Good         37                 79                 75
Poor         13                 18                 9
Very poor    0                  3                  5

Further, the correlation between MAD and degree of misfit was found to be near zero for both Infit (r = 0.106, 0.066, and 0.167) and Outfit (r = -0.106, -0.180, and -0.003) statistics, indicating that item fit was generally not related to the probability of a correct response recovered by the LSDM. Our complexity results are consistent with those of Zheng et al. (2008), among others, in that the
cognitive complexity of test items for first year medical students tends to be in the lower
three levels.
The LSDM results also support the validity of the cognitive complexity ratings of these
items. Although the LSDM statistics recovered the Rasch-based results, the APCs show
little to no differentiation between cognitive complexity categories with regard to the
discrimination of low- and high-ability students, or with regard to the general difficulty of
items that require Reiteration versus Summarization versus Illustration versus Prediction
(see Fig. 2). Counter-intuitively, these results indicate that the modeled cognitive complexity categories are equally difficult; put another way, item difficulty is not
the most appropriate quality or dimension by which to differentiate cognitive processing
requirements. This is in keeping with recent findings on cognitive complexity and student
performance, where difficulty and complexity are shown to be independent dimensions of
item characteristics (Gushta et al. 2009). Thus, both can be targeted in either item writing
or in item evaluation/consideration. It is possible that, without explicit attention to the
cognitive complexity of their multiple choice test items, many content-area experts writing
test items for their students are simply unable to take advantage of the full spectrum of
Bloom’s taxonomy (i.e., write items that are more cognitively complex but not more difficult).
There are valuable and important differences in an emphasis on knowledge (facts),
identification of patterns, and deep(er) levels of understanding (exemplified by elaborative,
explanatory responses by students) at different points in the medical curriculum (Custers
and Boshuizen 2002, pp. 194–195) and across the health sciences. However, although
Bloom’s Taxonomy has represented the hierarchy of cognitive complexity since 1956, it is
rare to find, and difficult to create, assessments that span the desired range of complexity.
For this study we synthesized several hierarchical representations of cognitive complexity
into a single matrix, and then tested the independence of these characteristics from the
difficulty of multiple choice test items. Our results suggest that these features were indeed
independent in over 80 % of the items studied. Importantly, if we had used a 3-level
taxonomy, as has been suggested within different disciplines as a method to ‘‘simplify’’
Bloom’s taxonomy (e.g., veterinary medical education, van Hoeij et al. 2004; economics,
Buckles and Siegfried 2006), over 95 % of items would have fallen into the first level
(Reiteration, Summarization, Illustration). It is well documented in higher education
contexts that creating valid test items with high(-er) cognitive complexity levels is
extremely challenging; having fewer categories might make the complexity matrix simpler, but would do nothing to facilitate testing at those levels. As noted by Buckles and
Siegfried (2006), ‘‘(w)e do not know how to test synthesis and evaluation using multiple-
choice questions and suspect that it cannot be done or, at a minimum, requires efforts
beyond the abilities of most academic question writers.’’ (p. 50).
With more concretely defined levels and appropriate training, item complexity might be shifted by reasonable, targeted amounts by any item writer. Our approach is to focus faculty training on cognitive complexity, and then to have non-content experts review items from other departments to improve the level of cognitive complexity. This manuscript represents our proof of concept that attention to cognitive complexity, rather than content expertise, might finally increase our ability to test at the higher levels of Bloom’s taxonomy.
‘‘Assessment is a critical component of instruction; properly used, it can aid in
accomplishing key curricular goals.’’ (Case and Swanson 2002, p. 9). In the context of
medical education, Gruppen and Frohna (2002) recommend ‘‘…viewing clinical reasoning
from various methodological perspectives while holding constant the content and task
demands of the problem’’ (emphasis added) in order to understand how students learn, and
demonstrate that they have learned, clinical reasoning (p. 225). They also note that
ordering or quantifying levels of expertise in clinical reasoning would be a ‘‘boon’’ to the
study (and by our inference, the teaching) of clinical reasoning (p. 226). Distinguishing
cognitive complexity from difficulty in test items can promote the articulation and integration of higher order thinking across curricula in the health sciences (e.g., Case and
Swanson 2002).
Tardieu et al. (1992) studied the differences between novices and experts in the complexity of mental models derived from written material in the domain of memory. Experts in the domain (memory, in this 1992 experiment) were argued to have constructed a fuller (higher-level) mental model of what they were reading (a text about memory) than the domain novices did for the same material. Shelton (1999) studied decision making by less and more experienced auditors in the context of auditing. Irrelevant information within the task vignettes tended to affect the less expert auditors, diluting their judgments, while the experts were able to overcome any influence of irrelevant information and make what might be considered unimpeded auditing judgments. These findings suggest that decision making by experts within their area of expertise may be executed more automatically (e.g., Anderson 2005), thereby impeding their ability to focus on features such as cognitive complexity when writing or reviewing items. We have found that experts
in assessment and cognitive psychology can distinguish levels of cognitive complexity in
test items from a domain in which they are not expert.
This study supports the concept that expertise in cognitive psychology and assessment facilitated cognitive complexity ratings of physiology test items, presumably because the raters constructed a cognitive complexity model for each test item. Experts in physiology who are not experts in cognitive psychology might instead be more likely to construct a physiology model for the same test item, and the ‘‘irrelevant’’ information in that physiology model may interfere with their ability to target, or review, the cognitive complexity of exam items within their domains, as suggested by van de Watering and van der Rijt (2006).
A significant limitation of this study is that we did not set out to compare the rating abilities of cognitive psychology experts and subject matter experts. Our results are consistent with the findings of van de Watering and van der Rijt (2006), but a study that
directly compared the two types of experts rating the same items would be more definitive.
This study was instead designed to derive analytical evidence of the separability of cognitive complexity and difficulty in multiple choice test items. The statistical analyses
tend to support this, which is also consistent with previous reports. Even without the direct
comparison of subject matter experts’ and non-experts’ ratings of the same items, these
results do support the idea that instructors can increase the cognitive complexity required
by their existing test items (or new ones) without changing the items’ ‘‘psychometric
properties’’ (i.e., difficulties).
We synthesized the cognitive complexity matrix from multiple existing sources, intending to strengthen the face validity of the resulting tool. However, another
important limitation of this work is that we did not validate this Matrix. In fact, we labeled
it a ‘‘Matrix’’ rather than a taxonomy since it only places existing taxonomies within a
single, common framework (rather than constituting a novel classification). We hoped to
capitalize on the strengths of the taxonomies that we utilized, and also to begin to
understand why, when there are so many excellent (validated) resources for improving
item writing, MCQ exams continue to focus on the least cognitively complex levels. Our
results only suggest, rather than demonstrate, that construct-irrelevant interference, in the
form of subject matter expertise, may play a part in the persistence of MCQ exams targeting low complexity levels.
We also sought to support the concept that subject matter non-experts could review and/
or revise test items to increase their cognitive complexity. Since the subject matter non-
experts in the study provided the cognitive complexity ratings to begin with, this evidence
is somewhat circular. The only way to undercut that circularity is to demonstrate that
complexity ratings are reliable, within rater/across items and across raters for the same
items. This study only generated pilot evidence across raters for the same items, and no
reliability estimates. If it is feasible to cross-train faculty in the relevant cognitive psychological theory and assessment constructs, such training programs will require assessment of both types of reliability (within rater/across items and across raters for the same
items) and efforts to detect and manage drift (in ratings) if it is present.
Instructors might be very interested in participating in a training program where they learn to increase the cognitive complexity of test items (and, by extension, of the learning going on in their courses), even if this means they will be reviewing test items outside their areas of expertise. This sort of program could lead to a pool of raters who could perform complexity-specific ratings for test items that come from disciplines or departments within their school or institution but outside of their specific domain of expertise. We are in the process of designing a second study to test this theory and this cross-training model. We have developed a faculty development workshop introducing test specification tables (e.g., Crocker and Algina 1986) that focus on either content or difficulty (depending on the faculty interest), but that also add a second dimension of higher order thinking (cognitive complexity). So far (March 2012), the workshop has been offered to the first-year preclinical course directors within our School of Medicine to encourage and support the
generation of new test items, or the modification of existing test items, that vary on
continua of both cognitive complexity and difficulty. Incorporating cognitive complexity in
a test specifications table template that any instructor could use has the potential to
improve the formative and summative information gleaned from the assessment. Cross-
training faculty to support each other in the development (or review/revision) of valid
exam items that tap higher order thinking may facilitate the representation of the full
continuum of complexities originally captured with Bloom’s taxonomy.

Acknowledgments This work was supported by a Curricular Innovation, Research, and Creativity in Learning Environment (CIRCLE) grant (intramural, GUMC) to RET.


Conflict of interest No declarations of interest to report for any co-author.

Appendix

The Least Squares Distance Model (LSDM; Dimitrov 2007) uses existing IRT item parameter estimates obtained from a separate procedure or program and an appropriate Q-matrix to model attribute probabilities for fixed levels of ability. These probability estimates are calculated as intact units for each fixed level of theta according to the following equations:

P_{ij} = \prod_{k=1}^{K} \left[ P(A_k = 1 \mid \theta_i) \right]^{q_{jk}},  and therefore

\ln P_{ij} = \sum_{k=1}^{K} q_{jk} \ln P(A_k = 1 \mid \theta_i),

similar to the Rasch model, where P_{ij} is the probability of a correct response on item j by person i given ability \theta_i; P(A_k = 1 \mid \theta_i) is the probability of correct performance on attribute A_k for a person with ability level \theta_i; and q_{jk} is the Q-matrix element (0 or 1) associated with item j and attribute A_k.

With n binary items, this generates a system of n linear equations with K unknowns, \ln P(A_k = 1 \mid \theta_i), for each fixed level of ability. This system of equations is represented in matrix algebra form as L = QX, where L is the known vector of elements \ln P_{ij}; Q is the known Q-matrix; and X is an unknown vector of elements X_k = \ln P(A_k = 1 \mid \theta_i).

By minimizing the Euclidean norm of the vector \lVert QX - L \rVert, the unknown vector X and the least squares distance (LSD) are obtained, and the probability of a correct response for a student with ability \theta_i on an item associated with attribute A_k is P(A_k = 1 \mid \theta_i) = \exp(X_k). The LSDM-calculated item probabilities are computed as the product of the attribute probabilities across ability levels; these probabilities approximate the probabilities calculated under the Rasch model.
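A minimal numerical sketch of this least squares step, for one fixed ability level and a hypothetical Q-matrix (our illustration, not the authors' implementation), is given below: solving L = QX in the least squares sense yields ln P(A_k = 1 | θ), and exponentiating recovers the attribute and item probabilities.

```python
# Least squares step of the LSDM at one fixed ability level.
import numpy as np

# Hypothetical Q-matrix: four items by three attributes (1 = attribute required).
Q = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1]], dtype=float)

# Rasch-based probabilities of a correct response for these items at this theta.
p_item = np.array([0.85, 0.70, 0.80, 0.60])
L = np.log(p_item)                              # known vector of ln(P_ij)

X, *_ = np.linalg.lstsq(Q, L, rcond=None)       # minimizes ||QX - L||
attribute_prob = np.exp(X)                      # P(A_k = 1 | theta)
recovered = np.exp(Q @ X)                       # LSDM-recovered item probabilities

print(np.round(attribute_prob, 3))
print(np.round(np.abs(recovered - p_item), 3))  # per-item recovery error at this theta
```

Repeating this solve across the grid of fixed ability levels and averaging each item's absolute error over those levels corresponds to the MAD values summarized in Table 6.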

References

American Psychological Association, National Council on Measurement in Education, American Educational Research Association. (1999). Standards for educational and psychological testing, 2E. Washington, DC: American Educational Research Association.
Anderson, L. W., Krathwohl, D. R., Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., et al.
(Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of
educational objectives. New York: Longman.
Anderson, J. R. (2005). Cognitive psychology and its implications, 6E. New York, NY: Worth Publishers.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of
educational objectives: The classification of educational goals, by a committee of college and uni-
versity examiners. Handbook I: Cognitive domain. New York: David McKay.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch Model: Fundamental measurement in the human
sciences, 2E. Mahwah, NJ: Lawrence Erlbaum Associates.
Bruff, D. (2009). Teaching with classroom response systems: Creating active learning environments. San
Francisco, CA: Jossey Bass.
Buckles, S., & Siegfried, J. J. (2006). Using multiple-choice questions to evaluate in-depth learning of
economics. The Journal of Economic Education, 37(1), 48–57.


Case, S. M., & Swanson, D. B. (2002). Constructing written test questions for the basic and clinical
sciences, 3E-Revised. Philadelphia: National Board of Medical Examiners.
Cizek, G. J., & Bunch, M. B. (2008). Standard setting: A guide to establishing and evaluating performance
standards on tests. Newbury Park, CA: Sage Publications.
Crocker, L., & Algina, J. (1986). Introduction to classical & modern test theory. Belmont, CA: Wadsworth
Group.
Custers, E. J. F. M., & Boshuizen, H. P. A. (2002). The psychology of learning. In G. R. Norman, C. P. M.
van der Vleuten, & D. L. Newble (Eds.), International handbook of research in medical education
(Vol. 1, pp. 163–203). Dordrecht: Kluwer.
Dimitrov, D. (2007). Least squares distance method of cognitive validation and analysis for binary items
using their item response theory parameters. Applied Psychological Measurement, 31, 367–387.
Downing, S. M. (2002). Assessment of knowledge with written test forms. In G. R. Norman, C. P. M. van
der Vleuten, & D. L. Newble (Eds.), International handbook of research in medical education (Vol. 2,
pp. 647–672). Dordrecht: Kluwer.
Ericsson, K. A. (2004). Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains. Academic Medicine, 79(10 Suppl), S70–S81.
Gierl, M. J., Leighton, J. P., & Hunka, S. M. (2000). Exploring the logic of Tatsuoka’s rule-space model for
test development and analysis. An NCME instructional module. Educational Measurement: Issues and
Practice, 19(3), 34–44.
Gruppen, L. D., & Frohna, A. Z. (2002). Clinical Reasoning. In G. R. Norman, C. P. M. van der Vleuten, &
D. L. Newble (Eds.), International handbook of research in medical education (Vol. 1, pp. 205–230).
Dordrecht: Kluwer.
Gushta, M. M., Yumoto, F., & Williams, A. (2009). Separating item difficulty and cognitive complexity in
educational achievement testing. Paper presented at the annual meeting of the American Educational
Research Association, San Diego, CA.
Haladyna, T. M. (1997). Writing test items to evaluate higher order thinking. Needham Heights, MA: Allyn
& Bacon.
Linacre, J. M. (2007). A User’s guide to WINSTEPS Rasch-model computer program. Chicago, IL:
Author. Downloaded 10 October 2007 from http://www.winsteps.com/winsteps.htm.
Mislevy, R. J., & Huang, C.-W. (2007). Measurement models as narrative structures. In M. von Davier & C.
H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models: Extensions & applications
(pp. 16–35). New York: Springer.
Moseley, D., Baumfield, V., Elliott, J., Gregson, M., Higgins, S., Miller, J., et al. (2005). Frameworks for
thinking. Cambridge, UK: Cambridge University Press.
Rupp, A. A., & Mislevy, R. J. (2007). Cognitive foundations of structured item response models. In J.
P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment: Theories and applications (pp.
205–241). Cambridge: Cambridge University Press.
Shelton, S. W. (1999). The effect of experience on the use of irrelevant evidence in auditor judgment. The
Accounting Review, 74(2), 217–224.
Smith, R. M., Schumacker, R. E., & Bush, J. J. (1998). Using item mean squares to evaluate fit to the Rasch
model. Journal of Outcome Measurement, 2, 66–78.
Tardieu, H., Ehrlich, M.-F., & Gyselinck, V. (1992). Levels of representation and domain-specific
knowledge in comprehension of scientific texts. Language and Cognitive Processes, 7(3–4), 335–351.
doi:10.1080/01690969208409390.
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response
theory. Journal of Educational Measurement, 20(4), 345–354.
van de Watering, G., & van der Rijt, J. (2006). Teachers’ and students’ perceptions of assessments: A review
and a study into the ability and accuracy of estimating the difficulty levels of assessment items.
Educational Research Review, 1(2), 133–147.
van Hoeij, M. J. W., Haarhuis, J. C. M., Wierstra, R. F. A., & van Beukelen, P. (2004). Developing a
classification tool based on Bloom’s Taxonomy to assess the cognitive level of short essay questions.
Journal of Veterinary Medical Education, 31(3), 261–267.
Williams, R. D., & Haladyna, T. M. (1982). Logical operations for generating intended questions (LOGIQ):
A typology for higher level test items. In G. H. Roid & T. M. Haladyna (Eds.), A technology for test-
item writing (pp. 161–186). New York: Academic Press.
Zheng, A. Y., Lawhorn, J. K., Lumley, T., & Freeman, S. (2008). Application of Bloom’s taxonomy
debunks the ‘‘MCAT Myth’’. Science, 319, 414–415. doi:10.1126/science.1147852.
