

Original Article

Flawed multiple‑choice questions put on the scale: What is their impact on students' achievement in a final undergraduate surgical examination?
Ahmad Abdul Azeem Abdullah Omer, Mohammed Elnibras Abdulrahim, Ibrahim Abdullah Albalawi
Department of Surgery, Faculty of Medicine, University of Tabuk, Tabuk, Saudi Arabia

ABSTRACT

Background: Violation of item‑writing guidelines is still frequently encountered in assessments in medical colleges. Flawed multiple‑choice (MC) items affect students' performance and distort examination results.
Aims: The aim was to assess the frequency and impact of flawed MC items on students' achievement in our setting.
Settings and Design: This is a quantitative descriptive study conducted at the Faculty of Medicine, University of Tabuk, Saudi Arabia.
Methods: We evaluated a summative surgical examination of 100 single‑correct‑answer MC questions administered to 44 final (6th) year medical students in November 2014. MC items containing one or more violations of item‑writing guidelines were classified as flawed; those with no violations were classified as standard. The passing rates and median scores of high‑ and low‑achieving students were calculated on both the standard and flawed test scales. Item performance parameters (difficulty index, discrimination power and internal consistency reliability [Kuder–Richardson formula 20]) were calculated for standard and flawed items. Descriptive and comparative statistics with the relevant tests of significance were performed using SPSS software version 16 (IBM SPSS Inc., Chicago, Illinois).
Results: Thirty‑nine flawed items were identified (39%), containing 49 violations of the item‑writing guidelines. The passing rate was 93.2% and 91.8% on the total and standard scales, respectively. Flawed items benefited low‑achieving students and disadvantaged high‑achieving students. Overall, flawed items were slightly more difficult, less discriminating and less reliable than standard items.
Conclusions: The frequency of flawed items in our examination was high and reflects the need for more training and faculty development programmes.

Keywords: Flawed multiple‑choice items, high‑achieving students, item analysis, low‑achieving students, standard items

INTRODUCTION

Multiple‑choice (MC) questions are extensively used in the assessment of knowledge in medical education.[1,2] The ability of this type of question to sample widely over a subject, in addition to their objectivity and easy marking, has contributed to their popularity in the field of assessment.[3‑5] When they are well constructed, they can test higher cognitive functions and discriminate well between examinees with reasonable validity and reliability.[3,6] However, poorly constructed MC items may have a negative impact on students' performance in achievement tests.

Address for correspondence: Dr. Ahmad Abdul Azeem Abdullah Omer, Assistant Professor of General Surgery, Department of Surgery, Faculty of Medicine, University of Tabuk, P.O. Box: 3718, Tabuk 71481, Saudi Arabia. E‑mail: a.omer@ut.edu.sa
How to cite this article: Omer AA, Abdulrahim ME, Albalawi IA. Flawed multiple‑choice questions put on the scale: What is their impact on students' achievement in a final undergraduate surgical examination? J Health Spec 2016;4:270‑5. DOI: 10.4103/2468-6360.191908.




Some reports indicated that poorly crafted MC items are still commonly used in medical colleges.[3,8] Despite the fact that MC item‑writing guidelines are well developed and shared in the medical education literature,[9‑11] the frequency of occurrence of flawed MC items is still substantial.[11,12] The effect of item‑writing flaws on students' performance in achievement tests is bimodal, making questions either easier or more difficult to answer.[12,13] Some flawed items clue test‑wise examinees to the correct answer and thereby advantage those students over other performance categories of examinees. Flawed items also introduce unnecessary difficulty to the question and consequently affect the students' performance on the construct being tested (construct‑irrelevant variance).[2,7,12] Haladyna and Downing have stated that 'test‑wiseness could be taught and that some students could increase their scores after such training'.[10] They also added that MC item faults can be detected by examinees with or without training. Test‑wiseness has been referred to as the ability of the student to recognise the correct answer without knowing the question material.[14] Tarrant and Ware reviewed 10 MC examinations used in nursing and found that the percentage of flawed items ranged between 27% and 75%. They also highlighted in their series that borderline students benefited from flawed items, because a greater proportion of them passed the tests when they would otherwise have failed had the flawed items been removed. They concluded that flawed items impacted negatively on the high‑achieving students.[14] Based on these findings, Tarrant and Ware showed low discrimination power of the tests they assessed, since the marks of the borderline students were artificially inflated while those of high‑achieving students were lowered. In his study, Downing reviewed a year‑one basic science MC test and found that one‑third of the items in the test were flawed and that these items were more difficult than the standard items measuring the same content. He also found that flawed items failed about one‑quarter more students than standard items did.[7] In another study, Downing evaluated four basic science examinations for the effect of violation of item‑writing guidelines and found that 36 - 65% of all questions were flawed. He also pointed out that flawed items were more difficult than standard items and that they tend to fail more students, and he found that the reliability of flawed items was higher than that of the standard items.[8] Almuhaidib examined 10 summative undergraduate MC tests and pointed out that the average frequency of flawed items was 17.64%. She also found that flawed items were easier and more poorly discriminating than standard items and that they tend to benefit low‑achieving students and penalise their high‑achieving counterparts.[13] Based on the patient safety concerns in medicine and the responsibility towards the different stakeholders and the whole community, there is a genuine need to construct good quality MC items to improve the reliability and validity of our examination results and consequently the quality of our graduates.[2,12,14]

What we already know:
• When well‑constructed, MC questions can be a valid tool for assessment in high‑stakes examinations.
• The frequency of flawed MC questions is still high in assessments in medical colleges.
• Flawed MC items affect students' performance and distort examination results.

A study of the frequency of occurrence of flawed items and of their nature and effect had not been done before in our setting, a medical college newly founded in 2006. We believe that such a study is essential to shed light on the quality of our MC examinations regarding the frequency, nature and effect of flawed items in our achievement tests. Solutions and recommendations would then be appropriately proposed, based on the findings, to improve the quality of our examinations and the inferences we make from their results.

In this study, we evaluated a summative surgical examination administered to the 6th year final medical students aiming to:
• Determine the frequency and type of flawed MC items
• Assess the effect of flawed MC items on the high‑ and low‑achieving students
• Compare item performance parameters (difficulty index, discrimination power and reliability) of flawed and standard MC items to assess the quality of their performance.

METHODS

This was a quantitative descriptive study conducted to evaluate the frequency and the impact of MC item flaws on the performance of the students in the written part of a final medical surgical examination. The examination was composed of 100 single‑best answer MC questions administered to the 6th year final medical students (n = 44) in our college in November 2014. Based on the opinion of subject experts, questions were analysed and categorised into two groups: flawed items, which contained one or more violations of the MC item‑writing guidelines published in the literature (Haladyna, Downing and Rodriguez 2002; Haladyna and Downing 1989), and standard items, which did not contain any violation of those guidelines. The college implements a criterion‑referenced fixed pass/fail mark strategy in its examinations, which is set at 60%. The pass rate was calculated for the whole




class on the total test scale (involving both flawed and standard items) and on the standard scale (involving only the standard items), and a comparison was made. The item performance parameters (difficulty index, discrimination power and internal consistency reliability [Kuder‑Richardson formula 20 (KR20)]) were calculated for the flawed and standard items and compared with each other. The median scores of the high‑ and low‑achieving students (11 students in each group) on the total, standard and flawed scales were also calculated and compared to highlight any differences that may exist.
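To make these computations concrete, the sketch below derives the same parameters from a 0/1 response matrix (students × items). This is only a minimal re‑implementation under stated assumptions, not the study's actual SPSS procedure: the data are simulated, and the upper‑lower method for the discrimination index (top and bottom 11 scorers, i.e. 25% of the class) is our assumption, as the paper does not specify the variant used.

```python
import numpy as np

def item_analysis(responses, flawed_idx, pass_mark=0.60, group_size=11):
    """Classical item analysis for a 0/1 response matrix (students x items)."""
    n_students, n_items = responses.shape
    flawed_idx = list(flawed_idx)
    standard_idx = [i for i in range(n_items) if i not in set(flawed_idx)]

    # Difficulty index: proportion of students answering each item correctly.
    p = responses.mean(axis=0)

    # Discrimination: item difficulty in the top-scoring group minus that in
    # the bottom-scoring group (11 students each here, 25% of the class).
    order = np.argsort(responses.sum(axis=1))
    low, high = order[:group_size], order[-group_size:]
    d = responses[high].mean(axis=0) - responses[low].mean(axis=0)

    def kr20(cols):
        # Kuder-Richardson formula 20 for the sub-test made of `cols`.
        sub = responses[:, cols]
        k = sub.shape[1]
        pq = (sub.mean(axis=0) * (1.0 - sub.mean(axis=0))).sum()
        return k / (k - 1.0) * (1.0 - pq / sub.sum(axis=1).var(ddof=1))

    def pass_rate(cols):
        # Share of students scoring at or above the 60% cut on this scale.
        return (responses[:, cols].mean(axis=1) >= pass_mark).mean()

    return {
        "difficulty (flawed, standard)": (p[flawed_idx].mean(), p[standard_idx].mean()),
        "discrimination (flawed, standard)": (d[flawed_idx].mean(), d[standard_idx].mean()),
        "kr20 (flawed, standard, total)": (kr20(flawed_idx), kr20(standard_idx), kr20(list(range(n_items)))),
        "pass rate (total, standard)": (pass_rate(list(range(n_items))), pass_rate(standard_idx)),
    }

# Example with random data standing in for the real answer sheets:
rng = np.random.default_rng(42)
responses = (rng.random((44, 100)) < 0.75).astype(int)  # 44 students, 100 items
flawed = list(range(39))                                 # first 39 items "flawed"
print(item_analysis(responses, flawed))
```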
RESULTS

Thirty‑nine questions were classified as flawed, each containing one or more violations of the conventional MC item‑writing guidelines, representing 39% of the total test items. A total of 49 flaws were identified, distributed over the 39 questions. The type and frequency of the item flaws identified are shown in Table 1.

Table 1: Nature and frequency of flaws encountered among all questions (n = 49)

Flaw                            Frequency
Unfocused stem                  14
Implausible distractor          6
Longer correct option           5
Pair of opposites               5
Clanging effect                 3
Logical clue                    3
Convergence                     3
Negatively worded stem          2
Absolutes                       2
Enemies                         2
Vague frequency terms           1
Options not in logical order    1
Unnecessarily wordy options     1
More than one correct answer    1

The overall pass rate on the total test scale was 93.2%, whereas it was 91.8% on the standard scale. The median score of the high‑achieving students on the total test scale was 85%, while on the standard and flawed scales it was 86.8% and 82.7%, respectively, after we corrected for the difference in the number of questions in the flawed and standard categories by rescaling each scale to 100 for ease of comparison. On the other hand, the median score of the low‑achieving students on the total test scale was 62.6%, while on the standard and flawed scales it was 60.7% and 63.6%, respectively. These results are summarised in Table 2.

Table 2: Median scores of high‑achieving and low‑achieving students on different test scales (n = 100)

Scale                  High‑achieving (%)   Low‑achieving (%)
Total test scale       85                   62.6
Standard item scale    86.8                 60.7
Flawed item scale      82.7                 63.6

Flawed items were found to be less discriminating than standard items (0.20 vs. 0.28) and slightly more difficult (0.74 vs. 0.75). However, those differences were not statistically significant, as shown in Table 3. The internal consistency reliability (KR20) of the total test scale was 0.84; that of the standard items was 0.80, and that of the flawed items was 0.55. The reliability of the standard items is thus clearly greater than that of the flawed items.

Table 3: Comparison of averages of difficulty index, discrimination power and reliability of flawed and standard items

Parameter              Flawed items (n=39)   Standard items (n=61)   Significance (t‑test)*
Difficulty index       0.74                  0.75                    0.772
Discrimination power   0.20                  0.28                    0.08
Reliability            0.55                  0.80                    ‑

*P<0.05
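The significance column in Table 3 reports a t‑test comparing flawed and standard items, but the exact procedure is not described. A plausible reconstruction, offered only as a sketch, is an independent‑samples t‑test on the per‑item values; the data below are simulated, with only the group means (0.74 vs. 0.75) taken from Table 3:

```python
import numpy as np
from scipy import stats

# Simulated per-item difficulty indices standing in for the real ones;
# only the group means come from Table 3.
rng = np.random.default_rng(0)
diff_flawed = np.clip(rng.normal(0.74, 0.15, 39), 0, 1)
diff_standard = np.clip(rng.normal(0.75, 0.15, 61), 0, 1)

# Independent-samples t-test on the per-item values.
t, p = stats.ttest_ind(diff_flawed, diff_standard)
print(f"t = {t:.2f}, p = {p:.3f}")  # p > 0.05 -> difference not significant
```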
DISCUSSION

The percentage of flawed items in this study was relatively high (39%); however, it coincides with the results of Tarrant and Ware (27 - 75%) and Downing (33%, and 36 - 65% in two different studies), although it was higher than that of Almuhaidib (17.64%). This result was not surprising in view of the small number of workshops and training sessions conducted for the staff on MC item‑writing guidelines. The finding that 14 out of the 32 conventional item‑writing guidelines were violated indicates poor knowledge of those guidelines among the test developers and underlines the need for more training efforts in this field. Pate and Caldwell identified in their study 17 violations of the 32 item‑writing guidelines. They attributed this to lack of training and to the fact that articles on item‑writing guidelines are published in educational literature outside the mainstream of medical journals. They also found a substantial difference in MC item quality between trained and non‑trained individuals.[11] MC item‑writing skills can be improved by training and regular practice in reviewing MC items for flaws, as shown by Fayyaz Khan et al. in 2013.[12] They showed a reduction in the frequency of item‑writing flaws in their study from 67% in 2009 to 21% in 2011 following the conduct of training workshops for the staff. Regular review of MC items for flaws before and after test administration, and provision of feedback to item writers, were indicated as useful strategies to help increase the staff's awareness of MC item‑writing guidelines and reduce the frequency of flawed MC questions in examinations.[12]




Baig et al. proposed training and encouraging the staff to write MC items that test higher‑order cognitive levels as a means of reducing MC item flaws.[15]

In this study, flawed items passed slightly more students than would otherwise have passed had the flawed items been removed from the test. This is in accordance with the findings of Tarrant and Ware and of Almuhaidib, but contrary to what Downing showed in two separate studies.[7,8] The small magnitude of this difference (1.4%) may be explained by the small number of students taking the test. This finding indicates that some low‑achieving students benefit from flawed items, which is understandable given that some flawed items advantage test‑wise students by providing clues to the correct answer, leading to inflation of their results. The finding that the performance of low‑achieving students was better on flawed items than on standard ones, as indicated by their median scores on those scales, may further explain the above result. However, this finding is contradicted by the almost equal overall difficulty indices of flawed and standard items. Again, this might be influenced by the small number of students taking the test.

On the other hand, it appears that high‑achieving students were disadvantaged by the flawed items, since their median score was better on the standard scale than on the total and flawed test scales. This finding agrees with what Tarrant and Ware demonstrated in their study. They highlighted that high‑achieving students do not tend to use test‑wiseness strategies to answer questions and that, therefore, their performance is affected negatively by some flawed items. This finding may also elaborate the role of the unnecessary difficulty sometimes added to flawed MC questions, referred to as 'construct‑irrelevant variance'. This variable represents an unnecessary difficulty added to the question which jeopardises the construct being tested and distorts the students' performance in such a way that the inferences we can make from the test's results are less valid.

There were no significant differences between the difficulty index and the discrimination power of standard and flawed items. The figures show that flawed items were only 1% more difficult than standard items. While this agrees with the finding of Tarrant and Ware, who found no substantial differences in the difficulty indices of standard and flawed items, Downing showed consistently increased difficulty of flawed items in comparison to standard items in two separate studies.[7,8] Almuhaidib, on the other hand, pointed out that flawed items were easier than standard items. These mixed results are not surprising if we consider the varying effects of flawed items on the difficulty of the question. Some flaws clue the test‑wise students to the correct answer, making the question easier, while other flaws may confuse the students and make the question more difficult to answer. The slightly increased difficulty of flawed items in comparison to standard items observed in this study may be explained by the construct‑irrelevant variance introduced by the flawed items, adding to their difficulty. Flawed items were less discriminating than standard items in this study, a finding also supported by Tarrant and Ware and by Almuhaidib in their series. Similarly, Pate and Caldwell showed that flawed MC items negatively affect students' performance without improving the discrimination between high‑ and low‑achievers.[11] The low discrimination power of the flawed items in comparison to the standard items is a logical consequence of flawed items affecting low‑achieving students positively and high‑achieving students negatively: the scores of high‑achieving students are reduced and those of low‑achieving students artificially inflated, leading to low discrimination power for those questions.

The KR20 determines the internal consistency of responses across a specific number of items.[14] It also indicates how far the different parts of a test are homogeneous and consistently measure one single construct (unidimensionality).[16‑18]
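For reference, the KR20 coefficient has the standard textbook form (general psychometric background, not a formula given in this paper):

```latex
\mathrm{KR20} \;=\; \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^{2}}\right)
```

where k is the number of items, p_i the proportion of examinees answering item i correctly, q_i = 1 − p_i, and σ_X² the variance of the total scores. A set of items that measures a mixture of constructs tends to yield a lower value on its own sub‑scale, which is consistent with the 0.55 reported above for the flawed items.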
Downing, in 2005, attributed the higher reliability of flawed items in his study to the tendency of internal consistency reliability to be affected by random rather than systematic errors of measurement. However, it has been concluded that systematic errors of measurement influence internal consistency reliability as well as random ones.[19] These systematic errors of measurement include test‑specific factors, among them the poor construction of test items. The lower reliability of the flawed MC items in comparison to the standard items in this study likely indicates low homogeneity with the standard items and a tendency of flawed items to measure a different construct from the one the standard items aim to assess. This is further explained by the 'construct‑irrelevant' factor, which contaminates the flawed items, making them either more or less difficult to answer owing to factors that are not related to the construct being tested. In this setting, students' responses and performance, overall, are influenced by factors that are least related to their ability in the subject under assessment. The lower scores of high‑performing students on the flawed items in comparison to their performance on standard items, and the opposite for low‑performing students, support that. Since internal consistency reliability considers students' responses and is related to their scores in a particular test, the validity of the test results is also expected to be jeopardised when the reliability is low,[16,17] which is particularly significant in




high‑stakes examinations like the one in this study. The intricate relationship between reliability and validity is another important factor that necessitates paying more attention to the issue of poorly constructed MC items and their effects.

Summary Box:
• Flawed MC items advantage low‑achieving students and negatively affect their high‑achieving counterparts.
• Flawed MC items are less discriminating, have lower internal consistency and are slightly more difficult than standard items.
• An increased effort is required to spread the knowledge of item‑writing guidelines among faculty members.

Limitations
The results of this study cannot be generalised, for more than one reason. First, the number of students taking the test is small, which may affect the accuracy and reliability of the results. The small number of students is dictated by the small student intake owing to the small size of our college at present. Second, only one examination was included in this study, administered to a single class of students among all other faculty students. This may have introduced selection bias, affecting and limiting the credibility of the results. Nevertheless, this study is the first in our setting, a newly established medical college, to evaluate the frequency of flawed items in our examinations and their impact on the students' achievement, shedding light on the quality of our summative assessments and opening the door for further work in the field.

CONCLUSION

The frequency of flawed items was relatively high in this study. While this result strictly pertains to the examination that was evaluated, there is no logical reason to expect the contrary in other examinations in our setting. This necessitates a great deal of attention on the side of the faculty's administration to encourage efforts in continuous training and faculty development programmes to spread the knowledge of MC item‑writing principles among the staff members. It was shown that flawed MC items negatively affected the performance of high‑achieving students and, at the same time, advantaged low‑achieving students in recognising the correct answer, thereby artificially inflating their results and allowing more of them to pass the test than would otherwise have done. Flawed MC questions were found to be slightly more difficult, less discriminating and less reliable than standard MC items, although these differences were not statistically significant.

Further similar studies are essentially required in our setting, across different classes and examinations, to help further explore the true picture of the quality of our assessments.

Acknowledgement
We would like to thank Mr. Abdalla Elkhalifa who provided generous help with the statistical analysis of the results.

Financial support and sponsorship
Nil.

Conflicts of interest
There are no conflicts of interest.

REFERENCES
1. Kapur E, Kulenović A. Analysis of difficulty and discrimination indices in one‑best answer multiple choice questions of an anatomy paper. Folia Medica 2010;45:14‑20.
2. McCoubrie P. Improving the fairness of multiple‑choice questions: A literature review. Med Teach 2004;26:709‑12.
3. Tarrant M, Ware J. A comparison of the psychometric properties of three‑ and four‑option multiple‑choice questions in nursing assessments. Nurse Educ Today 2010;30:539‑43.
4. Epstein RM. Assessment in medical education. N Engl J Med 2007;356:387‑96.
5. Hettiaratchi ES. A comparison of student performance in two parallel physiology tests in multiple choice and short answer forms. Med Educ 1978;12:290‑6.
6. Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate education: Modified essay or multiple choice questions? Research paper. BMC Med Educ 2007;7:49.
7. Downing SM. Construct‑irrelevant variance and flawed test questions: Do multiple‑choice item‑writing principles make any difference? Acad Med 2002;77(10 Suppl):S103‑4.
8. Downing SM. The effects of violating standard item writing principles on tests and students: The consequences of using flawed test items on achievement examinations in medical education. Adv Health Sci Educ Theory Pract 2005;10:133‑43.
9. Haladyna T, Downing S, Rodriguez M. A review of multiple‑choice item‑writing guidelines for classroom assessment. Appl Meas Educ 2002;15:309‑34.
10. Haladyna T, Downing S. Validity of a taxonomy of multiple‑choice item‑writing rules. Appl Meas Educ 1989;2:51‑72.
11. Pate A, Caldwell D. Effects of multiple‑choice item‑writing guideline utilization on item and student performance. Curr Pharm Teach Learn 2013;6:130‑4.
12. Fayyaz Khan H, Farooq Danish K, Saeed Awan A, Anwar M. Identification of technical item flaws leads to improvement of the quality of single best multiple choice questions. Pak J Med Sci 2013;29:715‑8.
13. Almuhaidib N. Types of item‑writing flaws in multiple choice question pattern – A comparative study. Umm Al Qura Univ J Educ Psychol Sci 2010;2:10‑45.
14. Tarrant M, Ware J. Impact of item‑writing flaws in multiple‑choice questions on student achievement in high‑stakes nursing assessments. Med Educ 2008;42:198‑206.




15. Baig M, Ali SK, Ali S, Huda N. Evaluation of multiple choice and short essay question items in basic medical sciences. Pak J Med Sci 2014;30:3‑6.
16. Schumacker R, Smith E. Reliability: A Rasch perspective. Educ Psychol Meas 2007;67:394‑409.
17. Dressel P. Some remarks on the Kuder‑Richardson reliability coefficient. Psychometrika 1940;5:305‑10.
18. Berk R. Ask Mister Assessment Person: How do you estimate the reliability of teacher licensure/certification tests? 2000. p. 1‑16. Available from: http://www.images.pearsonassessments.com/images/NES./2000_11Berk_440_1.pdf. [Last accessed on 2016 Sep 18].
19. Leech N, Onwuegbuzie A, O'Conner R. Assessing internal consistency in counseling research. Couns Outcome Res Eval 2011;2:115‑25.


