Manual On Test Item Construction Techniques
DRAFT
June 2011
Facilitators:
Rana Riaz Saeed, Senior Manager-NTS
Murtaza Noor, Project Manager-HEC
TABLE OF CONTENTS
1. BACKGROUND ................................................................................................................ 4
1.2. Objectives ......................................................................................................................... 4
1.3. Training Module ................................................................................................................ 4
1.4. Aims .................................................................................................................................. 5
1.5. Objectives of the Module .................................................................................................. 5
1.6. Brief Introduction of Module Development Experts.......................................................... 5
Day-1:
3. INTRODUCTION TO TESTING, ASSESSMENT AND EVALUATION .......................... 7
SESSION-1:
Introduction of Testing, Assessment and Evaluation in Semester System ..................... 7
SESSION 2: Types of Tests and Assessment
Types of Educational Tests .............................................................................................. 7
Group Work: Types of Educational Assessments ............................................................ 8
SESSION 3:
4. PLANNING THE TEST..................................................................................................... 9
Instructional Objectives and Course Outcomes ............................................................... 9
Planning the Test .............................................................................................................. 9
The Contents..................................................................................................................... 9
Analysis of Curriculum and Textbooks ............................................................................. 9
Judgments of Experts ....................................................................................................... 9
Objectives of the Test ..................................................................................................... 10
Guidelines for Writing Objectives of the Test ................................................................. 10
Bloom’s Taxonomy of Educational Objectives ................................................................. 10
Illustrative Examples ....................................................................................................... 11
Preparing the Test Blue Print ......................................................................................... 12
Preparing an Outline of Test Contents ........................................................................... 12
Techniques of Determining Test Contents ..................................................................... 12
Table of Specifications.................................................................................................... 15
Test Length ..................................................................................................................... 15
General Principles of Writing Questions for an Achievement test ................................. 16
Day-2
Session-1: Presentation of Home Assignment
Session- 2&3
5. TYPES AND TECHNIQUES OF TEST ITEM WRITING ............................................... 18
What is a Test? ............................................................................................................... 18
Achievement Test ........................................................................................................... 18
Preparing the test according to Plan .............................................................................. 19
Items commonly used for Tests of Achievement ........................................................... 19
Constructing Objective Test Items ................................................................................. 19
Alternative Response Items ............................................................................................ 20
Uses of True-False Items ............................................................................................... 20
Suggestions for Constructing True-False Items ............................................................. 21
Short-Answer / Completion Items ................................................................................... 21
Constructing Short-Answer Items ................................................................................... 22
Multiple Choice Questions .............................................................................................. 22
Characteristics of Multiple Choice Questions................................................................. 23
Desired Characteristics of items ..................................................................................... 23
Advantages of MCQ........................................................................................................ 23
Limitations of MCQ ......................................................................................................... 23
Rules for Constructing Multiple - Choice Items .............................................................. 23
A variety of multiple choice items ................................................................................... 24
Context Dependent Items ............................................................................................... 25
Day-3
6. Step 2: PREPARING THE TEST ACCORDING TO PLAN .......................................... 26
Step 3: Test Administration and Use .............................................................................. 27
Awarding Grades .......................................................................................................... 28
Day 4
Session 1
8. EVALUATION OF ITEMS .............................................................................................. 43
Item Analysis ................................................................................................................... 43
Item Difficulty................................................................................................................... 43
Item Discrimination ......................................................................................................... 43
Interpreting Distracter Values ......................................................................................... 45
Effectiveness of Distracters ............................................................................................ 46
Session 2: Item Analysis and Subsequent Decision ................................................ 46
Decisions Subsequent to Item Analysis ......................................................................... 47
Session 3: Desirable Qualities of a Test .................................................................... 47
Reliability ......................................................................................................................... 47
How to Estimate Reliability? ........................................................................................... 48
Validity ............................................................................................................................. 49
Practicality ....................................................................................................................... 50
Objectivity........................................................................................................................ 50
APPENDICES
Appendix-A: Activity material for flawed items
Appendix-B: Table of Specifications / Test Blueprint
Appendix-C: Cognitive Domain, Instructional Objectives and item example
Appendix-D: Professional Ethics in Assessment (Chapter Reading)
Appendix-E: PowerPoint Presentations
Training Module on “Effective Test Item Construction Techniques”
The Higher Education Commission (HEC) is the primary regulator of higher education in
Pakistan. Its main purpose is to upgrade the universities of Pakistan into centres of
education, research and development, and it facilitates the development of the higher
education system in Pakistan. The HEC is also playing a leading role in building a
knowledge-based economy in Pakistan by awarding hundreds of doctoral scholarships for
education abroad every year.
In Pakistan, there exist several educational assessment and evaluation systems, including
provincial, federal and divisional Boards of Secondary and Higher Secondary examinations
and universities’ annual or semester exams. Generally, those responsible for marking and
evaluation have little or no training or orientation in test item development and marking
techniques. There is no dedicated institute to impart training in internationally recognized
and standardized test and examination paper construction and marking techniques. Keeping
in view this background, NTS is planning to conduct a series of workshops for the teaching
faculty of public sector universities in collaboration with HEC.
Objectives
The overall objective of the workshops is to acquaint university faculty members with the
preliminary steps of student evaluation in the semester system and with effective test item
construction techniques.
Training Module
In order to initiate the training workshop, NTS and HEC developed a training module on
“Effective Test Item Construction Techniques” by engaging educational and psychometric
experts from various faculties of HEC-affiliated universities and educational institutions in a
two-day consultative meeting held on 17–18 January 2011 at the NTS Secretariat, Islamabad.
This training module will be used for conducting training sessions for target university
teachers from both the natural and social sciences in all major cities of Pakistan.
Aims
This module is designed to provide an orientation for university teachers to the goals,
concepts, principles and concerns of testing, assessment and evaluation. It will also provide
teachers with an opportunity to design and construct useful test items for assessment and
evaluation in their respective subjects for undergraduate and postgraduate classes.
In addition, the social sciences departments selected under the HEC project titled
“Development/Strengthening of Selected Departments of Social Sciences & Humanities”
would be part and parcel of this initiative, as one of the key objectives of the project is
building linkages with other professional organizations.
She has been a researcher and writer on psychology, social and gender issues. Her
research publications number more than one hundred, and she is also the author of six
books. She is a member of many national and international professional organizations.
Her publications include Psychology of Women (a textbook for MA students), Psychological
Profile of Rural Women, Case Studies of Successful Women, and Voiceless Melodies.
Currently she is working as Executive Director of Gender and Psychological Services, an NGO.
Dr. Iffat S. Dar received her PhD, with an emphasis on psychological assessment, from the
University of Minnesota, USA, and her master’s degree from Columbia University, New
York, USA. Her last regular assignment was as Chief Psychologist at the Federal Public
Service Commission (FPSC), Islamabad. Since her retirement she has been working as a
consultant with several national and international organizations, including the Ministry of
Education Islamabad, the World Bank, the Asian Development Bank, UNESCO and
UNICEF. She has been a visiting professor at various universities of Pakistan, including
QAU, FJWU and Foundation University.
Prof. Dr. Rukhsana Kausar earned her PhD from the University of Surrey, UK, and
completed post-doctoral work at St. George’s Hospital, London, UK, as a Commonwealth
Fellow. She has been working at the Department of Applied Psychology, University of the
Punjab, for the last 23 years in various academic and administrative capacities, and is
currently serving her second tenure as Chairperson of the department. Dr. Rukhsana also
worked for two years at Fatima Jinnah Women University, Rawalpindi, as Chairperson and
Associate Dean. She has been an active researcher and has supervised numerous M.Sc.,
M.Phil. and PhD research theses. Her research work has been presented extensively at
international conferences, and she has published about 45 research articles in journals of
national and international repute. She is a member of various international professional
bodies in the discipline, such as the BPS, APA, IMSA, EHPS, ICP, IAAP and ECP.
Dr. Mah Nazir Riaz is Professor of Psychology and Dean of Social Sciences, Frontier
Women University, Peshawar, Pakistan. Dr. Riaz received her doctorate in psychometrics
from the University of Peshawar, NWFP, Pakistan (1979). Her academic career spans 40
years of university teaching. She joined the University of Peshawar on October 1, 1968 as
a lecturer and retired as a Professor of Psychology from the same university in December
2003. She won the University Gold Medal and the President of Pakistan’s Award on
completion of her master’s studies (1966), and received an award for her outstanding
academic achievements as a Professor of Psychology (2003). She won the Star Women
International Award (1996) and received a Distinguished Professor Award for meritorious
services from the Ministry of Education, Govt. of NWFP (2003). She also served as
Professor of Psychology at the National Institute of Psychology, Center of Excellence,
Quaid-i-Azam University, Islamabad, for three years (1999–2002). During this period, she
won the President of Pakistan’s Award “Izaz-e-Kamal” (gold medal and cash prize) for her
lifetime achievements. She joined Frontier Women University, Peshawar, as Dean of
Social Sciences in 2006, and was nominated as an Eminent Educationist and Researcher
by the Higher Education Commission, Islamabad (2006).
She has published more than 60 research papers in national and international journals. In
addition, she is the author of three textbooks: (1) Psychology, 2005; (2) Areas of Psychology
Day 1
Session 1: Rationale to Assessment and Evaluation in Semester System
The session will begin with a brief introduction of the participants. The trainer will introduce
herself and then ask the participants to introduce themselves.
The trainer will talk about the importance of well-constructed examinations, as exams are
the goal posts that guide and motivate students to learn. We all know from our own
experience how students prepare for examinations: they learn not only what interests them
most or is presented in a better way, but also what type of paper they expect from the
teacher. For this reason, a well-prepared examination paper is a guarantee of an effective
teaching–learning process.
Examinations have undergone radical change in the past fifty years due to improvements in
measurement techniques and a better understanding of learning processes. Compared
with a lengthy three-hour essay-type examination, a thirty-minute objective-type paper can
assess students more comprehensively, covering not only knowledge but also the
comprehension and application of knowledge. Additionally, a well-prepared paper can
evaluate students objectively and quickly, so a large number of students in a class is not a
problem.
Why should we change our traditional system to a newer one? Obviously, the volume of
knowledge has increased so much that our youth need to learn a larger number of subjects
for the same degree. For this reason, educational institutions adopted the semester system,
and most universities are now moving towards a quarter system. Students have to cover at
least a year’s worth of courses in one semester. This has become necessary to meet world
standards of education. Additionally, there is a concept of continuous assessment, which
requires more than one examination per semester, as well as class assignments and
projects to assess different types of learning, e.g., the ability to express oneself in writing
and the ability to collect data and draw conclusions from empirical information.
This session is based on Bloom’s Taxonomy and on how one can devise techniques to
assess different types of abilities.
There are different types of tests used by educationists and psychologists. In fact,
psychologists have influenced the evaluation system more than one likes to give them
credit for. One can find a large number of tests of all types of abilities, aptitudes and
abnormalities in test catalogues; however, we are going to talk only about tests concerned
with classroom achievement. These are also called scholastic achievement tests. We are
going to talk about the following three types of tests:
Reference Material
Thorndike’s book on educational measurement is regarded as a bible of test
development. Guilford’s book on statistics, along with more modern books on
test development, should be read by all the participants and be provided as
references during the workshop.
In an interactive session, the participants will discuss different types of assessments. They
will be divided into four groups, each assigned one type of assessment, and will present
their reports.
The learning outcome for this session is to help the participants to:
Conceive of Instructional objectives and course outcomes
Design table of specifications in accordance with instructional objectives
Education is a process that helps the students change in many ways. Some aspects of the
change are intentional whereas others are quite unintentional. Keeping in view this
assumption, one of the important tasks of university teachers is to decide, as far as
possible, how they want their students to change and to determine their role in this process of change.
Secondly, upon the completion of a particular unit/ course, the teachers need to determine
whether their students have changed in the desired manner. Furthermore, they have to
assess the unintentional outcomes as well.
1) The Content
Judgments of Experts
To devise a classroom test, the advice and assistance of subject-matter experts, serving as
consultants, can prove to be of immense importance. The best way to seek consultants’
judgments is to submit to them a tentative outline prepared after studying the instructional
objectives as stated in representative curricula and the subject-matter content as indicated
by up-to-date and widely used textbooks.
Objectives
The basic objective of an educational achievement test is to assess the desired changes
brought about by the teaching-learning process. Obviously, each subject demands a
different set of instructional objectives. For example, major objectives of the subjects like
sciences, social sciences, and mathematics are: knowledge, understanding, application and
skill. On the other hand, the major objectives of a language course are: knowledge,
comprehension and expression. The knowledge objective is considered the lowest level of
learning, whereas understanding and application of knowledge in the sciences and
behavioral sciences are considered higher levels of learning.
As the basic objectives of education are concerned with the modification of human behavior,
the teacher must determine measurable cognitive outcomes of instruction at the beginning of
the course. The evaluation process determines the extent to which the objectives have been
attained, both for the individual students and for the class as a whole. Such an evaluation
provides feedback that can suggest modification of either the objectives or the instruction or
both.
Some objectives are stated as broad, general, long-range goals, e.g., ability to exercise the
mental functions of reasoning, imagination, critical appreciation. These educational
objectives are too general to be measured by classroom tests and need to be operationally
defined by the class teacher.
The Cognitive Domain is the core of curriculum and test development. It is largely
concerned with descriptions of student behavior in terms of knowledge,
understanding, and abilities that can be demonstrated.
The Affective Domain includes objectives that emphasize interests, attitudes, and
values and the development of appreciations.
Knowledge level
Tasks at the knowledge level involve the psychological process of remembering.
Items in the knowledge category involve the ability to recall important information, such
as knowledge of specific facts, definitions of important terms, familiarity with important
concepts, etc. Thus knowledge-level questions are formulated to assess previously
learned information.
Comprehension Level
The common cognitive processes required at comprehension level are translation,
interpretation, and extrapolation.
Application Level
Tasks at the application level require use of previously learned information in new
and concrete situations to solve a problem. This requires mastery of a concept well
enough to recognize when and how to use it correctly in an unfamiliar or novel
situation. The fact that most of what we learn is intended to be applied to problem
situations in our everyday life demonstrates well the importance of application
objectives in the curriculum.
The taxonomy levels of knowledge, comprehension, and application are considered
more valuable for curriculum development and educational evaluation than analysis,
synthesis, and evaluation. Furthermore, the taxonomy does not suggest that all good
tests or evaluation techniques must include items from every level of taxonomy.
Illustrative Examples
Which of the following does not belong with the others?
a. Aptitude tests
b. Personality tests
c. Intelligence tests
d. Achievement tests
Which of the following measures involve nominal data?
a. the test score on an examination
b. the number on a basketball player’s jersey
c. the speed of an automobile
d. the class rank of a college student
A psychologist monitors a group of nursery-school children, recording each
instance of altruistic behavior as it occurs. The psychologist is using:
a. case studies
b. the experimental method
c. naturalistic observation
d. the survey method
In a study of the effect of a new teaching technique on students’ achievement test
scores, an important extraneous variable would be the students’:
a. hair color
b. athletic skills
c. IQ scores
d. sociability
The results of Milgram’s (1963) study imply that:
a. In the real world, most people will refuse to follow orders to inflict harm on a stranger.
b. Many people will obey an authority figure even if innocent people get hurt.
c. Most people are willing to give obviously wrong answers when ordered to do so.
d. Most people stick to their own judgment, even when group members unanimously disagree.
What is your chance of flipping 4 heads or 4 tails in a row with a fair coin (i.e., one
that comes up heads 50% of the time)?
a. .0625
b. .125
c. .250
d. .375
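The keyed answer is option b: the probability of four heads in a row is (1/2)^4 = .0625, four tails is equally likely, and the two events are mutually exclusive, so the combined probability is 2 × .0625 = .125. For facilitators who want to check such keys, the answer can be verified by brute-force enumeration; the following sketch is illustrative only and not part of the original module:

```python
from itertools import product

# Enumerate all 2**4 equally likely outcomes of four fair coin flips.
outcomes = list(product("HT", repeat=4))

# A favorable sequence is all heads or all tails (one distinct symbol).
favorable = sum(1 for seq in outcomes if len(set(seq)) == 1)

probability = favorable / len(outcomes)
print(probability)  # 0.125
```

Enumerating the 16 equally likely sequences confirms that exactly two (HHHH and TTTT) qualify, matching option b.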
Problem Solving
Problem solving refers to active efforts to discover an underlying process leading to the
achievement of a goal, for example, in series-completion, analogy, and transformation
problems.
Examples1
1. A teacher had 28 students in her class. All but 7 of them went on a museum trip and
thus were away for the day. How many students remained in the class that day?
2. The water lilies on the surface of a small pond double in area every 24 hours. From
the time the first water lily appears until the pond is completely covered takes 60
days. On what day is half of the pond covered with lilies?
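Both items are constructed so that a surface-level computation misleads: "all but 7 went" means 7 students remained (not 28 - 7 = 21), and a pond that doubles daily and is full on day 60 must be half covered on day 59 (not day 30). The doubling logic can be checked by working backwards from full coverage; the sketch below is illustrative and not part of the original module:

```python
# Water-lily problem: coverage doubles every 24 hours and the pond is
# fully covered on day 60. Working backwards, halving one day's
# coverage gives the previous day's coverage.
FULL = 1.0  # fraction of the pond covered on day 60

coverage = FULL
day = 60
while coverage > 0.5:  # step back until we reach half coverage
    coverage /= 2
    day -= 1

print(day, coverage)  # 59 0.5
```

One backward step from day 60 already reaches half coverage, confirming day 59 as the answer.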
In the following illustrations, we have used only three levels: knowledge
(recall/recognition), comprehension (understanding) and application (or skills), and
labeled the columns accordingly.
The blueprint is meant to ensure the content validity of the test, which is the most important
characteristic of an achievement test devised to determine the GPA at the end of a unit,
term or course of instruction. The test may be based on several lessons or chapters in a
textbook, reflecting a balance between content areas and learning objectives. The test
blueprint must specify both the content and process objectives in proportion to their relative
importance and emphasis in the curriculum.
Depending on the purpose of the test and the instructional objectives, the test may vary in
length, difficulty, and format (objective, essay, short-answer, open-book, or take-home).
1 Source: Sternberg, R. J. (1986). Intelligence Applied: Understanding and Increasing Your Intellectual
Skills. New York: Harcourt Brace Jovanovich.
Table 1
Test Blue Print/ General Layout of an Achievement Test

Purpose of the Test: Minimum competency, mastery, diagnosis, selection.
Nature of the Test: Norm-referenced or criterion-referenced.
Target Population: School children, college or university students, trainees of a course, employees of an organization.
Format of Items: Objective type (multiple-choice, true-false, matching, completion), short answer, essay type, computer-administered.
Test Length: Approximate number of items comprising each category (objective type, essay type, short answers).
Testing Time: Maximum time limit.
Mode of Test Administration: Individual, group, computer.
Examiners’ Characteristics: Qualifications, training, experience.
Test Content: Verbal, pictorial, performance.
Sources of Test Content: Textbooks, subject experts, curriculum.
Number of Items per Content Stratum: Depends on the relative importance of content areas.
Appropriate Difficulty: Difficulty level in relation to the purpose of the test, such as minimum competency, mastery, selection, diagnosis, etc.
Taxonomy Level of the Items: Knowledge, comprehension, application, analysis, synthesis, evaluation.
Scoring Procedure: Hand scoring, machine scoring, computer-assisted scoring, or grading (as in essay examinations).
Interpretation: Norms, percentiles, grade equivalents, etc.
Item Analysis: Qualitative, quantitative, or both.
Reliability Techniques: Test-retest, parallel form, Kuder-Richardson, examiner or inter-rater.
Validity: Content, criterion-related (predictive/concurrent), construct.
Source: Riaz, M.N. (2008). Test Construction: Development and Standardization of Psychological Tests
in Pakistan. Islamabad: HEC.
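The planning decisions listed in Table 1 can be captured as a simple structured record when drafting a test plan. The following sketch is illustrative only; the field names and sample values are assumptions for the example, not prescribed by the source:

```python
from dataclasses import dataclass, field

# A compact record mirroring the planning decisions of Table 1.
@dataclass
class TestBlueprint:
    purpose: str                  # e.g. mastery, diagnosis, selection
    nature: str                   # norm-referenced or criterion-referenced
    target_population: str
    item_formats: list = field(default_factory=list)
    test_length: int = 0          # approximate number of items
    testing_time_minutes: int = 0 # maximum time limit

# Hypothetical plan for a criterion-referenced classroom test.
plan = TestBlueprint(
    purpose="mastery",
    nature="criterion-referenced",
    target_population="university students",
    item_formats=["multiple-choice", "short answer"],
    test_length=50,
    testing_time_minutes=60,
)
print(plan.test_length)  # 50
```

Writing the plan down in one place, whatever the format, forces the decisions in Table 1 to be made explicitly before item writing begins.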
The term test content refers to a representative sample of the course content, skills and
cognitive levels/instructional objectives to be measured by the test. The author of the test
has to prepare a test plan, or table of specifications, that clearly shows the relative emphasis
on various topics and different types of behavior.
Our first illustration relates to a Premedical Science Achievement Test. We may prepare a
100-item test according to the following table of specifications, showing instructional
objectives and content areas.
Table 2
Specifications related to Instructional Objectives and Content of a Premedical Science
Achievement Test

1. Objectives of instruction                                       Percent of Items
   a. Recall of basic concepts                                            30
   b. Comprehension, interpretation, analysis of scientific content       40
   c. Application of concepts, principles, etc.                           30
                                                                         100

2. Content areas                                                   Percent of Items
   a. Biology                                                             40
   b. Chemistry                                                           40
   c. Physics                                                             20
                                                                         100

Source: Riaz, M.N. (2008). Test Construction: Development and Standardization of Psychological Tests
in Pakistan. Islamabad: HEC.
In practice, a much more detailed outline of the contents within each cell of a table is needed
before test construction proceeds. Combined in a two-way table, the above specifications
are presented in the following table (Table 3).
Table 3
Number of Items in Each Category of a Premedical Science Achievement Test

Content/      Recall of Basic   Comprehension   Application of Concepts,   Total
Subjects      Concepts                          Principles, etc.
Biology            12                16                  12                 40
Chemistry          12                16                  12                 40
Physics             6                 8                   6                 20
Total              30                40                  30                100

Source: Riaz, M.N. (2008). Test Construction: Development and Standardization of Psychological Tests
in Pakistan. Islamabad: HEC.
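Each cell in Table 3 is simply the product of the two marginal percentages from Table 2 applied to the total test length (e.g., Biology × Recall = 40% × 30% of 100 items = 12). A minimal sketch of this computation follows; the variable names are illustrative, not from the source:

```python
# Marginal weights from Table 2, expressed as fractions of the test.
content_weights = {"Biology": 0.40, "Chemistry": 0.40, "Physics": 0.20}
objective_weights = {"Recall": 0.30, "Comprehension": 0.40, "Application": 0.30}
TOTAL_ITEMS = 100

# Items per cell = content weight x objective weight x test length.
blueprint = {
    (subject, objective): round(cw * ow * TOTAL_ITEMS)
    for subject, cw in content_weights.items()
    for objective, ow in objective_weights.items()
}

print(blueprint[("Biology", "Comprehension")])  # 16
print(blueprint[("Physics", "Recall")])         # 6
```

Because both sets of weights sum to 1, the nine cell counts add back up to the 100-item total, reproducing Table 3 exactly.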
Table 4
Number of items in each category of an Achievement Test “Principles of
Psychological Measurement”

Content/Subject-matter                        Recall of Basic  Comprehension,  Analysis, Synthesis,   Total
                                              Concepts         Application     Evaluation
1. Basic statistical concepts: variability,
   correlation, prediction                          1                2                2              5 (10%)
2. Scales, transformation, norms                    3                2                0              5 (10%)
3. Reliability: concepts, theory and
   methods of estimation                            3                3                4             10 (20%)
4. Validity: content, construct,
   criterion-related validity                       4                6                5             15 (30%)
5. Item analysis: item characteristics,
   distracter analysis, item discrimination,
   item characteristic curves                       4                7                4             15 (30%)
Total                                           15 (30%)         20 (40%)         15 (30%)             50

Source: Riaz, M.N. (2008). Test Construction: Development and Standardization of Psychological Tests in
Pakistan. Islamabad: HEC.
The above table shows a test outline for an Achievement Test based on “Principles of
Psychological measurement” (Part II: Chapters 4-10 of Psychological Testing by Murphy,
K. R., & Davidshofer, C. O., 1998).
Table of Specifications
A table of specifications is a two-way table that lists along one axis the content
areas/topics that the teacher has taught during the specified period and, along the other
axis, the cognitive level at which each is to be measured. In other words, the table of
specifications highlights how much emphasis is to be given to each objective or topic.
Table 5
A classroom test in Experimental Psychology: Semester 1

Subject/   Instructional Objectives: Bloom’s Taxonomy
Content    Knowledge &     Application   Analysis, Synthesis   Totals
           Comprehension                 & Evaluation
Topic A        10%             20%              10%              40%
Topic B        15%             15%              30%              60%
Total          25%             35%              40%             100%
While writing the test items, it may not be possible to adhere very rigorously to the
weights assigned to each cell of a table of specifications like the one presented in Table 5.
Thus, the weights indicated in the original table may need to be changed slightly during the
course of test construction, if the teacher encounters sound reasons for such a change. For
instance, the teacher may find it appropriate to modify the original test plan in view of data
obtained from the experimental try-out of the new test.
Table 6
Table of specifications showing one subject area (Problem Solving) and three
levels of cognitive objectives (Knowledge, Comprehension, Application)

                                          Objectives                        Number of
Content/Topics                  Knowledge  Comprehension  Application         Items
Problem solving in search of       10%          -              -               10%
  solutions
Barriers to effective problem      10%         15%             -               25%
  solving
Approaches to problem solving       -          15%            20%              35%
Culture, cognitive style, and       -           -             30%              30%
  problem solving
The above table shows that the first topic is to be measured only at the knowledge level, and
the fourth topic only at the application level. The second and third topics are each to be
measured at two different levels: topic 2 at Knowledge and Comprehension, and topic 3 at
Comprehension and Application. Preparing a test according to the above table of
specifications means that 20% of the items in our test measure Knowledge, 30% measure
Comprehension, and 50% measure Application.
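The arithmetic implied by the table above can be sketched in a few lines of code. This is an illustrative sketch only: the topics and percentages follow Table 6, while the 20-item test length is an assumed value, not part of the original plan.

```python
# Illustrative sketch: converting a table of specifications into item counts.
# The topic names and percentages follow Table 6; the 20-item test length
# is an assumed value for illustration only.

spec = {
    "Problem solving in search of solutions": {"Knowledge": 10},
    "Barriers to effective problem solving": {"Knowledge": 10, "Comprehension": 15},
    "Approaches to problem solving": {"Comprehension": 15, "Application": 20},
    "Culture, cognitive style, and problem solving": {"Application": 30},
}

total_items = 20  # assumed test length

for topic, cells in spec.items():
    for level, pct in cells.items():
        n_items = round(total_items * pct / 100)
        print(f"{topic} / {level}: {n_items} items")
```

Run against the percentages above, this yields 4 Knowledge, 6 Comprehension, and 10 Application items out of 20, matching the 20% / 30% / 50% split described in the text.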
Test Length
The number of items that should constitute the final form of a test is determined by the
purpose of the test or its proposed uses, and by the statistical characteristics of the items.
Some of the important considerations in setting test length are:
1. The optimal number of items for a homogeneous test is lower than for a highly
heterogeneous test.
2. Items meant to assess higher thought processes such as logical reasoning,
creativity, and abstract thinking require more time than those that depend on the
recall of important information.
3. Another important consideration in determining the length of a test, and the time
required for it, is the validity and reliability of the test. The teacher has to
determine the number of items that will yield maximum validity and reliability for the
particular test.
13. Directions to the examinees should be as simple, clear, and precise as possible, so
that even those students who are of below average ability can clearly understand
what they are expected to do.
14. Scoring procedures must be clearly defined before the test is administered.
15. The test constructor must clearly state optimal testing conditions for test
administration.
16. Item analysis should be carried out to make necessary changes, if any ambiguity is
found in the items.
Day 2: Session: 1
a. Presentations based on home assignment.
b. Participants will share their drafts with group members for discussion and further
input.
What is a Test?
A test is an instrument or a tool. It follows a systematic procedure for measuring a sample of
behavior by posing a set of questions in a uniform manner. It is an attempt to measure what
a person knows or can do at a particular point in time. Furthermore, a test answers the
question ‘how well’ does the individual perform either in comparison with others or in
comparison with a domain of performance tasks?
Achievement Test
A test designed to appraise what the individual has learned as a result of planned previous
experience or training is an Achievement Test. Since it relates to what has already been
learnt, its frame of reference is the present or the past.
Basic Assumptions
Preparation of an achievement test assumes that the content and/or skill domain covered by
the test can be specified in behavioral terms, and in a manner that is readily communicable
to other persons. It is important that the test measures the important goals rather than
peripheral or incidental ones. It also assumes that the test takers have had the opportunity
to learn the material covered by the test.
Achievement tests attempt to measure what a person knows or can do at a particular point in
time. Furthermore, our reference is usually to the past; that is, we are interested in what has
been learned as a result of a particular course, experience, or series of experiences.
Step 2
Constructed Response / Supply type items are dealt with in another module. Here we
restrict ourselves to the use, limitations, and construction of Structured Response / Select
type items.
Incomplete sentences that provide two options to choose from to fill in the blank also fall in
this category. A very common use of such items is to test knowledge of grammar (for
example, appropriate use of tense), the contextual meaning of words, or the spelling of
words that sound alike.
In each case there are only two possible answers. The most common form such items take
is the True-False question.
True-false tests include numerous opinion statements to which the examinee is asked to
respond true or false. There is no objective basis for determining whether a statement of
opinion is true or false. In most situations, when a student is the respondent, he guesses
what opinion the teacher holds and marks the answers accordingly. This, of course, is
undesirable from every standpoint: testing, teaching, and learning. An alternative procedure is to
attribute the opinion to some source, making it possible to mark the statements true or false
with some objectivity. This would allow measuring knowledge concerning the beliefs that
may be held by an individual or the values supported by an organization or institution.
Another aspect of understanding that can be measured by the true-false item is the ability to
recognize cause-and-effect relationships. This type of item usually contains two true
propositions in one statement, and the examinee is to judge whether the relationship
between them is true or false.
The true-false item also can be used to measure some simple aspects of logic.
Criticism
A common criticism of the true-false item is that an examinee may be able to recognize a
false statement as incorrect but still not know what is correct. To overcome such difficulties,
some teachers prefer to have the students change all false statements to true.
Advantage
A major advantage of true-false items is that they are efficient: students can typically
respond to roughly three true-false items in the time it takes to respond to two multiple-choice
items.
True-false items also have utility for measuring a broad range of verbal knowledge, so a
wide sampling of course material can be obtained.
Limitations
The limitations of true-false items lie in the types of learning outcomes that can be
measured.
The apparent ease of constructing true-false items is, unfortunately, more illusory than real.
True-false items are not especially useful beyond the knowledge area.
The exceptions to this seem to be distinguishing between fact and opinion and
identifying cause-and-effect relationships. These two outcomes measured by the
true-false item can be measured more effectively by other forms of selection items,
especially the multiple-choice form.
Another factor that limits the usefulness of the true-false item is its susceptibility to
guessing.
Successful guessing on true-false items has at least two deleterious effects:
1. First, the reliability of each item is low.
2. Second, such a test has very little diagnostic value.
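The susceptibility to guessing can be made concrete with a small calculation. This is a hedged sketch: the 50-item test length is an assumed figure for illustration only.

```python
# Sketch: expected score from blind guessing on selection-type items.
# With k equally likely options, the chance of guessing one item right is
# 1/k, so the expected number correct from pure guessing is n/k.

def expected_chance_score(n_items: int, n_options: int) -> float:
    """Expected number of items answered correctly by pure guessing."""
    return n_items / n_options

# An assumed 50-item test: true-false (2 options) vs. four-option MCQ.
print(expected_chance_score(50, 2))  # 25.0 items right by chance alone
print(expected_chance_score(50, 4))  # 12.5 items right by chance alone
```

The comparison illustrates why adding options lowers the chance component of a score: half of a true-false test can be answered correctly with no knowledge at all.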
Another concern that needs to be considered in the design of tests with true-false items is
student response sets.
A response set is a consistent tendency to follow a certain pattern in responding to
test items.
Note: True-false items are most useful in situations in which there are only two possible
alternatives (for instance right, left, more, less, who, whom, and so on) and special uses
such as distinguishing fact from opinion, cause from effect, superstition from scientific belief,
relevant from irrelevant information, valid conclusions, and the like.
The short-answer item uses a direct question whereas the completion item consists of an
incomplete statement.
Short-answer item is especially useful for measuring problem-solving ability in
science and mathematics.
Complex interpretations can be made when the short- answer item is used to
measure the ability to interpret diagrams, charts, graphs, and pictorial data.
When short-answer items are used the question must be stated clearly and concisely. It
should be free from irrelevant clues, and require an answer that is both brief and definite.
In a short-answer type item the student must supply the answer; this reduces the possibility
that the examinee will obtain the correct answer by guessing.
Limitations of Short-Answer
It is not suitable for measuring complex learning outcomes. Unless the question is carefully
phrased, many answers of varying degrees of correctness must be considered for total or
partial credit; hence it is difficult to score.
These limitations are less troublesome when the answer is to be expressed in numbers or
symbols, as in physical science or mathematics.
The number of options used differs from one test to another. An item must have at least
three answer choices to be classified as a multiple choice item; the typical pattern is to have
four or five choices to reduce the probability of guessing the answer. In a good item, all of
the presented options look like probable answers, at least to those examinees who do not
know the answer.
3. Distracters: appear to be reasonable answers to the examinee who does not know
the content
4. Options: include the distracters and the keyed response.
1. The MCQ is the most flexible of the objective type items. It can be used to appraise
the achievement of any educational objectives that can be measured by a paper-and-
pencil test except those relating to skill in written expression and originality.
2. An ingenious and talented item writer can construct an MCQ to measure a variety of
educational objectives from rote learning to more complex learning outcomes like
comprehension, interpretation, application of knowledge and also those that require
the skills of analysis or synthesis to arrive at the correct answer.
3. Moreover, the chances of getting a correct response by guessing are significantly
reduced.
However, good multiple choice items are difficult to construct. A thorough grasp of the
subject matter and skillful application of certain rules is needed to construct good multiple
choice items.
Advantages of MCQ
Wide sampling of content
The problem or the task is well structured or clearly defined.
Flexible Difficulty Level
Efficient scoring of items
Objective scoring
Provide scores that are easily understood and transformed as needed:
Multiple choice tests provide scores in metrics that are familiar to most score users,
e.g. percentiles and grade-equivalent scores.
Limitations of MCQ
The multiple choice item, despite its advantages over other item types, has some serious
limitations as well:
It takes time to construct MCQs.
They are susceptible to guessing.
They do not provide any diagnostic information.
3. The stem of the item should present only one problem. Two concepts must not
be combined to form a single stem.
4. Include as much of the item in the stem and keep options as short as possible:
this leads to economy of space, economy of reading time and clear statement of
the problem. Hence, include most of the information in the stem and avoid
repeating it in the options. For example, if an item relates to the association of a
term with its definition, it would be better to include definition in the stem and
several terms as options rather than to present option in the stem and several
definitions as alternatives.
5. Unnecessary words or phrases should not be included in the stem. Such words
add to the length and complexity of the stem but do not enhance meaningfulness
of the stem. The stem should be written in simple, concise and clear form.
6. Avoid the use of negative words in the stem of the item. There are times when it
is important for the examinee to detect errors or to know exceptions; for these
purposes, the use of 'not' or 'except' in the stem is sometimes justified. When a
negative word is used in a stem, it should be highlighted.
8. Use plausible distracters as alternatives. If an examinee who does not know the
correct answer is not distracted by a given alternative, that alternative is not
plausible and it will add nothing to the functioning of the item.
10. The correct answer should appear at each position in roughly equal numbers.
While constructing multiple-choice items, some examiners have a tendency to
place the correct alternative in the first position; some place it in the middle and
others at the end. Such tendencies should be consciously controlled.
11. Avoid using 'none of the above', 'all of the above', 'both a and b', etc. as options
for an MCQ.
Matching Exercises
A matching exercise consists of two parallel columns, with each word, number, or
symbol in one column being matched to a word, sentence, or phrase in the other
column.
Items in the column for which a match is sought are called premises, and the items in
the column from which the selection is made are called responses.
The major advantage of the matching exercise is its compact form, which makes it
possible to measure a large amount of related factual material in a relatively short
time.
Limitations
A variety of multiple choice item formats may be used to measure higher-level learning
achievement such as comprehension, interpretation, extrapolation, application, reasoning,
and analysis, and to help students focus more on the items/test.
The most commonly used variation is the Context Dependent Item. The selection of context
or stimuli is made in accordance with the nature of the discipline /subject and the learning
outcome to be measured. The context may be in the form of a:
Paragraph
Diagram
Graph
Picture
One context/stimulus may be followed by one or more multiple choice items. Some
examples of context-dependent items are given below:
Paragraph as a Context
A paragraph as a context is used to measure learning outcomes relating to reading
comprehension, i.e. understanding the meaning/theme of the paragraph, understanding the
contextual meanings of words, and relating and synthesizing the various pieces of
information given in the paragraph.
Diagram/Picture as a Context
The questions using diagram may measure not only knowledge but understanding
and application as well.
Like other contexts, a diagram may be followed by a number of MC items, it may also
require the examinee to label various specified parts of the diagram, or even ask
about their functions.
Graph as a Context
Reading and interpreting graphs is an ability that is useful in most social and
physical sciences. MCQs may be asked to assess the desired achievement in the
respective field.
Item Development:
Items that can be scored Objectively: True false, matching, and multiple-choice type
and their variations
In preparing an objective test, care should be taken to make the items clear, precise,
grammatically correct, and written in language suitable to the reading level of the group for
whom the test is intended. All information and qualifications needed to select a reasonable
answer should be included, but non-functional or stereotyped words and phrases should be
avoided.
Assembling a Test
After items have been written and selected, they are organized in the form of a test.
So far as possible, within each item type, items dealing with the same content should be
grouped together. The examinee will then be able to concentrate on a single domain at a
time rather than having to shift back and forth among content areas. Furthermore, the
examiner will have an easier job of analyzing the results, as it will be easy to see at a glance
whether errors are more frequent in one content area than another.
Items should be arranged in the test booklet so that answers follow no set pattern.
Test Instructions
The directions should be simple but complete. They should indicate the purpose of the test,
the time limits and the score value of each question. Write a set of directions for each item
type that is used on the test specifying what the respondent is expected to do and how one
is required to record the responses.
Answer Sheets
Separate answer sheets, which are easier to score, can be used at high school level and
beyond.
Test Length
Make sure that the length of the test is appropriate for the time limits.
After the items have been reviewed and tentatively selected, it is important to see that the
items measure a representative sample of learning objectives and course content included in
the test plan. Agreement between the test plan and the test would ensure content validity of
the test.
Administration
All pupils must be given a fair chance to demonstrate their achievement. The physical and
psychological environment should be conducive to their best efforts. Control all factors that
might interfere with valid measurement: adequate workspace, quiet, proper light, and
ventilation are important. Pupils must be put at ease, and tension and anxiety should be
reduced to the minimum.
If the pupils’ answers are recorded on the test paper, the teacher may make a scoring key by
marking the correct answers on a blank copy of the test. When separate answer sheets are
used, a scoring stencil (a blank answer sheet with holes punched where the correct answers
should appear) may be prepared. Before the scoring procedure is used, each test paper should also be scanned to
make sure that only one answer was marked for each item. Any item containing more than
one answer should be eliminated from scoring.
In scoring objective tests, each correct answer is usually counted as one point. When pupils
are told to answer every item on the test, a pupil's score is simply the number of items
answered correctly.
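The number-correct scoring rule just described can be sketched in a few lines. The answer key and the pupil's responses below are invented for illustration; they are not from any real test.

```python
# Sketch of number-correct scoring: each correct answer counts one point.
# The key and the pupil's responses below are invented for illustration.

key = ["T", "F", "F", "T", "T"]
responses = ["T", "F", "T", "T", "F"]

# The pupil's score is simply the number of items answered correctly.
score = sum(1 for k, r in zip(key, responses) if k == r)
print(score)  # 3
```

In practice the key is prepared on a blank copy of the test (or a scoring stencil for separate answer sheets) before scoring begins, as described above.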
Short-answer questions may sometimes require awarding partial credit and may pose some
problems in scoring. However, a detailed key may be prepared in advance to avoid confusion.
For each question, and for the test as a whole, the examiner may tally each kind of
error that the examinees make. A summary of these errors can then be used to plan
instructional activities.
Results
The raw score obtained by a pupil has no meaning in itself and cannot be directly
interpreted. If a student obtains 75 marks out of 100, it tells us neither how s/he compares
with other students nor what s/he does or does not know. The simplest meaning teachers
often give to a test score is by assigning ranks to the scores; a rank, however, becomes
more interpretable when one knows the total number of students in the class. Often, grades
are assigned to give meaning to raw scores by comparing individual performance with the
whole group that has taken the test; in educational institutions this group is most often the
class. For a criterion-referenced test, of course, absolute grading is used.
Awarding Grades
Relative grading
Letter grades are typically assigned on the basis of performance in relation to other group
members. Some teachers assign them on the normal curve, but this is not recommended by
experts (Linn and Gronlund), for classes are usually too small to approximate a normal
curve. They suggest that, before letter grades are assigned, the proportions of As, Bs, Cs,
Ds, and Fs to be used should be determined. This must be done in the light of a consistent
policy of the institution or the system.
The following distribution is recommended for an introductory course for the purpose of
illustration only.
A = 10-20 % of the students.
B = 20-30 % of the students.
C = 30-50 % of the students.
D = 10-20 % of the students.
F = 0-10 % of the students.
The educational institution should decide on a consistent grading policy for all its
departments. Grades may be awarded on the basis of percentiles, or a standard score
system may be used.
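One way the relative-grading idea above might be sketched in code is shown below. The specific proportions used (A 15%, B 25%, C 40%, D 15%, F 5%) are assumptions chosen within the illustrative ranges given earlier, not a recommendation.

```python
# Sketch of relative grading: pupils are ranked by score and the ranking is
# cut into fixed proportions. The proportions below are assumptions chosen
# within the illustrative ranges above, for demonstration only.

def relative_grades(scores):
    proportions = [("A", 0.15), ("B", 0.25), ("C", 0.40), ("D", 0.15), ("F", 0.05)]
    # Indices of pupils, ranked from highest score to lowest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    grades = [None] * len(scores)
    start = 0
    for letter, share in proportions:
        end = start + round(share * len(scores))
        for i in ranked[start:end]:
            grades[i] = letter
        start = end
    # Any pupils left over because of rounding receive the lowest grade.
    for i in ranked[start:]:
        grades[i] = "F"
    return grades
```

For a class of 20 pupils with distinct scores, these proportions yield 3 As, 5 Bs, 8 Cs, 3 Ds, and 1 F, with the highest scorer receiving an A.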
The grade point for a course is obtained by multiplying the grade value by its credit
hours.
Finally, the grade point average (GPA: the average of the grade points for all the courses)
is found.
The GPA, a numerical value, is often converted into an equivalent letter grade.
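The grade-point arithmetic described above can be sketched as follows. The grade-value scale (A = 4.0, and so on) and the sample courses are assumptions for illustration; institutions define their own scales. The conventional credit-hour-weighted average is used here.

```python
# Sketch of the GPA computation described above. The grade values and the
# sample courses are assumptions; institutions define their own scales.

grade_value = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

# (letter grade, credit hours) for each course, invented for illustration.
courses = [("A", 3), ("B", 4), ("C", 3)]

# Grade point for a course = grade value x credit hours.
grade_points = [grade_value[g] * ch for g, ch in courses]

# GPA: total grade points divided by total credit hours.
gpa = sum(grade_points) / sum(ch for _, ch in courses)
print(gpa)  # 3.0
```

Here the three courses carry 12, 12, and 6 grade points over 10 credit hours, giving a GPA of 3.0.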
Assigning and communicating the grades to the class is not enough. It is important that the
teacher/examiner returns and reviews test results with the students/examinees. Feedback
on performance has special value in motivating learners to improve. Moreover,
learning from one's mistakes is usually very effective.
Absolute grading
Assigning grades on an absolute basis involves comparing a pupil's performance to
pre-specified standards set by the teacher. These standards are usually concerned with the
degree of mastery to be achieved, and may be specified as the percentage of correct
answers to be obtained on a test designed to measure a clearly defined set of learning
tasks (competencies), as in a criterion-referenced test.
Experts in the field recognize this system neither as absolute grading nor as relative
grading. Though not scientifically recognized, this system of grading is practiced in many
educational settings.
1. To provide faculty with information and guidelines that help them better utilize the
advantages of essay questions in assessing student performance, and with
guidelines for dealing with the challenges of essay questions.
2. To help understand the main advantages and limitations of essay questions and
common misconceptions associated with their use.
3. To help distinguish between learning outcomes that are appropriately assessed by
using essay questions and outcomes that are likely to be better assessed by other
means.
4. To evaluate existing essay questions using commonly accepted criteria.
5. To improve poorly written essay questions by using the information in this booklet to
identify and correct flaws in existing questions.
6. To construct well-written essay questions that assess given objectives.
There are two major purposes for using essay questions that address different learning
outcomes. One purpose is to assess students' understanding of subject-matter content. The
other purpose is to assess students' writing abilities. These two purposes are so different in
nature that it is best to treat them separately.
An essay question is "…a test item which requires a response composed by the examinee,
usually in the form of one or more sentences, of a nature that no single response or pattern
of responses can be listed as correct, and the accuracy and quality of which can be judged
subjectively only by one skilled or informed in the subject."
Example A
How do you feel about the removal of prayer from public school system?
In Example A, it is possible for a student to answer the question in one word. For instance, a
student could write an answer like "good" or "bad". Moreover, this is a poor example for
testing purposes because there is no basis for grading students’ personal preferences and
feelings.
The following example improves upon the given example in such a way that it elicits a
response of one or more sentences that can be graded.
School prayer should be allowed because national polls repeatedly indicate that the
majority of Pakistanis are in favor of school prayers. Moreover, statistics show a
steady moral decline in a country following the banning of organized prayer in school:
drug use, the divorce rate, and violent crime have all increased since the banning of
organized prayer in school.

Which of the following is a typical essay question: Example B or Example C?
Example B
What was the full name of the man who assassinated President Abraham Lincoln?
Example C
State the full name of the man who assassinated President Abraham Lincoln and explain
why he committed the murder.
There is just one single correct answer to Example B because the students need to
write the full name of the man who assassinated President Abraham Lincoln. The
question assesses verbatim recall or memory and not the ability to think. For this
reason, Example B would not be considered a typical essay question. Example C
assesses students’ understanding of the assassination and it is more effective at
providing students the opportunity to think and to give a variety of answers. Answers
to this question may vary in length, structure, etc.
The nature of essay questions is such that only competent specialists in the subject can
judge to what degree student responses to an essay are complete, accurate, correct,
and free from extraneous information. Ineffective essay questions allow students to
generalize in their responses without being specific and thoughtful about the content
matter.
Effective essay questions elicit a depth of thought from students that can only be judged
by someone with the appropriate experience and expertise in the content matter. Thus,
content expertise is essential for both writing and grading essay questions.
Which of the following sample questions prompts student responses that can only be judged
subjectively by a subject matter expert?
Example D
Explain how Arabs treated women before the advent of Islam.
Example E
As mentioned in class, list the main ways women were treated before the advent of Islam.
In order to grade a student's response to the above examples, the grader needs to know the
ways women were treated in the specified period in Arab countries. It takes subject-matter
expertise to grade an essay response to this question.
An essay question is a test item which contains the following four elements:
1. Requires examinees to compose rather than select their response.
2. Elicits student responses that consist of one or more sentences.
3. No single response or single response pattern is correct.
4. The accuracy and quality of students’ responses to essays must be judged
subjectively by a competent specialist in the subject.
Advantages, Limitations, and Common Misconceptions of Essay Questions
Advantages
Essay questions not only allow students to present an answer to a question but also
to explain how they arrived at their conclusion. This allows teachers to gain insights
into a student's way of viewing and solving problems. With such insights teachers are
able to detect problems students may have with their reasoning process and help
them overcome those problems.
Problem solving and decision-making are vital life competencies. In most cases,
these skills require the ability to construct a solution or decision rather than selecting
a solution or decision from a limited set of possibilities. It is not very likely that an
employer or customer will give a list of four options to choose from when he/she asks
for a problem to be solved. In most cases, a constructed response will be required.
Hence, essay items are closer to real life than selected response items because in
real life students typically construct responses, not select them.
Limitations
1. Essay questions necessitate testing a limited sample of the subject matter, thereby
reducing content validity.
2. Essay questions have limitations in reliability.
3. Essay questions require more time for scoring student responses.
4. Essay questions provide practice in poor or unpolished writing.
Common Misconceptions
Whether or not an essay item assesses higher-order thinking depends on the design of the
question and how students’ responses are scored. An essay question does not automatically
assess higher-order thinking skills. It is possible to write essay questions that simply assess
recall. Also, if a teacher designs an essay question meant to assess higher-order thinking
but then scores students' responses in a way that only rewards recall ability, that teacher is
not assessing higher-order thinking.
Exercise
Compare the following two examples and decide which one assesses higher-order
thinking skills.
Example A
What are the major advantages and limitations of essay questions?
Example B
Given their advantages and limitations, should an essay question be used to assess the
following intended learning outcome? In answering this question provide brief
explanations of the major advantages and limitations of essay questions. Clearly state
whether you think an essay question should be used to assess students’ achievement of
the given intended learning outcome and explain the reasoning for your judgment.
Intended learning outcome: Evaluate the reasons why the nursing process is an effective
process for serving clients.
Essay questions are easier to construct than multiple-choice items because teachers
don't have to create effective distracters. However, that doesn’t mean that good
essay questions are easy to construct. They may be easier to construct in a relative
sense, but they still require a lot of effort and time. Essay questions that are hastily
constructed without much thought and review usually function poorly.
One of the drawbacks of selected response items is that students sometimes get the
right answer by guessing which of the presented options is correct. This problem
does not exist with essay questions because students need to generate the answer
rather than identifying it from a set of options provided. At the same time, the use of
essay questions introduces bluffing, another form of guessing. Some students are
adept at using various methods of bluffing (vague generalities, padding, name-
dropping, etc.) to add credibility to an otherwise vacuous answer. Thus, the use of
essay questions changes the nature of the guessing that occurs, but does not
eliminate it.
Some research seems to indicate that students are more thorough in their
preparation for essay questions than in their preparation for objective examinations
like multiple choice questions.
Concerning the ranking of their students based on test scores, teachers should know that
some research suggests that students are ranked about the same on essay questions and
multiple-choice questions when test results are compared (Chase & Jacobs, 1992).
The directive verb in each intended learning outcome provides clues about the method of
assessment that should be used. This can be seen when taking a closer look at some of the
sample intended learning outcomes provided on this page. For example, the verb “recall”
means to retrieve relevant knowledge from long-term memory. Students’ ability to recall
relevant knowledge can be most conveniently assessed through objectively scored test
items. There is no need for students to explain or justify their answer when they are
assessed on recall.
The verb “analyze” means to determine how parts relate to one another and to an overall
structure or purpose. Students can demonstrate their ability to analyze the function of humor
in Shakespeare’s “Romeo and Juliet” by either describing the function of humor in their own
words or by selecting the right or best answer among the options of a well-drafted
multiple-choice item.
The verb “evaluate” means to make judgments based on criteria and standards. To
effectively assess students’ ability to evaluate the impact of the Industrial Revolution on the
family, the assessment item needs to provide students with the opportunity to not only make
an evaluative judgment but to also explain how they have arrived at their judgment. Hence,
students’ ability to evaluate should be assessed with essay items because they allow
students to explain the rationale for their judgment.
If an intended learning outcome could be either assessed through objective items or essay
questions, use essay questions for the following situations:
When it is more important that the students construct rather than select the
answer
When a teacher has sufficient resources and/or help (time, teaching assistants)
to score the student responses to the essay question(s)
When the group to be tested is small.
When a teacher is more confident of his/her ability as a critical and fair reader
than as an imaginative writer of good objective test items
Students should have a clear idea of what they are expected to do after they have read the
problem presented in an essay item. Below are specific guidelines that can help to improve
existing essay questions and create new ones.
1. Clearly define the intended learning outcome to be assessed by the item.
Knowing the intended learning outcome is crucial for designing essay questions.
If the outcome to be assessed is not clear, it is likely that the question will assess for
some skill, ability, or trait other than the one intended.
In specifying the intended learning outcome teachers clarify the performance that
students should be able to demonstrate as a result of what they have learned. The
intended learning outcome typically begins with a directive verb and describes the
observable behavior, action or outcome that students should demonstrate. The focus is
on what students should be able to do and not on the learning or teaching process.
Reviewing a list of directive verbs can help to clarify what ability students should
demonstrate and to clearly define the intended learning outcome to be assessed.
2. Avoid using essay questions for intended learning outcomes that are better
assessed with other kinds of assessment.
Some types of learning outcomes can be more efficiently and more reliably assessed
with selected-response questions than with essay questions. In addition, some complex
learning outcomes can be more directly assessed with performance assessment than
with essay questions. Since essay questions sample a limited range of subject-matter
content, are more time consuming to score, and involve greater subjectivity in scoring,
the use of essay questions should be reserved for learning outcomes that cannot be
better assessed by some other means.
3. Appropriately define the task and limit its scope.
Essay questions have two variable elements—the degree to which the task is structured
versus unstructured and the degree to which the scope of the context is focused or
unfocused. Although it is true that essay questions should always provide students with
structure and focus for their responses, it is not necessarily true that more structure and
more focus are better than less structure and less focus. When using more structure in
essay questions, teachers are trying to avoid at least two problems. More structure helps
to avoid the problem of student responses containing ideas that were not meant to be
assessed and the problem of extreme subjectivity when scoring responses. Although
more structure helps to avoid these problems, how much and what kind of structure and
focus to provide is dependent on the intended learning outcome that is to be assessed
by the essay question and the purpose for which the essay question is to be used.
The process of writing effective essay questions involves defining the task and delimiting
the scope of the task in an effort to create an effective question that is aligned with the
intended learning outcome to be assessed by it. This alignment is absolutely necessary
for eliciting student responses that can be accepted as evidence for determining the
students’ achievement of the intended learning outcome. Hence, the essay question
must be carefully and thoughtfully written in such a way that it elicits student responses
that provide the teacher with valid and reliable evidence about the students’ achievement
of the intended learning outcome.
Failure to establish adequate and effective limits for the student response to the essay
question allows students to set their own boundaries for their response, meaning that
students might provide responses that are outside of the intended task or that only
address a part of the intended task. If students’ failure to answer within the intended
limits of the essay question can be ascribed to poor or ineffective wording of the task, the
teacher is left with unreliable and invalid information about the students’ achievement of
the intended learning outcome and has no basis for grading the student responses.
Therefore, it is the responsibility of the teacher to write essay questions in such a way
that they provide students with clear boundaries for their response.
Task(s) and problem(s) are the key elements of essay questions. The task will specify the
performance students should exhibit when responding to the essay question. A task is
composed of a directive verb and the object of that verb. For example, consider the following
tasks:
i. Task = Justify (directive verb) the view you prefer (object of the verb)
ii. Task = Defend (directive verb) the theory as the most suitable for the situation
(object of the verb)
Tasks for essay questions are not developed from scratch, but are developed based on
the intended learning outcome to be assessed. In essay questions, the task can be
presented either in the form of a direct question or an imperative statement. If written as
a question, then it must be readily translatable into the form of an imperative statement.
For example, the following illustrates the same essay item twice, once as a question and
once as an imperative statement.
Question: How are the processes of increasing production and improving quality in a
manufacturing plant similar or different based on cost?
Imperative statement: Compare and contrast the processes of increasing production and
improving quality in a manufacturing plant based on cost.
Whether essay questions are written as imperative statements or questions, they should
be written to align with the intended outcome and in such a way that the task is clear
to the students.
The other key element of essay questions is the “problem.” The “problem” in essay
questions includes the unsettled matter or undesirable state of affairs that needs to be
resolved. The purpose of the problem is to provide the students with a context within
which they can demonstrate the performance to be assessed. Ideally, students would not
have previously encountered the specific problem.
Problems within essay questions differ in the complexity of thinking processes they elicit
from students depending on the intended learning outcome to be assessed. For
example, if the intended outcome is to assess basic recall, the essay question could be
to summarize views as given in class concerning a particular conflict. The thinking
process in this case is fairly simple. Students merely need to recall what was mentioned
and discussed in class. Yet consider the problem within the essay question meant to
assess students’ abilities to evaluate a particular conflict and to justify their reasoning.
This problem is more complex. In this case, students have to recall facts about the
conflict, understand the conflict, make judgments about the conflict and justify their
reasoning.
Example:
Essay Question: Explain the interrelationship of grade histories, student reviews and
course schedules for students’ selection of a course and professor.
In the example essay question, the problem is inherent in the task of the question and is
sufficiently developed. The problem for students is to translate into their own words the
interrelationships of certain factors affecting how students select courses.
For intended learning outcomes meant to assess more complex thinking, often a "problem
situation" is developed. The problem situation consists of a problem that students have not
previously encountered and that presents some unresolved matter or undesirable state of
affairs. The purpose of incorporating a problem situation into an essay question is to
confront students with a new context requiring them to assess the situation and derive an
acceptable solution by applying what they have learned.
Example intended learning outcome: Analyze the impact of the War on Terror on the Pakistani
economy.
4. Helpful Instructions: Specify the relative point value and the approximate time
limit in clear directions.
Specifying the relative point value and the approximate time limit helps students allocate
their time in answering several essay questions because the directions clarify the relative
merit of each essay question. Without such guidelines students may feel at a loss as to
how much time to spend on a question. When deciding the guidelines for how much time
should be spent on a question, keep slower students and students with certain
disabilities in mind. Also make sure that students can realistically be expected to
provide an adequate answer in the given and/or suggested time.
Example:
"All of your responses to essay questions will be graded based on the following criteria:"
When teachers intend to grade both the content and the form (writing quality) of the
response to an essay question, they should specify the relative point value for the
content and the relative point value for the form.
Rubric for grading long essay exam questions (10 points possible):
Nearly Satisfactory (6 points): The answer is lacking major details and/or does not
address a portion of the question. Most information provided is accurate. The answer
demonstrates less than basic understanding of the content. Writing may be unorganized,
not cohesive, and difficult to read.
Fails to Complete (4 points): The answer to the question is lacking any detail. Some
information provided is accurate. The answer demonstrates a lack of understanding of the
content. Writing may be unorganized, not cohesive, and difficult to read.
Unable to Begin Effectively (2 points): Question is not answered. A small amount to none
of the information provided is accurate. The answer demonstrates a lack of understanding
of the content. Writing is unorganized, not cohesive, and very difficult to read.
No Attempt (0 points): Answer was left blank.
6. Use several relatively short essay questions rather than one long one.
Only a very limited number of essay questions can be included on a test because of the
time it takes for students to respond to them and the time it takes for teachers to grade
the student responses. This creates a challenge with regard to designing valid essay
questions. Longer essay questions are better suited to assess the depth of student
learning within a subject, whereas several shorter essay questions are better suited to
assess its breadth. Hence, there is a trade-off when choosing between several short essay
questions and one long one: a focus on assessing the depth of student learning within a
subject limits the assessment of its breadth, and a focus on assessing breadth limits the
assessment of depth.
When choosing between using several short essay questions or one long one also keep
in mind that short essays are generally easier to score than long essay questions.
Students should not be permitted to choose one essay question to answer from two or
more optional questions. The use of optional questions should be avoided for the
following reasons:
Students may waste time deciding on an option.
Some questions are likely to be harder than others, which could make the comparative
assessment of students' abilities unfair.
The use of optional questions makes it difficult to evaluate if all students are
equally knowledgeable about topics covered in the test.
The following steps can help you improve the essay item before and after you hand it
out to your students.
Evaluate whether students have the content knowledge and the skills
necessary to adequately respond to the question. If you detect possible
weaknesses in the essay question, repair them before handing out the exam.
Before using the essay question on a test, ask a person knowledgeable in the
subject (colleague, teaching assistant, etc.) to critically review the essay
question, the model answer, and the intended learning outcome to determine
how well they are aligned with each other. Based on the intended learning
outcome, revise the question as needed. By having someone else look at the
test, the likelihood of creating effective test items is increased, simply
because two minds are usually better than one. Try asking a colleague to
evaluate the essay questions based on the guidelines for constructing essay
questions.
After students complete the essay questions, carefully review the range of
answers given and the manner in which students seem to have interpreted the
question. Make revisions based on the findings. Writing good essay questions is
a process that requires time and practice. Carefully studying the student
responses can help to evaluate students' understanding of the question as well
as the effectiveness of the question in assessing the intended learning outcomes.
4. Helpful instructions: specify the relative point value and the approximate time limit in
clear directions.
6. Use several relatively short essay questions rather than one long one
Preview (before)
a. Predict student responses.
b. Write a model answer.
c. Ask a knowledgeable colleague to critically review the essay question, the
model answer, and the intended learning outcome for alignment.
Review (after)
a. Review student responses to the essay question.
For exercises 1 – 2 develop an effective essay question for the given intended learning
outcome. Make sure that the essay question meets the following criteria:
The essay question matches the intended learning outcome
The task is specifically and clearly defined.
The relative point value and the approximate time limit are specified
Exercise
Choose an intended learning outcome from a course you are currently teaching and create
an effective essay question to assess students’ achievement of the outcome. Follow each of
the guidelines provided for this exercise. Check off each step on the provided checklist once
you have finished it.
Checklist
1 Clearly define the intended learning outcome to be assessed by the item.
2 Avoid using essay questions for objectives that are better assessed with
objectively-scored items.
3 Use several relatively short essay questions rather than one long one.
4 The task is appropriately defined and the scope of the task is appropriately limited.
5 Present a novel situation.
6 Consider identifying an audience for the response.
7 Specify the relative point value and the approximate time limit.
8 Predict student responses.
9 Write a model answer.
10 Have a colleague critically review the essay question.
EVALUATION OF ITEMS
Day 4, Session- 1
Often students judge, after taking the exam, whether the test was fair and good. The
teacher is also usually interested in how the test worked for the students. One way to
ascertain this is to undertake item analysis. It provides objective, external and
empirical evidence for the quality of the items we have pre-tested.
The objective of item analysis is to identify problematic or poor items: items which
might confuse the respondents, which lack a clearly correct response, or in which a
distracter competes with the keyed answer.
A good test has good items. Good test making requires careful attention to the principles
of item evaluation. The basic methods involved are the assessment of item difficulty and
item discrimination. These measures comprise item analysis.
Item Analysis
Item analysis is about how difficult an item is and how well it can discriminate between the
good and the poor students. In other words, item analysis provides a numerical assessment
of item difficulty and item discrimination.
Item Difficulty
Item difficulty is determined from the proportion (p) of students who answered each item
correctly. Item difficulty can range from zero percent (no one solved it) to one hundred
percent (everyone solved it correctly). The goal is usually to have items of all
difficulty levels in the test so that the test can identify poor, average as well as good
students. However, most of the items are designed to be of average difficulty, for they
are the most useful. The item analysis exercise provides us the difficulty level of each
item.
Optimally difficult items are those that 50%–75% of students answer correctly.
Items are considered of low to moderate difficulty if p is between 70% and 85%.
Items that only 30% or fewer of the students solve correctly are considered difficult ones.
The item difficulty percentage can also be denoted as an item difficulty index by
expressing it in decimals, e.g. .40 for an item which could be solved by 40% of the
test-takers. Thus the index can range from 0 to 1.
Items should fall in a variety of difficulty levels in order to differentiate between
good and average as well as average and poor students. Easy items are usually placed in
the initial part of the test to motivate students and to alleviate test anxiety.
The optimal item difficulty depends on the question type and number of possible distracters
as well.
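The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the module; the function name and the sample data are ours. Each row holds one student's scored answers (1 = correct, 0 = incorrect), and the result is the proportion p for each item.

```python
def item_difficulty(responses):
    """Return the proportion (p) of students answering each item correctly."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_students
            for i in range(n_items)]

# Four students, four items (illustrative data).
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]
print(item_difficulty(scores))  # [0.75, 0.75, 0.25, 0.75]
```

Multiplying each value by 100 gives the item difficulty percentage; item 3 here (p = .25) would count as a difficult item under the thresholds above.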
Item Discrimination
Another way to evaluate items is to ask: "Who gets this item correct — the good, the
average, or the weak students?" Assessment of item discrimination answers this query.
Item discrimination refers to the percentage difference in correct responses between the
high-scoring and the low-scoring students.
In a small class of 30 students, one can administer the test items, score them and then rank
the students in terms of their overall score.
Next, we separate the upper 15 students and the low 15 into two groups: The
UPPER and the LOW groups
Finally, we find how well each item was solved correctly (p) by each group. In other
words, percentage of students passing (p) each item in each of the two groups is
worked out.
Discrimination (D) power of the item is then known by finding difference between the
percentage of upper group and the low group. The higher the difference, the greater
the discrimination power of an item.
In a large class of 100 or more students, we take the top 25% and the bottom 25% of
students to form the upper and lower groups, to reduce the amount of work.
The discrimination ratio for an item falls between −1.0 and +1.0. The closer the ratio is to
+1.0, the more effectively that item distinguishes students who know the material (the top
group) from those who don’t (the bottom group).
Ten students in a class have taken a ten-item quiz. The students' responses are shown
below from high to low. The top five students can be called the high-scoring group and
the bottom five the low-scoring group. The number "1" indicates a correct answer; a "0"
indicates an incorrect answer.
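The upper/lower-group procedure above can be sketched in Python. Again this is an illustrative sketch with invented data, not the module's quiz: students are ranked by total score, split into halves, and D is the difference in the proportion passing each item.

```python
def item_discrimination(responses):
    """D = p(upper group) - p(lower group) for each item.

    responses: one list per student, 1 = correct, 0 = incorrect.
    Students are ranked by total score and split into two halves.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    n_items = len(responses[0])

    def p(group, i):
        return sum(row[i] for row in group) / len(group)

    return [p(upper, i) - p(lower, i) for i in range(n_items)]

quiz = [
    [1, 1, 1],  # highest scorer
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],  # lowest scorer
]
print(item_discrimination(quiz))  # [0.5, 1.0, 0.5]
```

Item 2 here discriminates perfectly (D = 1.0): every upper-group student passed it and no lower-group student did.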
Students' responses on the 10-item quiz can also be presented on a chart.
[Chart: items plotted by difficulty (%) against discrimination]
We find from the chart that items of medium difficulty are more discriminating and
useful, unlike the too-difficult item (item 9) and the too-easy items (items 2 and 3).
Weak or non-functional distracters may be replaced with new ones; make sure that the new
distracters align with the stem as well as the objective of the item, fit well with the
rest of the options, and are grammatically correct.
Effectiveness of Distracters
The difficulty and discrimination indices are estimates about an item, which overall
comprises a stem and a set of distracters or options (Appendix-A). The item analysis
statistics reflect the goodness of both the distracters and the stem. Let us look at the
guidelines which can help us improve them.
1. Most MCQs have 2-4 distracters; 3 is better, and 4 is best at the college level.
Where it is difficult to think of more than one distracter, frame the item as a
true/false item.
2. Distracters that have less than 5 percent response rate are weak and may be
changed / improved. Distracters which attracted no response are not working at all.
3. No distracter should be chosen more than the keyed response in the upper group
4. Similarly, no one distracter should pull more than about half the students.
5. If students respond about equally to all the options, they might be marking
randomly or guessing wildly. Critically check the contents of such items. They might
have been written badly, so that the students have no idea what you are asking. Or
the items could be very difficult and the students completely baffled.
6. If the low group gets the keyed answer as often as the upper group, all the distracters
might be looked into again. Or drop the item if you have a large pool of items.
7. Do not repeat a phrase in the options if it can be stated in the stem. Thus make short
and precise distracters.
8. Options should appear on separate lines and be suitably indented.
9. There should be plausible and homogeneous distracters which are presented in
logical or numerical order and must be independent of one another.
10. Keep the alternatives mutually exclusive, homogeneous in contents and free from
clues that might indicate which response is correct. These should be moreover
parallel in form and similar in length.
11. The position of the keyed response should vary among the A, B, C and D positions.
12. Distracters should be related or somehow linked to each other, should appear as
similar as possible to the correct answer and should not stand out as a result of their
phrasing. If the stem is in the past tense, all the options should be in the past tense.
If the stem calls for a plural answer, all the options should be plural. Stem and options
should have subject-verb agreement. All options follow grammatically from the stem.
13. Options should not include “none of the above” or “all of the above.” None of the
above is problematic in items where judgment is involved and where the options are
not absolutely true or false.
14. When more than one option has some element of accuracy but the keyed response
is the best, ask the students to select the "best answer" rather than the "correct
answer."
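Guideline 2 above (distracters drawing under 5% of responses are weak) lends itself to a simple tally. The sketch below is illustrative only — the answer data and the 5% threshold check are ours, assuming a single four-option item keyed "A":

```python
from collections import Counter

def distracter_rates(choices):
    """Proportion of test-takers choosing each option of one item."""
    counts = Counter(choices)
    n = len(choices)
    return {opt: counts.get(opt, 0) / n for opt in "ABCD"}

# 20 hypothetical student answers to one item; keyed answer is "A".
answers = list("ABAACABAAAAABACAAABA")
rates = distracter_rates(answers)

# Distracters drawing under 5% of responses are not working (guideline 2).
weak = [opt for opt, r in rates.items() if opt != "A" and r < 0.05]
print(rates["A"], weak)  # 0.7 ['D']
```

Here option D attracted no responses at all, so it should be replaced or improved; splitting the tally by upper and lower groups would likewise check guidelines 3 and 6.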
Objectives:
1- To consolidate the preceding presentation
2- Applying principles & conventions of item construction / Brainstorming
3- Hands-on-Practice / Learning by doing
All tests should be valid and reliable, and MCQ-based tests tend to do well on both
counts because of several advantages over other examination techniques. Here we discuss
four qualities of a test.
1- Reliability
Reliability tells us whether a test is likely to yield the same results if administered
multiple times to the same group of test takers. In other words, a test is said to be
reliable if it measures consistently.
However, no test or measure is perfect. A certain degree of error, called random or
chance error, does creep in. In measuring length with a ruler, for example, there may be
random error associated with your eye's ability to read the markings or extrapolate
between the markings. In addition, the scale that you use to measure length may not be
very precise and accurate.
These factors fluctuate and vary from time to time, influencing students' performance on
the test adversely. When such chance or random error is kept to a minimum, test scores
truly reflect students' ability. Reliability, as a statistical estimate, ranges between
0 and 1.
2) Factors in test-takers
Changes in students' attitudes, health, mood, sleep etc. can affect the quality of their
efforts and thus their test-taking consistency. For example, test takers may make
careless errors, misinterpret test instructions, forget test instructions, inadvertently omit
test sections, or misread test items.
3) Scoring errors
These refer to scoring rubrics, and a host of rater errors manifested in exams.
Educational tests are reliable when they have homogeneous contents. Performance of the
students is consistent on homogeneous test contents: the poor students would perform
poorly throughout the test and vice versa. Among the several methods of estimating
reliability (test-retest, split-half, alternate-form and internal consistency), the last
is particularly salient for scholastic tests. The Kuder-Richardson formula 20, or
'KR-20' as it is formally called, is a statistical method that builds directly on item
analysis work.
________________________________________________________________________
The KR-20 formula statistically works out the reliability of an educational or ability
test once the difficulty level of the items is known, that is, what proportion of
students passed (p) and what proportion failed (q) each item of the test.
________________________________________________________________________
KR-20 = (K / (K − 1)) × (1 − Σpq / SDt²)
Let us apply this formula to the 10-item quiz that we item-analyzed in the earlier session.
K = number of items
SDt² = squared standard deviation (i.e. variance) of the total test scores
p = proportion of students who pass an item
q = proportion of students who fail the item (q = 1 − p)
________________________________________________________________________
Values of .8 or higher are considered satisfactory for a test of 50 or more items. For a
short quiz of 10 items which also has some flawed items, as pointed out by the p and q
values, the reliability index would be very modest. The quiz, when revised, would gain in
reliability, and if more items are added to it, provided they are good ones, it will
improve in reliability still more. The degree of reliability is a function of the number
of items in a quiz: lengthened tests cover the contents and subject matter more
comprehensively. The more the merrier!
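The KR-20 computation can be sketched directly from the symbols defined above. This is an illustrative sketch with invented data (not the module's 10-item quiz); K, p, q and SDt² follow the definitions in the formula.

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 = (K / (K - 1)) * (1 - sum(p*q) / SDt^2).

    responses: one list per student, 1 = pass, 0 = fail per item.
    """
    k = len(responses[0])              # K: number of items
    n = len(responses)
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)      # SDt squared
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n
        pq += p * (1 - p)              # q = 1 - p
    return (k / (k - 1)) * (1 - pq / var_total)

# Four students, three items (illustrative data only).
quiz = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(quiz), 2))  # 0.75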
Validity
Validity indicates whether a test measures what it purports to measure. Usually a test
based on MCQs covers the entire course and is therefore potentially a more valid
assessment than descriptive tests.
The scores of a not-so-valid test are less credible as evidence of a student's mastery of
the course material; it is therefore not safe to draw any inference or decision from
them. Assessing content validity is thus salient and essential for an educational test.
Course exams or scholastic tests are required to cover and represent the entire course
domain / knowledge to be called valid tests. Subject specialists or experts judge how valid a
test is by contents.
For example, a test intended to measure knowledge of a fifth-grade science course is
judged by a panel of 2-3 teachers who, as experts, estimate how representative the test
contents are of grade-5 science and of the learning objectives of the course, with due
emphasis. They rate the test material accordingly, and the degree of agreement in their
ratings is considered the content validity index.
Another method to establish validity is to correlate scores on the test with another
established test or criterion: for example, correlating scores on an Economics test at a
local university with overall GPA or with GRE scores. This procedure is called
criterion-related validity.
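The correlation at the heart of criterion-related validity is the Pearson coefficient. A minimal sketch, with invented test scores and GPA values standing in for the Economics-test example above:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Illustrative data: five students' test scores and their GPAs (criterion).
econ_test = [55, 60, 65, 70, 80]
gpa = [2.4, 2.6, 3.0, 3.1, 3.6]
r = pearson(econ_test, gpa)
```

A coefficient near +1 would support criterion-related validity; values near 0 would suggest the test scores tell us little about the criterion.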
Practicality
MCQs are practically useful and efficient, especially in large-scale testing situations,
unlike descriptive tests, which are more resource-intensive and demand time and money.
Increasing enrolment in colleges and universities has led staff to incorporate objective
testing into their assessment for more efficient examination of students.
MCQs are versatile and adaptable to various levels of learning / educational objectives
(recall, comprehension, application, analysis).
MCQs test a broad area of knowledge in a short time and are easy to score. Moreover, they
yield rich data for psychometric analysis.
To develop quality MCQs, faculty need to have sufficient know-how of the techniques of
test construction, besides the motivation to persevere in the intensive work of framing
and pre-testing question papers. It also requires them to orient students to MCQs as a
system of assessment and evaluation.
Objectivity
Objectivity refers to fairness and uniformity in the test scoring procedure. Examiner or
rater bias is absent in MCQ-based tests; that is why they are called objective tests.
Further, the analysis of the test data is undertaken statistically which further assesses
various dimensions of the tests to make them more precise and accurate instruments.
PREPARATION
Printing, question papers (uniform-varied)
Exam hall size & number of examinees
Training in test administration, monitoring students
Security of material (with precise accounting)
COMMUNICATE
Result to be conveyed in clear and understandable terms
EVALUATION
Of formative and summative assessment
Judging learning outcomes of students
Improvements in A-E above
References
Bloom B S. Taxonomy of educational objectives. Vol I: Cognitive domain. New York, NY: M
Cox K R, Bandaranayake R. (1978). How to write good multiple choice questions. Med
Journal, Medline, 2 , 553–554.
Kaplan & Saccuzzo (2002) Psychological Testing: Principles, Applications & Issues (7th
Edition)
http://testing.byu.edu/info/handbook/betteritems.pdf
List of Appendices
Books:
Bloom, Benjamin B. (Ed.) Taxonomy of Educational Objectives: the classification of
educational goals, by a committee of college and university examiners 1st Ed. New
York: Longmans, Green, 1956.
Davis, Barbara Gross. Tools for Teaching San Francisco: Jossey-Bass, 1993.
Erickson, Bette LaSere and Diane Weltner Strommer. Teaching College Freshmen San
Francisco: Jossey-Bass, 1991.
Jacobs, Lucy Cheser and Clinton I. Chase. Developing and Using Tests Effectively: A
Guide for Faculty San Francisco: Jossey-Bass, 1992.
McKeachie, Wilbert. Teaching Tips: Strategies, Research, and Theory for College and
University Teachers (9th Ed.) Lexington, Mass: D.C. Heath and Company, 1994.
Miller, Harry G., Reed G. Williams, and Thomas M. Haladyna. Beyond Facts: Objective Ways
to Measure Thinking Englewood Cliffs: Educational Technology Publications, 1978.
Articles:
Clegg, Victoria L. and William E. Cashin. "Improving Multiple-Choice Tests." Idea Paper #16,
Center for Faculty Evaluation and Development, Kansas State University, 1986.
Fuhrman, Miriam. "Developing Good Multiple-Choice Tests and Test Questions." Journal of
Geoscience Education 44 (1996): 379-384.
Johnson, Janice K. ". . . Or None of the Above." The Science Teacher 56.2 (1989): 57-61.
Websites:
University of Cape Town's Guide to Designing and Managing Multiple Choice Questions
Contact:
University of Oregon. Email: tep@uoregon.edu; Phone: 541-346-2177; Fax: 541-346-2184.
Appendix-B
Once you know the learning objectives and the item types you want to include in your test, you should create a test blueprint. A test blueprint, also known as a table of test specifications, is a matrix showing the number of questions you want in your test for each topic and each level of objectives. The blueprint identifies the objectives and skills to be tested and their relative weighting, and it helps you steer through the desired coverage of topics as well as levels of objectives. Once you have created your test blueprint, you can begin writing your items!
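The blueprint matrix described above can be sketched in code. The sketch below is illustrative only: the topic names, cognitive levels, and item counts are assumptions chosen for the example, not part of this module.

```python
# A minimal test-blueprint sketch: rows are topics, columns are
# levels of objectives, and each cell is the number of items planned.
# All topic names and counts below are illustrative assumptions.
blueprint = {
    "Measurement concepts": {"Knowledge": 4, "Comprehension": 3, "Application": 1},
    "Item writing":         {"Knowledge": 2, "Comprehension": 4, "Application": 4},
    "Test administration":  {"Knowledge": 3, "Comprehension": 2, "Application": 2},
}

def total_items(bp):
    """Total number of items planned across the whole test."""
    return sum(sum(levels.values()) for levels in bp.values())

def level_weighting(bp):
    """Relative weight of each cognitive level in the planned test."""
    totals = {}
    for levels in bp.values():
        for level, n in levels.items():
            totals[level] = totals.get(level, 0) + n
    grand = sum(totals.values())
    return {level: n / grand for level, n in totals.items()}

print(total_items(blueprint))     # 25 items planned in this example
print(level_weighting(blueprint)) # e.g. Knowledge carries 36% of the test
```

Summing across rows and columns like this makes it easy to check, before writing a single item, whether the planned test matches the intended coverage of topics and objective levels.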
Appendix-C
2-Comprehension
3-Application
Which one of the following learning outcomes is properly stated in terms of student
performance?
A. Develops an appreciation of the importance of testing.
B. Explains the purpose of test specifications.*
C. Learns how to write good test items.
D. Realizes the importance of validity.
4-Analysis
Directions: Read the following comments a teacher made about testing. Then answer the
questions that follow by circling the letter of the best answer.
“Students go to school to learn, not to take tests. In addition, tests cannot be used to indicate
a student’s absolute level of learning. All tests can do is rank students in order of
achievement, and this relative ranking is influenced by guessing, bluffing, and the subjective
opinions of the teacher doing the scoring. The teaching-learning process would benefit if we
did away with tests and depended on student self-evaluation.”
Which one of the following types of test is this teacher primarily talking about?
5-Synthesis
Given a short story, the student will write a different but plausible ending.
(See paragraph for analysis items)
Outcome: Identifies relationships.
Which one of the following propositions is most essential to the final conclusion?
6-Evaluation
Given a description of a country’s economic system, the student will defend it by basing
arguments on principles of socialism.
Reference:
1. Kubiszyn, T., & Borich, G. (1984). Educational testing and measurement: Classroom application and practice. Glenview, IL: Scott, Foresman, pp. 53–55.