HPGD2303 Educational Assessment - Caug17 (Bookmark) PDF

HPGD2303
Educational Assessment
Copyright © Open University Malaysia (OUM)

HPGD2303
EDUCATIONAL
ASSESSMENT
Prof Dr John Arul Phillips

Project Director: Prof Dato’ Dr Mansor Fadzil
Open University Malaysia
Module Writer: Prof Dr John Arul Phillips

Asia e University
Enhancer: Mr Yap Yee Khiong
Moderator: Prof Dr Kuldip Kaur

Adapted by: Assoc Prof Dr Chung Han Tek

Developed by: Centre for Instructional Design and Technology

First Edition, April 2011

Second Edition, August 2017
Copyright © Open University Malaysia, August 2017, HPGD2303
All rights reserved. No part of this work may be reproduced in any form or by any means without
the written permission of the President, Open University Malaysia (OUM).

Table of Contents
Course Guide ixăxviii
Course Assignment Guide xxiăxxv
Topic 1 The Role of Assessment in Teaching and Learning 1

1.1 Tests, Measurement, Evaluation and Assessment 2
1.2 Why Do We Assess? 5
1.3 General Principles of Assessment 10
1.4 Types of Assessment 11
1.4.1 Formative versus Summative Assessments 11
1.4.2 Norm-referenced versus Criterion-referenced 15
Tests
1.5 Current Trends in Assessment 17
Summary 19
Key Terms 19
Topic 2 Foundation for Assessment: What to Assess 17

2.1 What to Assess 21
2.2 Three Types of Learning Outcomes 22
2.3 Assessing Cognitive Learning Outcomes 24
2.4 Assessing Affective Outcomes 30
2.5 Assessing Psychomotor Learning Outcomes 35
Summary 39
Key Terms 40
Topic 3 Planning the Classroom Test 41

3.1 Purpose of Classroom Testing 42
3.2 Planning the Classroom Test 43
3.2.1 Deciding its Purpose 43
3.2.2 Specifying the Learning Objectives 44
3.2.3 Developing Test Specifications 44
3.2.4 Selecting Best Item Types 48
3.2.5 Preparing Rubrics or Marking Schemes 49
3.2.6 Preparing Test Items 50
3.3 Assessing TeacherÊs Own Test 51
Summary 52
Key Terms 53

iv  TABLE OF CONTENTS
Topic 4 Constructing Objective Test Items 55

4.1 What is an Objective Test? 55
4.2 Short-answer Questions 56
4.3 True-false Questions 59
4.3.1 Limitations of True-false Questions 59
4.3.2 Suggestions for Construction of True-false Questions 60
4.4 Matching Questions 63
4.4.1 Advantages of Matching Questions 64
4.4.2 Limitations of Matching Questions 65
4.4.3 Suggestions for Writing Good Matching Questions 65
4.5 Multiple-choice Questions 66
4.5.1 Parts of a Multiple-choice Question 67
4.5.2 Construction of Multiple-choice Questions 68
4.5.3 Advantages and Limitations of MCQs 73
4.5.4 Constructing MCQs to Measure Higher-order 75
Thinking (HOT)
Summary 79
Key Terms 80
Topic 5 Constructing Essay Questions 81

5.1 What is an Essay Question? 82
5.2 Types of Essays 83
5.3 Why are Essay Questions Used? 84
5.4 Deciding Whether to Use Essay Or Objective Questions 85
5.5 Limitations of Essay Questions 86
5.6 Some Misconceptions about Essay Questions 87
5.7 Some Guidelines for Constructing Essay Questions 89f
5.8 Verbs Describing Various Kinds of Mental Tasks 96
5.9 Marking Essay Questions 100
Summary 105
Key Terms 106
Topic 6 Authentic Assessments 107

6.1 What is Authentic Assessment? 108
6.1.1 Types of Authentic Assessments 110
6.1.2 Characteristics of Authentic Assessments 111
6.2 How to Create Authentic Assessments? 112
6.2.1 Advantages of Authentic Assessments 113
6.2.2 Criticisms of Authentic Assessments 114
6.3 Differences between Traditional Assessment and 115
Authentic Assessment

TABLE OF CONTENTS  v
6.4 Assessment Tools 118

6.4.1 Scoring Rubrics 119
6.4.2 Rating Scales 121
6.4.3 Checklists 123
Summary 124
Key Terms 124
Topic 7 Project and Portfolio Assessments 125

7.1 Project Assessment 126
7.1.1 Assessing Learners with Projects 129
7.1.2 Designing Effective Projects 132
7.1.3 Possible Problems with Project Work 135
7.1.4 Group Work in Projects 136
7.1.5 Assessing Project Work 137
7.1.6 Assessing Process in a Project 141
7.2 Portfolio Assessment 146
7.2.1 What is Portfolio Assessment? 147
7.2.2 Rationale for Portfolio Assessment 147
7.2.3 Types of Portfolios 148
7.2.4 How to Develop a Portfolio 149
7.2.5 Advantages and Disadvantages of Portfolio 150
Assessment
7.2.6 How and When Should a Portfolio be Assessed? 152
Summary 155
Key Terms 156
Topic 8 Test Reliability and Validity 157

8.1 What is Reliability? 157
8.2 The Reliability Coefficient 159
8.3 Methods to Estimate the Reliability of a Test 162
8.4 Inter-rater and Intra-rater Reliability 166
8.5 Types of Validity 168
8.6 Factors Affecting Reliability and Validity 172
8.7 Relationship between Reliability and Validity 174
Summary 175
Key Terms 176

vi  TABLE OF CONTENTS
Topic 9 Appraising Classroom Tests and Item Analysis 177

9.1 What is Item Analysis? 178
9.2 Steps in Item Analysis 178
9.3 The Difficulty Index 180
9.4 The Discrimination Index 182
9.5 Application of Item Analysis on Essay-type Question 184
9.6 Relationship between Difficulty Index and Discrimination 187
Index
9.7 Distractor Analysis 188
9.8 Practical Approach to Item Analysis 190
9.9 Usefulness of Item Analysis to Teachers 192
9.10 Cautions in Interpreting Item Analysis Results 193
9.11 Item Bank 194
9.12 Psychometric Softwares 196
Summary 196
Key Terms 197
Topic 10 Analysis and Interpretation of Test Scores 198

10.1 Why Use Statistics? 199
10.2 Describing Test Scores 200
10.3 Standard Scores 208
10.4 The Normal Curve 212
Summary 213
Key Terms 214
References 215

COURSE GUIDE

Table of Contents
Welcome to HPGD2303 x
What will You Gain from this Course? x

Description of the Course
Aim of the Course
Course Outcomes
How can You Get the Most Out of this Course? xi

Learning Package
Course Synopsis
Text Arrangement Guide
How will You Be Assessed? xiv
What Support will You Get in Studying this Course? xv

Seminars
myINSPIRE Online Discussion
Facilitator
Library Resources
How should You Study for this Course? xvi

Time Commitment for Studying
Proposed Study Strategy

x  COURSE GUIDE
WELCOME TO HPGD2303 EDUCATIONAL

ASSESSMENT
Welcome to HPGD2303 Educational Assessment, which is one of the required
courses for the Postgraduate Diploma in Teaching programme. The course
assumes no previous knowledge of educational assessment but you are
encouraged to tap into your experiences as a teacher, instructor, lecturer or
trainer and relate them to the principles of assessment discussed. This is a 3
credit course conducted over a semester of 14 weeks.
WHAT WILL YOU GAIN FROM THIS COURSE?

Description of the Course
The course discusses the differences between testing, measurement and
assessment or evaluation. The focus is on the role of assessment and followed by
discussion on „what to assess‰, which forms the foundation of assessment. With
regard to the „what‰, emphasis is on cognitive outcomes, affective outcomes and
psychomotor outcomes. Subsequent topics cover the aspect of how to assess with
emphasis on higher-order thinking skills. Besides the usual objective and essay
tests, other authentic assessment techniques such as projects and portfolios are
presented. Also discussed are techniques to determine the effectiveness of
various assessment approaches focussing on reliability, validity and item
analysis. Finally, various basic statistical procedures are presented in the analysis
of assessment results and their interpretation.
Aim of the Course

The main aim of the course is to provide learners with a foundation on the
principles and theories of educational testing and assessment as well as their
applications in the classroom.

COURSE GUIDE  xi
Course Outcomes
By the end of this course, you should be able to:
1. Identify the different principles and theories of educational testing and

assessment;
2. Compare the different procedures of educational testing and assessment;
3. Apply the different principles and theories in the development of

assessment techniques for use in the classroom; and
4. Critically evaluate the quality of different assessment procedures.
HOW CAN YOU GET THE MOST OUT OF THIS

COURSE?
Learning Package
In this Learning Package you are provided with three types of course materials:
1. The Course Guide you are currently reading;
2. The Course Content (consisting of 10 topics);
3. The Course Assessment Guide (which describes the assignments to be

submitted and the examinations you have to sit for); and
4. The Course online platform i.e. MyINSPIRE Online Discussion.
Please ensure that you have all of these materials and the correct url.

xii  COURSE GUIDE
Course Synopsis
To enable you to achieve the four course outcomes, the course content has been
divided into ten topics. Specific learning outcomes are stated at the start of each
topic indicating what you should be able to achieve after completing each topic.
Topic Title Week

1. The Role of Assessment in Teaching and Learning 1
2. Foundation for Assessment: What to Assess 2ă3
3. Planning the Classroom Test 3
4. Constructing Objective Test Items 4ă5
5. Constructing Essay Tests 6ă7
6. Authentic Assessment 8
7. Project and Portfolio Assessments 8ă9
8. Test Reliability and Validity 10
9. Appraising Classroom Tests and Item Analysis 11
10. Analysis and Interpretation of Assessment Scores 12
Topic 1 discusses the differences between testing, measurement, evaluation and

assessment, the role of assessment in teaching and learning and some general
principles of assessment. Also explored is the difference between formative and
summative assessments as well as the difference between criterion and norm-
referenced tests. The topic concludes with a brief discussion of the current trends
in assessment.
Topic 2 discusses the behaviours to be tested focussing on cognitive, affective

and psychomotor learning outcomes and reasons why assessments of the latter
two outcomes are ignored.
Topic 3 provides some useful guidelines to help teachers plan valid, reliable and
useful assessments. The discussion includes determining what is to be measured
and minimising measurement irrelevancies. The topic will also guide teachers to
devise strategies to measure the domain well. Example of the Table of
Specifications, a 2-way table is presented.
Topic 4 discusses the design and development of objective tests in the assessment
of various kinds of behaviours with emphasis on the limitations and advantages
of using this type of assessment tool.

COURSE GUIDE  xiii
Topic 5 examines the role of essay tests in assessing various kinds of learning
outcomes as well as its limitations and strengths, and the procedures involved in
the design of good essay questions.
Topic 6 introduces a form of assessment in which learners are assigned to

perform real-world tasks that demonstrate meaningful application of essential
knowledge and skills. Teachers will be able to understand in what way Authentic
Assessment is similar to or different from Traditional Assessment. Emphasis is
also given to scoring rubrics.
Topic 7 discusses in detail two examples of authentic assessments, namely

portfolio and project assessments. Guidelines to portfolio entries and project
works and evaluation criteria are discussed in detail.
Topic 8 focuses on basic concepts of test reliability and validity. The topic also
includes methods to estimate the reliability of a test and factors to increase
reliability and validity of a test.
Topic 9 examines the concept of item analysis and the different procedures for
establishing the effectiveness of objective and essay-type tests focussing on item
difficulty and item discrimination. The topic concludes with a brief explanation
of item bank.
Topic 10 focuses on the analysis and interpretation of the data collected by tests.
For quantitative analysis of data, various statistical procedures are used. Some of
the statistical procedures used in the interpretation and analysis of assessment
results are measures of central tendency and correlation coefficients.
Text Arrangement Guide

Before you go through this module, it is important that you note the text
arrangement. Understanding the text arrangement will help you to organise your
study of this course in a more objective and effective way. Generally, the text
arrangement for each topic is as follows:
Learning Outcomes: This section refers to what you should achieve after you
have completely covered a topic. As you go through each topic, you should
frequently refer to these learning outcomes. By doing this, you can continuously
gauge your understanding of the topic.

xiv  COURSE GUIDE
Self-Check: This component of the module is inserted at strategic locations

throughout the module. It may be inserted after one sub-section or a few sub-
sections. It usually comes in the form of a question. When you come across this
component, try to reflect on what you have already learnt thus far. By attempting
to answer the question, you should be able to gauge how well you have
understood the sub-section(s). Most of the time, the answers to the questions can
be found directly from the module itself.
Activity: Like Self-Check, the Activity component is also placed at various

locations or junctures throughout the module. This component may require you
to solve questions, explore short case studies, or conduct an observation or
research. It may even require you to evaluate a given scenario. When you come
across an Activity, you should try to reflect on what you have gathered from the
module and apply it to real situations. You should, at the same time, engage
yourself in higher order thinking where you might be required to analyse,
synthesise and evaluate instead of only having to recall and define.
Summary: You will find this component at the end of each topic. This component
helps you to recap the whole topic. By going through the summary, you should
be able to gauge your knowledge retention level. Should you find points in the
summary that you do not fully understand, it would be a good idea for you to
revisit the details in the module.
Key Terms: This component can be found at the end of each topic. You should go
through this component to remind yourself of important terms or jargon used
throughout the module. Should you find terms here that you are not able to
explain, you should look for the terms in the module.
References: The References section is where a list of relevant and useful

textbooks, journals, articles, electronic contents or sources can be found. The list
can appear in a few locations such as in the Course Guide (at the References
section), at the end of every topic or at the back of the module. You are
encouraged to read or refer to the suggested sources to obtain the additional
information needed and to enhance your overall understanding of the course.
HOW WILL YOU BE ASSESSED?

Assessment Method
Please refer to myINSPIRE.

COURSE GUIDE  xv
WHAT SUPPORT WILL YOU GET IN STUDYING

THIS COURSE?
Seminars
There are 15 hours of seminars or face-to-face interactions supporting the course.
This consists of FIVE seminar sessions of 3 hours each. You will be notified of the
dates, times and location of these seminars together with the name and phone
number of your tutor as soon as you are assigned to a tutorial group.
MyINSPIRE Online Discussion

Besides the face-to-face tutorial sessions, you have the support of online
discussions. You should interact with other learners and your facilitators using
MyINSPIRE. Your participation and contributions to the online discussion will
greatly enhance your understanding of the course content, guide you in the
assignments as well as prepare for the examination.
Facilitator
Your facilitator will mark your assignments and assist you during the course. Do
not hesitate to discuss during the tutorial sessions or online if:
Ć You do not understand any part of the course content or the assigned
readings;
Ć You have difficulty with the self-tests and activities; or
Ć You have a question or problem with the assignments.
Library Resources
The Digital Library has a large collection of books and journals which you can
access using your learner ID.

xvi  COURSE GUIDE
HOW SHOULD YOU STUDY FOR THIS COURSE?

1. Time Commitment for Studying
You should plan to spend about six to eight hours per topic reading the
notes, performing the self-tests and activities as well as referring to the
suggested readings. You must also schedule your time to participate in
online discussions. It is often more convenient for you to distribute the
hours over a number of days rather than spend one whole day per week on
your studies. Some topics may require more work than others may,
therefore, it is suggested that, on average, you spend approximately three
days per topic.
2. Proposed Study Strategy

The following is a proposed strategy for working through the course. If you
encounter any trouble, discuss it with your facilitator either online or
during the seminar sessions. Remember, the facilitator is there to help you.
(a) The most important step is to read the contents of this Course Guide
thoroughly.
(b) Organise a study schedule. Note the time you are expected to spend
on each topic and the date for submission of assignments as well as
seminar and examination dates. These are stated in your Course
Assessment Guide. Note down all this information in one place such
as your diary or a wall calendar. Jot down your own dates for
working on each topic. You have some flexibility as there are 10 topics
spread over a period of 14 weeks.
(c) Once you have created your own study schedule, make every effort to
„stick to it‰. The main reason learners are unable to cope is because
they lag behind in their coursework.
(d) To begin reading a topic:

Ć Remember that in distance learning, much of your time will be
spent READING the course content. Study the list of topics given
at the beginning of each topic and examine the relationship of the
topic to the other topics.
Ć Read the topicÊs learning outcomes (what is expected of you). Do

you already know some of the key points in the topic? What are
the things you do not know?

COURSE GUIDE  xvii
Ć Read the introduction (to see how it connects with the previous
topic).
Ć Work through the topic (contents of the topic are arranged to

provide a sequence for you to follow through).
Ć As you work through the topic, you will be directed to perform

the self-tests at appropriate intervals throughout the topic. This
will enable you to find out if you understand what you have just
read.
Ć Work out the activities stated (to see if you can apply the concepts
learned to real-world situations).
(f) When you have completed the topic, review the learning outcomes to
confirm that you have achieved them and are able to do what is
required.
(g) If you are confident, you can proceed to the next topic. Proceed topic
by topic through the course and try to pace your study so that you
keep to your planned schedule.
(h) After completing all topics, review the course and prepare yourself for
the final examination. Check that you have achieved all the topicsÊ
learning outcomes and the course objectives (listed in this Course
Guide).
FINAL REMARKS
Once again, welcome to the course. To maximise your gain from this course you
should try at all times to relate what you are studying to the real world. Look at
the environment in your organisation and ask yourself whether the ideas
discussed apply. Most of the ideas, concepts and principles you learn in this
course have practical applications. It is important to realise that much of what we
do in education and training has to be based on sound theoretical foundations.
The contents of this course merely address the basic principles and concepts of
assessment in education. You are advised to go beyond the course and continue
with lots of self-study to further enhance your knowledge on educational
assessment.
We wish you success with the course and hope that you will find it interesting,
useful and relevant in your development as a professional. We hope you will
enjoy your experience with OUM and we would like to end with a saying by
Confucius ă „Education without thinking is labour lost‰.

xviii  COURSE GUIDE
TAN SRI DR ABDULLAH SANUSI (TSDAS)

DIGITAL LIBRARY
The TSDAS Digital Library has a wide range of print and online resources for
the use of its learners. This comprehensive digital library, which is accessible
through the OUM portal, provides access to more than 30 online databases
comprising e-journals, e-theses, e-books and more. Examples of databases
available are EBSCOhost, ProQuest, SpringerLink, Books247, InfoSci Books,
Emerald Management Plus and Ebrary Electronic Books. As an OUM learner,
you are encouraged to make full use of the resources available through this
library.

COURSE ASSIGNMENT GUIDE

xxvi X COURSE ASSIGNMENT GUIDE

Table of Contents
Introduction xxii
Academic Writing xxii

(a) Plagiarism
(b) Documenting Sources

(i) What is Plagiarism?
(ii) How Can I Avoid Plagiarism?
 Direct Citation
 Indirect Citation
 Third-party Citation
(c) Referencing
 Journal Articles
 Online Journal
 Webpage
 Book
 Article in Book
 Printed Newspaper
Details about Assignments xxiv

xxii  COURSE ASSIGNMENT GUIDE
INTRODUCTION
This guide explains the basis in which you will be assessed in this course during
the semester. It contains details of the facilitator-marked assignments, final
examination and participation required for the course.
One element in the assessment strategy of the course is that all learners and
facilitators should be provided with the same information about how the learners
will be assessed. Therefore, this guide also contains the marking criteria that
facilitators will use in assessing your work.
Please read through the whole guide. It should be read at the start of the course.
ACADEMIC WRITING
(a) Plagiarism
(i) What Is Plagiarism?
Any written assignment (essays, project, take-home tests and others)
submitted by a learner must not be deceptive with regard to the
abilities, knowledge or amount of work contributed by the learner.
There are many ways that this rule can be violated. Among them are:
Paraphrases A closely reasoned argument of an author is paraphrased but

the learner does not acknowledge doing so. (Clearly, all our
knowledge is derived from somewhere but detailed arguments
from clearly identifiable sources must be acknowledged.)
Outright Large sections of the paper are simply copied from other sources
plagiarism and the copied parts are not acknowledged as quotations.
Other These often include essays written by other students or sold by
sources unscrupulous organisations. Quoting from such papers is
perfectly legitimate if quotation marks are used and the source
is cited.
Works by Taking credit deliberately or not deliberately for works
others produced by others without giving proper acknowledgement.
These works include photographs, charts, graphs, drawings,
statistics, video clips, audio clips, verbal exchanges such as
interviews or lectures, performances on television and texts
printed on the Web.
Duplication The student submits the same essay for two or more courses.

COURSE ASSIGNMENT GUIDE  xxiii
(ii) How Can I Avoid Plagiarism?
 Insert quotation marks for „copy and paste‰ clauses, phrases,

sentences and paragraphs, and cite the original source;
 Paraphrase the clause, phrase, sentence or paragraph in your own

words and cite your source;
 Adhere to the APA (American Psychological Association) stylistic

format, whichever applicable, when citing a source and when
writing out the bibliography or reference page;
 Attempt to write independently without being overly dependent

on information from another personÊs original works; and
 Educate yourself on what may be considered as common

knowledge (no copyright necessary), public domain (copyright
has expired or not protected under copyright law) or copyright
(legally protected).
(b) Documenting Sources

Whenever you quote, paraphrase, summarise or refer to the work of
another, you are required to cite its original source documentation. Offered
here are some of the most commonly cited forms of material.
 Direct Citation Simply having a thinking skill is no assurance that

children will use it. In order for such skills to
become part of day-to-day behaviour, they must
be cultivated in an environment that value and
sustains them. „Just as childrenÊs musical skills
will likely lay fallow in an environment that
doesnÊt encourage music, learnerÊs thinking skills
tend to languish in a culture that doesnÊt
encourage thinking‰ (Tishman, Perkins & Jay,
1995, p. 5)
 Indirect Citation According to Wurman (1988), the new disease of

the 21st century will be information anxiety, which
has been defined as the ever-widening gap
between what one understands and what one
thinks one should understand.

xxiv  COURSE ASSIGNMENT GUIDE
(c) Referencing
All sources that you cite in your paper should be listed in the Reference
section at the end of your paper. Here is how you should list your
references:
Journal Article DuFour, R. (2002). The learning-centred principal:

Educational Leadership, 59(8). 12-15.
Online Journal Evnine, S. J. (2001). The universality of logic: On the
connection between rationality and logical ability
[Electronic version]. Mind, 110, 335-367.
Webpage National Park Service. (2003, February 11). Abraham
Lincoln Birthplace National Historic Site.
Retrieved February 13, 2003, from
http://www.nps.gov/abli/
Book Naisbitt, J. & Aburdence, M. (1989). Megatrends 2000.
London: Pan Books.
Article in a Nickerson, R. (1987). Why teach thinking? In J. B.
Book Baron & R.J. Sternberg (Eds). Teaching thinking
skills: Theory and practice. New York: W.H.
Freeman and Company. 27-37.
Printed Holden, S. (1998, May 16). Frank Sinatra dies at 82:
Newspaper Matchless stylist of pop. The New York Times, pp.
A1, A22-A23.
DETAILS ABOUT ASSIGNMENTS

Facilitator-Marked Assignment (FMA)
You will be able to complete the assignment from the information and materials
contained in your suggested readings and course content. However, it is
desirable in all graduate level education to demonstrate that you have read and
researched more widely than the required minimum. Using other references will
give you a broader perspective and may provide a deeper understanding of the
subject. When you have completed the assignment, submit it together with a
FMA form to your facilitator. Make sure that your assignment reaches the
facilitator on or before the deadline.

COURSE ASSIGNMENT GUIDE  xxv
General Criteria for Assessment of FMA

In general, your facilitator will expect you to write clearly, using correct spelling
(please use your spell checker) and grammar. Your facilitator will look for the
following:
 You have critically thought about issues raised in the course;
 You have considered and appreciated different points of views including

those in the course;
 You gave your own views and opinions;
 You stated your arguments clearly with supporting evidences and proper
referencing of sources; and
 You have drawn from your own experiences.

xxvi X COURSE ASSIGNMENT GUIDE

Topic  The Role of
Assessment in
1 Teaching and
Learning
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Differentiate between tests, measurement, evaluation and
assessment;
2. Explain the roles of assessment in teaching and learning;
3. Explain the general principles of assessment;
4. Differentiate between formative and summative assessments; and
5. Justify when norm-referenced and criterion-referenced tests are
adopted.
 INTRODUCTION
The topic discusses the difference between tests, measurement, evaluation and
assessment, the roles of assessment in teaching and learning, and some general
principles of assessment. Also explored is the difference between formative and
summative assessments as well as the difference between criterion and norm-
referenced tests. The topic concludes with a brief discussion on the current trends
in assessment.

2  TOPIC 1 THE ROLE OF ASSESSMENT IN TEACHING AND LEARNING
Can you differentiate between tests, measurement, evaluation and assessment

(see Figure 1.1)?
Figure 1.1: Can you differentiate between tests, measurement, and evaluation and
assessment?
1.1 TESTS, MEASUREMENT, EVALUATION

AND ASSESSMENT
It is not surprising that many people are confused with the fundamental
differences between tests, measurement, evaluation and assessment as they are
used in education. The following explanation on these concepts will help clarify
the confusion.
(a) Tests
Most people are familiar with tests because all of us, at some point in our
lives, have taken some form of tests. In school, tests are given to measure
our academic aptitude and indirectly to evaluate whether we have gained
from the teaching by the teacher. At the workplace, tests are conducted
to select suitable persons for specific jobs, tests are used as the basis for
job promotions and tests are used to encourage re-learning. Physicians,

TOPIC 1 THE ROLE OF ASSESSMENT IN TEACHING AND LEARNING  3
lawyers, insurance consultants, real-estate agents, engineers, civil servants

and many other professions are required to take tests to demonstrate their
competence in specific areas and in some cases to be licensed to practice
their profession or trade.
Throughout their professional careers, teachers, counsellors and school

administrators are required to conduct, score and interpret a wide variety
of tests. For example, school administrators rate the performance of
individual teachers and school counsellors record the performance of their
clients. It is possible that a teacher may construct, administer and grade
thousands of tests during his or her career!
Test items can be written in various formats including multiple-choice

questions, matching, true or false, short answer and essay. These formats
vary in strengths and weaknesses. No one format is ideal for all
circumstances.
According to the joint committee of the American Psychological

Association (APA), the American Educational Research Association
(AERA) and National Council on Measurement in Education (NCME),
A test may be thought of as a set of tasks or questions intended to elicit

particular types of behaviour when presented under standardised
conditions and to yield scores that will have desirable psychometric
properties (1974).
While most people know what a test is, many have difficulty differentiating
between measurement, evaluation and assessment. Some have even argued
that they are similar!
(b) Measurement
Measurement is the act of assigning numbers to a phenomenon. In
education, it is the process by which the attributes of a person are measured
and assigned numbers. Remember it is a process, indicating there are
certain steps involved!
As educators, we frequently measure human attributes such as attitudes,

academic achievements, aptitudes, interests, personality and so forth.
Hence, in order to measure we have to use certain instruments so that we
can conclude that Ahmad is better in mathematics than Kumar while
Tajang is more positive towards science than Kong Beng. We measure to
obtain information about „what is‰. Such information may or may not be

useful depending on the accuracy of the instruments we use and our skill at
using them. For example, we measure temperature using a thermometer,
and the thermometer is the instrument used.
How do you measure performance in mathematics, for example? We use a

mathematics test which is an instrument containing questions and
problems to be solved by students. The number of right responses obtained
is an indication of the performance of individual students in mathematics.
Note that we are only collecting information. We are not evaluating!
Evaluation is therefore quite different from measurement.
(c) Assessment and Evaluation

The literature has used the terms „assessment‰ and „evaluation‰ in
education as two different concepts. Evaluation is the process of making
judgments based on criteria and evidences. Assessment is the process of
documenting knowledge, skills, attitudes and beliefs, usually in measurable
terms. The goal of assessment is to make improvements as opposed to
simply being judged. In an educational context, assessment is the process of
describing, collecting, recording, scoring and interpreting information
about learning. The two terms are also used interchangeably.
For example, some authors used the term „formative evaluation‰ while
others use the term „formative assessment‰. We will use the two terms
interchangeably because there is too much overlap in the interpretations
of the two concepts. Generally, assessment is viewed as the process
of collecting information with the purpose of making decisions about
students.
We may collect information using various tests, observations of students

and interviews. Rowntree (1974) views assessment as,
A human encounter in which one person interacts with another directly

or indirectly with the purpose of obtaining and interpreting information
about the knowledge, understanding, abilities and attitudes possessed
by that person.
For example, based on assessment information we can determine whether

Chee Keong needs special services to assist him in developing his reading
skills or whether Khairul, who has been identified as a dyslexic child, needs
special attention. The key words in the definition of assessment is collecting
data and making decisions. Hence, in order to make decisions one has to
evaluate, which is the process of making judgement, about a given
situation.

When we evaluate, we are saying that something is good, appropriate,

valid or positive, for example. To make an evaluation, we need information
and the information is obtained by measuring using a reliable instrument.
For example, you measure the temperature in the classroom and it is 30
degrees Celsius, which is simply information. Some students may find the
temperature too hot for learning while others may say that it is ideal for
learning. At some point during the day, we are evaluating something or
someone!
Educators are constantly evaluating students. This is usually done in

comparison with some standards. For example, if the objective of the lesson
is for learners to apply BoyleÊs Law to solving a problem and 80 per cent of
learners are able to solve the problem, then the teacher may conclude that
his teaching of the principle was quite successful. Therefore, evaluation is
the comparison of what is measured against some defined criteria and to
determine whether it has been achieved, whether it is appropriate, whether
it is good, whether it is reasonable, whether it is valid and so forth.
1.2 WHY DO WE ASSESS?

Let us begin by asking the question, „Why do we, as educators, assess learners?‰
Some of us may find the question rather strange. The following may be a likely
response:
Question : Why do you assess?
Answer : Well, I assess to find out whether my students understand what

has been taught.
Question : What do you mean by „understand‰?
Answer : Whether they can remember what I taught them and able to solve
problems.
Question : What do you do with the test results?
Answer : Well, I provide students the right answers and point out the
mistakes made when answering the questions.
The above could be the reasons educators give when asked about the purpose of
assessment. In the context of education, assessment is performed to gain an
understanding of an individualÊs strengths and weaknesses in order to make
appropriate educational decisions. The best educational decisions are based on

information and better decisions are usually based on more information (Salvia &
Ysseldyke, 1995). Based on the reasons for assessment provided by Harlen (1978)
and Deale (1975), two main reasons are identified (see Figure 1.2).
Figure 1.2: Purpose of assessment
With regard to learning, assessment is aimed at providing information that will

help make decisions concerning remediation, enrichment, selection, exceptionality,
progress and certification. With regard to teaching, assessment provides
information regarding achievement of objectives, the effectiveness of teaching
methods and learning materials.
(a) Assess in Order to Help Learning

Let us look at the following aspects of assessment which are able to help
learning:
(i) Diagnosis
Diagnostic evaluation or assessment is performed at the beginning of
a lesson or unit for a particular subject area to assess studentsÊ
readiness and background for what is about to be taught. This pre-
instructional assessment is done when you need information on a
particular student, group of students or a whole class before you can
proceed with the most effective instructional method. For example,
you could administer a Reading Test to Year One students to assess
their reading level. Based on the information, you may want to assign
weak readers for special intervention or remedial action. On the other
hand, the test might reveal that some students are reading at an
exceptionally high level and you might want to recommend that they
be assigned to an enrichment programme (refer to Table 1.1).

Table 1.1: Purpose of Assessment
Types of Decisions
To Help
Questions to Be Answered
Learning:
Diagnosis for Should the student be sent for remedial classes so that
remedial action difficulty in learning can be overcome?
Diagnosis for Should the student be provided with enrichment
enrichment activities?
Exceptionality Does the student have special learning needs that
require special education assistance?
Selection Should the student be streamed into X or Y class?
Progress To what extent is the student making progress toward
specific instructional goals?
Communication How is the child doing in school and how can parents
to parents help?
Certification What are the strengths and weaknesses in the overall
performance of a student in specific areas assessed?
Administration How is the school performing in comparison to other
and counselling schools?
Why should students be referred to counselling?
(ii) Exceptionality
Assessment is also conducted to make decisions on exceptionality.
Based on information obtained from the assessment, teachers may
make decisions as to whether a particular student needs to be
assigned to a class with exceptional students. Exceptional students are
students who are physically, mentally, emotionally or behaviourally
different from the normal population. For example, based on the
assessment information, a child who is found to be dyslexic may be
assigned for special treatment or a student who has been diagnosed to
be learning disabled may be assigned for special education.
(iii) Certification
Certification is perhaps the most important reason for assessment. For
example, the Sijil Pelajaran Malaysia (SPM) is an examination aimed
at providing students with a certificate. The scores obtained are
converted into letter grades signifying performance level in various
subject areas and used as a basis for comparison between students.
The certificate obtained is further used for selecting students for
further studies, scholarships or jobs.
(iv) Placement
Besides certification, assessment is also conducted for the purpose of
placement. Students are endowed with varying abilities and one of
the tasks of the school is to place them according to their aptitude and
interests. For example, performance in the Pentaksiran Tingkatan 3
(PT3) (previously Penilaian Menengah Rendah) is used as the basis for
placing students in the arts or science stream. Assessment is also used
to stream students according to academic performance. It has been the
tradition that the „A‰ and „B‰ classes consist of high achievers based
on their results in the end of semester or end of year examinations.
Placement tests have even been used in preschools to stream children
according their literacy levels! The practice of placing students
according to academic achievement has been debated for decades
with some educationists arguing against it while others supporting its
merits.
ACTIVITY 1.1
„Streaming students according to academic abilities should be

discouraged in Malaysian schools‰. Discuss.
(v) Communicate to Parents

Parents want to know how their child is doing in school and
appreciate information, particularly specific examples, of their childÊs
progress. Showing examples of their childÊs work over time enables
parents to personally assess the growth and progress of their child. It
is essential to tell the whole story when reporting information about
performance progress. Talking to parents about standards, sharing
student work samples, using rubrics in conferences, and
differentiating between performance and progress are some ways to
ensure that they are given an accurate picture of their childÊs learning.
(vi) School Administration and Counselling

Schools collect assessment information in order to determine how the
school is performing in relation to other schools for a particular
semester or year. Assessment results are also used to compare
performance over the years for the same school. Based on the results,
school administrators may institute measures to remedy weaknesses
such as putting in more resources to help students who perform
poorly, for example, to increase the number of studentsÊ reading and
writing proficiency from poor to satisfactory level.

Assessment results (especially relating to socio-emotional

development) may be used by school administrators and counsellors
in planning intervention strategies for at-risk students. Assessment by
counsellors will enable them to identify students presenting certain
socio-emotional problems that require counselling services or referral
to specialists such psychiatrists, legal counsellors and law
enforcement authorities.
(b) Assess in Order to Improve Teaching

If 70 per cent of your students fail in a test, do you investigate whether your
teaching-learning strategy was appropriate or do you attribute it to your
students being academically weak in addition to not revising their studies?
Most educators would attribute the poor performance to the latter.
However, assessment information is valuable because it can indicate which
of the learning outcomes have been successfully achieved and which
instructional objectives students had most difficulty with. Assessment
results are valuable in providing clues to the effectiveness of the teaching
strategy implemented and teaching materials used. It also indicates
whether students had the required prior knowledge to grasp the concepts
and principles discussed (refer to Table 1.2).
Assessment data may also provide insights into why some teachers are
more successful in teaching a particular group of students while others are
less successful.
Table 1.2: Purpose of Assessment
To Improve Teaching: Questions to Be Answered

Objectives Were the desired learning outcomes achieved?
Teaching methods Were the teaching methods employed effective?
Prior knowledge Did students have relevant prior knowledge?
Teaching materials Were the teaching materials used effective?
Teacher differences Were particular teachers more effective than others?
ACTIVITY 1.2
To what extent have you used assessment data to review your teaching-
learning strategies? Discuss this with your course mates.

1.3 GENERAL PRINCIPLES OF ASSESSMENT

Let us discuss the general principles of assessment:
(a) What is to be Assessed has to be Clearly Specified

The specification of the characteristics to be measured should precede the
selection or development of assessment procedures. In assessing student
learning, the intended learning goals or outcomes should be clearly
specified. In other words, appropriate assessment procedures can only be
selected if there is clear specification of the intended learning outcomes to
be measured.
(b) An Assessment Procedure should be Selected Based on Its Relevance to the

Characteristics or Performance to be Measured
When selecting an assessment procedure to measure a specific learning
outcome, teachers should always ask if the procedure selected is the most
effective method for measuring the learning or development to be assessed.
There must be a close match between the intended learning outcomes and
the types of assessment tasks to be used. For example, if the teacher would
like to assess the studentsÊ ability to organise ideas, the use of multiple-
choice test would be a poor choice.
(c) Different Assessment Procedures are Required to Provide a Complete

Picture of Student Achievement and Development
No single assessment procedure can assess all the different learning
outcomes in a school curriculum. Different assessment procedures
(formative, summative and authentic assessments) can achieve different
objectives. For example, multiple-choice questions are useful for measuring
knowledge, understanding and application outcomes while essay tests are
appropriate for measuring the ability to organise and express ideas. Projects
that require conducting library research are needed to measure certain
skills in formulating and solving problems. Observational techniques are
needed to assess performance skills and various aspects of student
behaviour.
Using a wide variety of assessment tools allows a teacher to determine

which instructional strategies are effective and which needed to be
modified. In this way, assessment can be used to improve classroom
teaching, plan curriculum and research oneÊs own teaching practice.

(d) Assessment must be Aligned to Instruction

In order to encourage higher-order thinking and problem-solving skills,
there is an increased need to align the curriculum, instruction and
assessment. Classroom assessment techniques should be focusing on
aligning assessments more closely with the instructional strategies used in
the classroom. In this way, what is to be assessed in the classroom is
consistent with what has been taught and vice versa. For example, it would
not be fair to assess students on higher-order thinking skills when what is
taught is only lower level thinking skills.
1.4 TYPES OF ASSESSMENT

Before we proceed to discuss more about assessment, you need to be clear about
these often used concepts in assessment:
(a) Formative assessment (or evaluation) and summative assessment (or

evaluation);
(b) Criterion-referenced assessment and norm-referenced assessment; and
(c) Authentic assessment (Examples of authentic assessment will be discussed

in Topics 6 and 7).
Let us move on and read further.
1.4.1 Formative versus Summative Assessments

Teachers engage students in all aspects of their learning. There are many
strategies to accomplish this ranging from informal questioning, quiz and
observation to more formal monthly test and end of semester examination. In a
balanced assessment system, both summative and formative assessments are an
integral part of information gathering.
Assessment can be done at various times throughout the school year. A

comprehensive assessment plan will include both formative and summative
assessments. The point at which assessment occurs and the aim of assessing
distinguishes these two categories of assessment.

(a) Formative Assessment

Formative assessment is often done at the beginning or during the school
year, thus providing the opportunity for immediate evidence of student
learning in a particular subject area or at a particular point in the
programme. Classroom assessment is one of the most common formative
assessment techniques used. The purpose of this technique is to improve
the quality of student learning and should not be evaluative in nature or
involve grading students. In formative assessment, the teacher compares
the performance of a student to the performance of other students in the
class and not all students in the same year. Usually, a small section of the
content is tested to determine if the objectives have been met. Formative
assessment is action-oriented and forms the basis for improvement of
instructional methods (Scriven, 1996).
For example, if a teacher observes that some students still have not grasp
a concept, he may design a review activity or use a different instructional
strategy. Likewise, students can monitor their progress with periodic
quizzes and performance tasks. The results of formative assessments are
used to modify and validate instruction. In short, formative assessments are
ongoing and include reviews and observations of what is happening in the
classroom.
Formative assessments are generally low stakes, which means that they
have minimal or no point value.
(b) Summative Assessment
„When the cook tastes the soup, thatÊs formative evaluation; when the
guests taste the soup, thatÊs summative evaluation‰
ă Robert Stakes
Summative assessment is comprehensive in nature. It provides

accountability and is used to check the level of learning at the end of the
programme (which may be at the end of the semester, year or after two
years). For example, after five years in secondary school, students sit for the
Sijil Pelajaran Menengah (SPM) examination which is summative in nature
since it is based on the cumulative learning experiences of students.

Summative assessment is also used to evaluate the effectiveness of an

instructional programme at the end of an academic year or at a
predetermined time. The goal of summative assessment is to make a
judgment of student competency after an instructional phase is completed.
For example, national examinations are administered each year in
Malaysia. It is a summative assessment to determine each studentÊs
acquisition of several subject areas of between two to three years coverage
of content. Summative evaluations are used to determine if students have
mastered the specific competencies as set out by the programme and letter
grades are assigned to assess learner achievement.
Summative assessments are often high stakes, which means that they have
a high point value.
Table 1.3 highlights the differences between formative and summative

assessments.
Table 1.3: Differences between Formative and Summative Assessments
Criteria Formative Assessment Summative Assessment

Timing Conducted throughout the Conducted at the end of a teaching-
teaching-learning process. learning phase (for example, end of
semester or year).
Methods Paper and pencil tests, Paper and pencil tests, oral tests
observations, quizzes, exercises, administered to the group.
practical sessions administered to
the group and individually.
Aims  To assess progress and  Grading to determine if the
recommend remedial action for programme was successful.
non-achievement of objectives.  To certify students and improve
 Remediation or enrichment or the curriculum.
reteach the topic.
Examples Quizzes, essays, diagnostic tests, Final examination, national
lab reports and anecdotal records. examination and qualifying tests.

Can information from summative assessments be used formatively? Tests given

at the end of a unit of work provide information about how much each student
has learned or has not learned. If left as a summative process, then the studentsÊ
test marks merely become a record of their successes or failures at a point in time
and not directly helping them to improve. Moreover, in actual fact some students
will know more than what is indicated in their test performance.
However, when the information is used formatively, the tests results can provide
an important source of detailed, individualised feedback identifying where each
student needs to deepen their understanding and improve their recall of the
knowledge they have learned.
The more teachers know about individual students as they engage in the learning
process, the better teachers can adjust instruction to ensure that all students
continue to move forward in their learning.
Advantages of using information derived from summative assessments to

improve future student performance are:
(a) Summative data reveals how the students performed at the end of a
learning programme, namely advanced, proficient, basic or below basic.
For example, if a student has scored below basic in the semester exam and
exhibits signs of a struggling student, the teacher may want to place the
student at the front of the class so that the teacher can easily access the
student when the student needs extra support;
(b) Summative assessments can serve as a guide to improving teaching

methods. Teachers often employ varied teaching methodologies within the
classroom. Summative assessments can help teachers collaborate and
improve teaching methods from year to year; and
(c) Summative assessments help teachers and administrators in improving the

curriculum and curriculum planning. Standards-driven instruction plays a
large role in schools today. When summative assessments show consistent
gaps between student knowledge and learning targets, schools may turn to
the relevant authority in the Ministry of Education to suggest improving
curriculum planning or new curriculum to fill those learning gaps.
The data that is collected using a summative assessment can help teachers and
schools make decisions based on the instruction that has already been completed.
This contrasts with formative assessment, whereby formative assessment can
help teachers and students during the instruction process. It is important to
understand the difference between the two, as both assessments can play an
important role in education.

1.4.2 Norm-Referenced versus Criterion-Referenced

Tests
The main difference between norm-referenced tests and criterion-referenced tests
depends on the purpose or aim of assessing your students, the way in which
content is selected and the scoring process which defines how the test results
must be interpreted.
(a) Norm-referenced Tests

The major reason for norm-referenced tests is to classify students. These
tests are designed to highlight achievement differences between and among
students to produce dependable rank order of students across a continuum
of achievement from high-achievers to low achievers (Stiggins, 1994). With
norm-referenced tests, a representative group of students is given the test
and their scores form the norm after having gone through a complex
administration and analysis. Anyone taking the norm-referenced test can
compare his or her score against the norm.
For example, a student who obtained a score of 70 on a norm-referenced

test will not mean much until it is compared to the norm. When compared
to the norm, her score is in the 80th percentile which means that she
performed as well or better than 20 per cent of students in the norm group.
This type of information can be useful for deciding whether or not students
need remedial assistance or is a candidate for the gifted programme.
However, the score gives little information about what the student actually
knows or can do. A major criticism of norm-referenced tests is that they
tend to focus on assessing low level, basic skills (Romberg, 1989).
(b) Criterion-referenced Tests

Criterion-referenced tests determine what students can or cannot do and
not how they compare to others (Anastasi, 1988). Criterion-referenced tests
report how well students are doing relative to a predetermined
performance level on a specified set of educational goals or outcomes
included in the curriculum. Criterion-referenced tests are used when
teachers wish to know how well students have learned the knowledge and
skills which they are expected to have mastered. This information may be
used to determine how well the student is learning the desired curriculum
and how well the school is teaching that curriculum. Criterion-referenced
tests give detailed information about how well a student has performed on
each of the educational goals or outcomes included on that test. For
instance, a criterion-referenced test score might describe which arithmetic
operations a student can perform or the level of reading difficulty
experienced.

Table 1.4 illustrates the differences between norm-referenced and criterion-

referenced tests.
Table 1.4: Differences between Norm-Referenced and Criterion-Referenced Tests
Criteria Norm-referenced Tests Criterion-referenced Tests

Aims  Compare a studentÊs  Compare a studentÊs
performance with other performance against some
students. criteria.
 Select students for  Extent to which a student
certification. has acquired the knowledge
or skill.
 Improve teaching and
learning.
Types of Questions from simple to Questions of nearly similar
questions difficult. difficulty relating to the
criteria.
Reporting of Grades are assigned. No grades are assigned
results (whether skill or knowledge is
achieved).
Content Wide content coverage. Specific aspects of the content.
coverage
Examples UPSR, PT3, SPM national Class tests, exercises and
examinations, end of semester assignments.
examinations and end of year
examinations.
SELF-CHECK 1.1
1. Explain the differences between norm-referenced and criterion-

referenced tests.
2. Describe the main differences between formative and summative

assessments.

1.5 CURRENT TRENDS IN ASSESSMENT

In the last two decades, there have been major changes in assessment practices in
many parts of the world. Brown, Bull and Pendlebury (1997) identified the
following trends in educational assessment:
(a) Written examinations are gradually being replaced by more continuous

assessments and coursework;
(b) There is a move towards more student involvement and choice in

assessments;
(c) Group assessment is becoming more popular in an effort to emphasise

collaborative learning between students and to reduce excessive
competition;
(d) Subject areas and courses state more explicitly about the expectations in
assessment, more specifically the kinds of performance required from
students when they are assessed. This is unlike earlier practices where
assessment is so secretive and students had to figure out for themselves
what was required of them;
(e) An understanding of the process is now seen as, at least, equally important
to the knowledge of facts. This is in line with the general shift from
product-based assessment towards process-based assessment; and
(f) Student-focussed „learning outcomes‰ have begun to replace teacher-

oriented „objectives‰. The focus is more on what the student will learn
rather than what the teacher plans to teach.

Easing up on Exams
Putrajaya: Reducing the number of Among the measures proposed are:
examination subjects and having a
semester system are among the Ć Reducing the number of subjects
major changes being planned to in public examinations;
make the education system more Ć Emphasising skills and abilities
holistic and less focussed on rather than focusing on content
academic achievement. and achievement;
Education Minister, Datuk Seri Ć Encouraging personal
Hishamumuddin Tun Hussein said development through subjects
that these measures were in line like Art and Physical Education;
with the GovernmentÊs aim to and
reform the countryÊs education
system. „We do not intend to abolish Ć Improving teaching-learning
public or school-level examinations methods by encouraging more
totally, but we recognise that the project-based assignments.
present assessment system needs to
be looked at‰, he said. He said that emphasis should be on
individual accomplishments rather
than the schoolÊs performance in
public examinations and also
highlighting the individualÊs co-
curricular achievements.
[The Star, 21 March, 2006]
ACTIVITY 1.3
Refer to the report on „Easing up on Exams‰ and discuss to what extent

you agree with the measures proposed by the Ministry of Education to
reduce the exam-oriented education system in Malaysia.

 A test may be thought of as a set of tasks or questions intended to elicit

particular types of behaviour when presented under standardised conditions
and to yield scores that will have desirable psychometric properties.
 Measurement in education is the process by which the attributes of a person

are measured and assigned numbers.
 Assessment is viewed as the process of collecting information with the

purpose of making decisions regarding students.
 Assessment is aimed at helping the learner and to improve teaching.
 Summative assessment is comprehensive in nature, provides accountability

and is used to check the level of learning at the end of the programme.
 Formative assessment is often conducted at the beginning of or during the

school year, thus providing the opportunity for immediate evidence of
student learning in a particular subject area or at a particular point of time in
a programme.
 The information from summative assessment can be used formatively. The

students or schools can use the information to guide their efforts and
activities in subsequent courses.
 The major reason for norm-referenced tests is to classify students. These tests
are designed to highlight achievement differences between and among
students to produce dependable rank order of students.
 Criterion-referenced tests determine what students can or cannot do and not

how they compare to others.
Assessment Measurement
Criterion-referenced test Norm-referenced test
Evaluation Summative assessment
Formative assessment Test

Topic  Foundation for
Assessment:
2 What to Assess
LEARNING OUTCOMES
1. Justify the behaviours that are to be measured to present a holistic
assessment of students;
2. Describe the various types of cognitive learning outcomes to be
assessed;
3. Describe the various types of affective learning outcomes to be
assessed; and
4. Describe the various types of psychomotor learning outcomes to be
assessed.
 INTRODUCTION
If you were to ask a teacher, what should be assessed in the classroom, the
immediate response would be, of course, the facts and concepts taught. They are
the facts and concepts found in science, history, geography, language, arts,
religious education and other similar subjects. However, the Malaysian
Philosophy of Education states that education should aim towards the holistic
development of the individual. Hence, it is only logical that the assessment system
should also seek to assess more than the acquisition of the facts and concepts of a
subject area. What about assessment of physical and motor abilities? What about
socio-emotional behaviours such as attitudes, interests, personality and so forth?
Do they not contribute to the holistic person?
In this topic, you will learn the types of learning outcomes that need to be assessed
in a curriculum. The topic will conclude with a brief explanation on how to plan a
table of specification for a classroom test.
TOPIC 2 FOUNDATION FOR ASSESSMENT: WHAT TO ASSESS  21
2.1 IDENTIFYING WHAT TO ASSESS

When educators are asked what should be assessed in the classroom, the majority
would refer to evaluating the acquisition of the facts, concepts, principles,
procedures and methods of a subject area. You might find a minority of educators
who insist that skills acquired by learners should also be assessed especially in
subjects such as physical education, art, drama, music, technical drawing,
carpentry, automobile engineering and so forth. Even fewer educators would
propose that the socio-emotional behaviour of learners should also be assessed.
National Philosophy of Malaysian Education

Education in Malaysia is an ongoing effort towards further developing the
potentials of individuals in a holistic and integrated manner so as to produce
individuals who are intellectually, spiritually, emotionally and physically
balanced and harmonic, based on a firm belief in and devotion to God. Such an
effort is designed to produce Malaysian citizens who are knowledgeable and
competent, who possess high moral standards and who are responsible and
capable of achieving a high level of personal well-being as well as being able to
contribute to the harmony and betterment of the family, society and the nation
at large.
[Curriculum Development Centre, Ministry of Education Malaysia, 1988]
The National Philosophy of Malaysian Education has important implications for

assessment. Theoretically, a comprehensive assessment system should seek to
provide information on the extent to which the National Philosophy of Education
has achieved its goal. In other words, the assessment system should seek to
determine:
(a) Whether our schools have developed „the potentials of individuals in a
holistic and integrated manner‰;
(b) Whether our students are „intellectually, spiritually, emotionally and
physically balanced‰;
(c) Whether our students are „knowledgeable and competent‰ and „possess
high moral standards‰;
(d) Whether our students have a „high level of personal well-being‰; and
(e) Whether students are equipped with the abilities and attitudes that will
enable them „to contribute to the harmony and betterment of the family,
society and the nation at large‰.
22  TOPIC 2 FOUNDATION FOR ASSESSMENT: WHAT TO ASSESS
On the contrary, in actual practice the assessment tends to overemphasise on

intellectual competence which translates into the measurement of cognitive
learning outcomes of specific subject areas. The other aspects of the holistic
individual are given minimal attention because of various reasons. For example,
how does a teacher assess the spiritual or emotional growth and development?
These are constructs that are difficult to evaluate and extremely subjective. Hence,
it is no surprise that assessment of cognitive outcomes has remained the focus of
most assessment systems all over the world because it is relatively easier to
observe and measure. However, in this topic we will make an attempt to present
a more „holistic‰ assessment of learning, focusing on three main types of human
behaviour. These are behaviours that psychometricians and psychologists have
attempted to assess and are closely aligned to realising the goals of the National
Philosophy of Malaysian Education.
2.2 THREE TYPES OF LEARNING OUTCOMES

Few people will dispute that the purpose of schooling is the development of the
holistic person. In the late 1950s and early 1960s, a group of psychologists and
psychometricians proposed that schools should seek to assess three domains of
learning outcomes (see Figure 2.1):
(a) Cognitive learning outcomes (knowledge or mental skills);
(b) Affective learning outcomes (feelings or emotions); and
(c) Psychomotor learning outcomes (manual or physical skills).
Figure 2.1: Holistic assessment of learners

Domains can be thought of as categories. Educators often refer to these three

domains as KSA (Knowledge, Skills and Attitude). Each domain consists of
subdivisions, starting from the simplest behaviour to the most complex, thus
forming taxonomy of learning outcomes. Each of the taxonomy of learning
behaviour can be thought of as „the goals of the schooling process.‰ That means
that after schooling, the learner should have acquired new skills, knowledge,
and/or attitudes. However, the levels of each division outlined are not absolutes.
While there are other systems or hierarchies that have been devised in the
educational world, these three taxonomies are easily understood and are probably
the most widely used today.
To assess the three domains, one has to identify and isolate the behaviour that
represents these domains. When we assess we evaluate some aspects of the
learnerÊs behaviour, for example, his ability to compare, explain, analyse, solve,
draw, pronounce, feel, reflect and so forth. The term „behaviour‰ is used broadly
to include the learnerÊs ability to think (cognitive), feel (affective) and perform a
skill (psychomotor). For example, you have just taught about „The Rainforest of
Malaysia‰ and you would like to assess your students in their:
(a) Thinking ă You might ask them to list the characteristics of the Malaysian
rainforest and compare it with the coniferous forest of Canada;
(b) Feelings (emotions, attitudes) ă You could ask them to design an exhibition
on how students could contribute towards conserving the rainforest; and
(c) Skill ă You could ask them to prepare satellite maps about the changing
Malaysian rainforest by accessing websites from the Internet.
ACTIVITY 2.1
Refer to Figure 2.1. To what extent are affective and psychomotor

behaviours assessed in your institution? Explain and discuss with your
coursemates.

2.3 ASSESSING COGNITIVE LEARNING

OUTCOMES
When we evaluate or assess a human being, we are assessing or evaluating the
behaviour of a person. This might be a bit confusing to some people. Are we
not assessing a personÊs understanding of the facts, concepts and principles of a
subject area? Every subject, whether it is history, science, geography, economics,
or mathematics, has its unique repertoire of facts, concepts, principles,
generalisations, theories, laws, procedures and methods that are transmitted to
learners (illustrated in Figure 2.2).
Figure 2.2: Contents of a subject assessed
When we assess we do not assess the learnerÊs store of the facts, concepts or
principles of a subject but rather what the learner is able to do with the facts,
concepts or principles of a subject area. For example, we evaluate the learnerÊs
ability to compare facts, explain the concept, analyse a generalisation (or
statement) or solve a problem based on a given principle. In other words, we assess
the understanding or mastery of a body of knowledge based upon what the learner
is able to do with the contents of the subject. Let us look at two mechanisms used
to measure or assess cognitive learning, namely BloomÊs Taxonomy and The
Helpful Hundred.
(a) BloomÊs Taxonomy

In 1956, Benjamin Bloom led a group of educational psychologists to develop
a classification of levels of intellectual behaviour which are important to
learning. They found that over 95 per cent of the test questions which
learners encountered required them to think only at the lowest possible level,
that is the recall of information. Bloom and his colleagues developed a widely
accepted taxonomy (method of classification on differing levels) for cognitive
objectives. This is referred to as BloomÊs Taxonomy (see Figure 2.3). There
are six levels in BloomÊs classification with the lowest level termed
knowledge. The knowledge level is followed by five increasingly difficult
levels of mental abilities: comprehension, application, analysis, synthesis and
evaluation.
Figure 2.3: BloomÊs taxonomy of cognitive learning outcomes
Now, read further to find out what each level constitutes.
(i) Knowledge: The behaviours at the knowledge level require learners to

recall specific information. The knowledge level is the lowest cognitive
level. Examples of verbs describing behaviours at the knowledge level
include the ability to list, define, name, state, recall, match, identify, tell,
label, underline, locate, recognise, select and so forth. For example,
learnersÊ ability to recite the factors leading to the World War II, quote
formula for density and force, tell laboratory safety rules.
(ii) Comprehension: The behaviours at the comprehension level which is a

higher level of mental ability than the knowledge level require the
understanding of the meaning of concepts and principles, translation
of words and phrases into oneÊs own words, interpolation which
involves filling in missing information, interpretation which involves
inferring and going beyond the given information. Examples of verbs
describing behaviours at the comprehension level are explain,
distinguish, infer, interpret, convert, generalise, defend, estimate,
extend, paraphrase, retell using own words, predict, rewrite,
summarise, translate and so forth. For example, learners are able to

rewrite NewtonÊs three laws of motion, explain in oneÊs own words the
steps for performing a complex task and translate an equation into a
computer spreadsheet.
(iii) Application: The behaviours at the application level require learners to

apply a rule or principle learned in the classroom into novel or new
situations in the workplace or unprompted use of an abstraction.
Examples of verbs describing behaviours at the application level are
apply, change, compute, demonstrate, discover, manipulate, modify,
give an example, operate, predict, prepare, produce, relate, show,
solve, use and so forth. For example, learners are able to use the formula
for projectile motion to calculate the maximum distance a long jumper
jumps and apply laws of statistics to evaluate the reliability of a written
test.
(iv) Analysis: The behaviours at the analysis level require learners to

identify component parts and describe their relationship, separate
material or concepts into component parts so that its organisational
structure may be understood and distinguish between facts and
inferences. Examples of verbs describing behaviours at the analysis
level are analyse, break down, compare, contrast, diagram,
deconstruct, examine, dissect, differentiate, discriminate, distinguish,
identify, illustrate, infer, outline, relate, select, separate and so forth.
For example, learners are able to troubleshoot a piece of equipment by
using logical deduction, recognise logical fallacies in reasoning, analyse
information from a company and determine needs for training.
(v) Synthesis: The behaviours at the synthesis level require learners to

build a structure or pattern from diverse elements and put parts
together to form a whole with emphasis on creating a new meaning or
structure. Examples of verbs describing behaviours at the synthesis
level are categorise, combine, compile, compose, create, devise, design,
explain, generate, modify, organise, plan, rearrange, reconstruct, relate,
reorganise, find an unusual way, formulate, revise, rewrite,
summarise, tell, write and so forth. For example, learners are able to
write a creative short story, design a method to perform a specific task,
integrate ideas from several sources to solve a problem, devise a new
plan of action to improve the outcome.

(vi) Evaluation: The behaviours at the evaluation level require learners to

make judgment about materials and methods, and the value of ideas or
materials. Examples of verbs describing behaviours at the evaluation
level are appraise, compare, conclude, contrast, criticise, critique,
defend, describe, rank, give your own opinion, discriminate, evaluate,
explain, interpret, value, justify, relate, summarise, support and so
forth. For example, learners are able to evaluate and decide on the most
effective solution to a problem, justify the choice of a new procedure or
course of action.
(b) The Helpful Hundred

Heinich, Molenda, Russell and Smaldino (2001) suggested 100 verbs that
highlight performance or behaviours that are observable and measurable.
This is not to say that these 100 verbs are the only ones but they definitely are
a great reference for educators. Table 2.1 displays the verbs that would be
appropriate to use when you are writing instructional objectives in each level
of BloomÊs Taxonomy.
Table 2.1: The Helpful Hundred
add compute drill label predict state

alphabetise conduct estimate locate prepare subtract
analyse construct evaluate make present suggest
apply contrast explain manipulate produce swing
arrange convert extrapolate match pronounce tabulate
assemble correct fit measure read throw
attend cut generate modify reconstruct time
bisect deduce graph multiply reduce translate
build defend grasp name remove type
cave define grind operate revise underline
categorise demonstrate hit order select verbalise
choose derive hold organise sketch verify
classify describe identify outline ski weave
colour design illustrate pack solve weigh
compare designate indicate paint sort write
complete diagram install plot specify
compose distinguish kick position square

In 2001, Krathwohl and Anderson modified the original BloomÊs Taxonomy (1956).
They identified and isolated the following list of behaviours that an assessment
system should address (see Table 2.2).
Table 2.2: Revised Version of BloomÊs Taxonomy
Category and Cognitive Process Alternative Names

1. Remembering
Ć Recognising Ć Identifying
Ć Recalling Ć Retrieving
2. Understanding
Ć Interpreting Ć Clarifying, paraphrasing, representing,
Ć Exemplifying translating
Ć Classifying Ć Illustrating, instantiating
Ć Summarising Ć Categorising, subsuming
Ć Inferring Ć Abstracting, generalising
Ć Comparing Ć Concluding, extrapolating, interpolating,
predicting
Ć Explaining
Ć Contrasting, mapping, matching
Ć Constructing models
3. Applying
Ć Executing Ć Carrying out
Ć Implementing Ć Using
4. Analysing
Ć Differentiating Ć Discriminating, distinguishing, focusing,
Ć Organising selecting
Ć Attributing Ć Finding coherence, integrating, outlining,
structuring
Ć Deconstructing
5. Evaluating
Ć Checking Ć Coordinating, detecting, monitoring, testing
Ć Criticising Ć Judging
6. Creating
Ć Generating Ć Hypothesising
Ć Planning Ć Designing
Ć Producing Ć Constructing
Source: Krathwohl and Anderson (2001)

Note that the sequencing of some of the levels has been rearranged and renamed.
The first two original levels of „knowledge‰ and „comprehension‰ were replaced
with „remembering‰ and „understanding‰ respectively. The „synthesis‰ level was
renamed with the term „creating‰. Note that in the original taxonomy the sequence
was „synthesis‰ followed by „evaluate‰. In the modified taxonomy, the sequence
was rearranged to „evaluating‰ followed by „creating‰.
As you can see, the primary differences between the original and the revised
taxonomy are not in the listings or rewordings from nouns to verbs, or in the
renaming of some of the components, or even in the re-positioning of the last two
categories. The major differences lie in the more useful and comprehensive
additions of how the taxonomy intersects and acts upon different types and levels
of knowledge ă factual, conceptual, procedural and metacognitive.
(a) Factual knowledge refers to essential facts, terminologies, details or elements

that learners must know or be familiar with in order to understand a
discipline or solve a problem with it;
(b) Conceptual knowledge is knowledge of classifications, principles,

generalisations, theories, models or structures pertinent to a particular
disciplinary area;
(c) Procedural knowledge refers to information or knowledge that helps

learners do something specific to a discipline, subject or area of study. It also
refers to the methods of inquiry, very specific or finite skills, algorithms,
techniques and particular methodologies; and
(d) Metacognition is simply thinking about oneÊs thinking. More precisely, it

refers to the processes used to plan, monitor and assess oneÊs understanding
and performance. Activities such as planning how to approach a given
learning task, monitoring comprehension and evaluating progress toward
the completion of a task are metacognitive in nature.
BloomÊs aim was to promote higher forms of thinking in education such as

analysing, creating and evaluating rather than just teaching learners to remember
facts (rote learning). Higher-order thinking (HOT) takes thinking to higher levels
other than restating the facts and requires learners to do something with the facts
such as understand them, infer from them, connect them to other facts and
concepts, categorise them, manipulate them, put them together in new or novel
ways, and apply them as we seek new solutions to new problems.

SELF-CHECK 2.1
1. Explain the differences between analysis and synthesis according

to BloomÊs Taxonomy.
2. How is the revised version of BloomÊs Taxonomy different

from the original version?
3. Do you agree that BloomÊs Taxonomy is a hierarchy of cognitive

abilities?
4. You need to be able to „analyse‰ before being able to „evaluate‰.

Comment.
5. Using The Helpful Hundred, list the learning outcomes appropriate

for the subject area you are teaching.
2.4 ASSESSING AFFECTIVE OUTCOMES

Affective characteristics involve the feelings or emotions of a person. Attitudes,
values, self-esteem, locus of control, self-efficacy, interests, aspirations and anxiety
are all examples of affective characteristics. Unfortunately, affective outcomes
have not been a central part of our education system even though they are
arguably as important as, or even more important than, any cognitive or
psychomotor domain of learning outcomes targeted by schools. Some possible
reasons for the lack of emphasis on affective outcomes include:
(a) The belief that the development of appropriate feelings is the task of the
family and religion;
(b) The belief that appropriate feelings develop automatically from knowledge
and experience with content and do not require any special pedagogical
attention; and
(c) Attitudinal and value-oriented instructions are difficult to develop and

assess because:
(i) Affective goals are intangible;
(ii) Affective outcomes cannot be attained in the typical periods of

instruction offered in schools;

(iii) Affective characteristics are considered to be private rather than public

matters; and
(iv) There are no sound methods to gather information about affective

characteristics.
However, affective goals are no more intangible than cognitive ones. Some have
claimed that affective behaviours can be developed automatically when specific
knowledge are taught while others argue that affective behaviours have to be
explicitly developed in schools (see Figure 2.4). Affective goals do not necessarily
take longer to achieve in the classroom than cognitive goals. All that is required is
to state a goal more concretely and behaviourally-oriented so that it can be
assessed and monitored.
Figure 2.4: Affective behaviours can be developed in the classroom
There is also the belief that affective characteristics are private and should not be
made public. While people value their privacy, the public also has the right to
information. If the information gathered is needed to make a decision, then the
gathering of such information is not generally considered an invasion of privacy.
For example, if an assessment is used to determine whether a learner needs further
attention such as special education, then gathering such information is not an
invasion of privacy. On the other hand, if the information being sought-after is not
relevant to the stated purpose, then gathering of such information is likely to be
an invasion of privacy.

Similarly, information about affective characteristics can be used for good or bad.
For example, if a mathematics teacher discovers a learner has a negative attitude
towards mathematics and ridicules that learner in front of the class, then the
information has been misused. But if the teacher uses the information to change
his instructional methods so as to help the learner develop a more positive attitude
towards mathematics, then the information has been used wisely. Krathwohl,
Bloom and Bertram and their colleagues developed the affective domain in 1973
which deals with things emotionally such as feelings, values, appreciation,
enthusiasm, motivation and attitudes. The five major categories which listed the
simplest behaviour to the most complex behaviour are receiving, responding,
valuing, organisation and characterisation (see Figure 2.5).
Figure 2.5: Krathwohl, Bloom & BertramÊs taxonomy of affective learning outcomes
Let us read on to find about more about the taxonomy.
(a) Receiving
The behaviours at the receiving level require the learner to be aware of,
willing to hear and focus his or her attention. Examples of verbs describing
behaviours at the receiving level are ask, choose, describe, follow, give, hold,
locate, name, point to, reply and so forth.
For example, the learner:
(i) Listens to others with respect; and
(ii) Listens for and remembers the names of other learners.
(b) Responding
The behaviours at the responding level require the learner to be an active
participant, attend to and react to a particular phenomenon, be willing to
respond and gain satisfaction in responding (motivation). Examples of verbs
describing behaviours at the responding level are answer, assist, aid, comply
with, conform, discuss, greet, help, label, perform, practise, present, read,
recite, report, select, tell, and write.

(i) Participates in class discussion;
(ii) Gives a presentation; and
(iii) Questions new ideals, concepts and models in order to fully

understand them.
(c) Valuing
This level relates to the worth or value a person attaches to a particular object,
phenomenon or behaviour. It ranges from simple acceptance to the more
complex state of commitment. Valuing is based on the internalisation of a set
of specified values while clues to these values are expressed in the learner as
overt behaviours and are often identifiable. Examples of verbs describing
behaviours at the valuing level are demonstrate, differentiate, explain,
follow, form, initiate, invite, join, justify, propose, read, report, select, share,
study and work.
(i) Demonstrates belief in the democratic process;
(ii) Is sensitive towards individual and cultural differences (value

diversity);
(iii) Shows the ability to solve problems;
(iv) Proposes a plan to social improvement; and
(v) Follows through with commitment.
(d) Organisation
At this level, a person organises values into priorities by contrasting different
values, resolving conflicts between them and creating a unique value system.
The emphasis is on comparing, relating and synthesising values. Examples
of verbs describing behaviours at the level of organisation are adhere to,
alter, arrange, combine, compare, complete, defend, explain, formulate,
generalise, identify, integrate, modify, order, organise, prepare, relate and
synthesise.
(i) Recognises the need for balance between freedom and responsible
behaviour;
(ii) Accepts responsibility for his behaviour;

(iii) Explains the role of systematic planning in solving problems;
(iv) Accepts professional ethical standards;
(v) Creates a life plan in harmony with abilities, interests and beliefs; and
(vi) Prioritises time effectively to meet the needs of the organisation, family
and self.
(e) Characterisation
At this level, a personÊs value system controls his behaviour. The behaviour
is pervasive, consistent, predictable and most importantly, characterises the
learner. Examples of verbs describing behaviours at this level are act,
discriminate, display, influence, listen, modify, perform, practise, propose,
qualify, question, revise, serve, solve and verify.
(i) Shows self-reliance when working independently;
(ii) Cooperates in group activities (displays teamwork);
(iii) Uses an objective approach in problem-solving;
(iv) Displays a professional commitment to ethical practice on a daily basis;
(v) Revises judgment and changes behaviour in light of new evidence; and
(vi) Values people for what they are, not how they look.
Table 2.3 shows how the affective taxonomy may be applied to a value such as
honesty. It traces the development of an affective attribute such as honesty from
the „receiving‰ level up to the „characterisation‰ level where the value becomes a
part of the individualÊs character.
Table 2.3: An Affective Taxonomy for Honesty
Individual Character Explanation

Receiving (Attending) Aware that certain things are honest or dishonest
Responding Saying that honesty is better and behaving
accordingly
Valuing Consistently (but not always) telling the truth
Organisation Being honest in various situations
Characterisation by a value or Honest in most situations, expects others to be honest
value complex and interacts with others with all honesty

SELF-CHECK 2.2
1. Explain the differences between characterisation and valuing

according to the Affective Taxonomy of learning outcomes.
2. „A student is operating at the responding level.‰
What does the statement mean? Explain.
ACTIVITY 2.2
1. The Role of Affect in Education
„Some say schools should only be concerned with content.‰
„It is impossible to teach content without teaching affect as well.‰
„To what extent, if at all, should we be concerned with the
assessment of affective outcomes?‰
Discuss the three statements in the context of the Malaysian
education system. Share your answer with your course mates.
2. Select any TWO values from the list of 16 universal values and
design an affective taxonomy for each value as shown in Table 2.3.
Share your answer with your course mates.
2.5 ASSESSING PSYCHOMOTOR LEARNING

OUTCOMES
The psychomotor domain includes physical movement, coordination and use of
the motor-skill areas. Development of these skills requires practice and is
measured in terms of speed, precision, distance, procedures or techniques in
execution. The seven major categories listed from the simplest behaviour to the
most complex are shown in Figure 2.6.
Figure 2.6: The taxonomy of psychomotor learning outcomes

Let us read further to find out more about psychomotor learning.
(a) Perception
Perception is the ability to use sensory cues to guide motor activity. This
ranges from sensory stimulation through cue selection to translation.
Examples of verbs describing these types of behaviours are choose, describe,
detect, differentiate, distinguish, identify, isolate, relate and select.
(i) Detects non-verbal communication cues from the coach;
(ii) Estimates where a ball will land after it is thrown and then moving to
the correct location to catch the ball;
(iii) Adjusts heat of the stove to the correct temperature through smell and
taste of food; and
(iv) Adjusts the height of the ladder in relation to the point on the wall.
(b) Set
It includes mental, physical and emotional sets. These three sets are
dispositions that predetermine a personÊs response to different situations
(sometimes called mindset). Examples of verbs describing „set‰ are begin,
display, explain, move, proceed, react, show, state and volunteer.
(i) Knows and acts upon a sequence of steps in a manufacturing process;
(ii) Recognises his abilities and limitations; and
(iii) Shows desire to learn a new process (motivation).
 Note: This subdivision of the psychomotor domain is closely

related with the „responding‰ subdivision of the affective domain.
(c) Guided Response

Guided response refers to the early stages of learning a complex skill which
includes imitation and trial and error. Adequacy of performance is achieved
by practicing. Examples of verbs describing „guided response‰ are copy,
trace, follow, react, reproduce and respond.

(i) Performs a mathematical equation;
(ii) Follows instructions when building a model of a kampung house; and
(iii) Responds to hand signals of the coach while learning gymnastics.
(d) Mechanism
This is the intermediate stage in learning a complex skill. Learned responses
have become habitual and the movements can be performed with some
confidence and proficiency. Examples of verbs describing „mechanism‰
include assemble, calibrate, construct, dismantle, display, fasten, fix, grind,
heat, manipulate, measure, mend, mix and organise.
(i) Uses a computer;
(ii) Repairs a leaking tap;
(iii) Fixes a three-pin electrical plug; and
(iv) Rides a motorbike.
(e) Complex Overt Response

Complex overt response involves the skilful performance of motor acts that
involve complex movement patterns. Proficiency is indicated by quick,
accurate and highly coordinated performance requiring minimum energy.
This category includes performing without hesitation and automatic
performance. For example, tennis players such as Maria Sharapova and
Serena Williams often utter sounds of satisfaction or expletives as soon as
they hit a tennis ball. Similarly for golf players when they immediately
realised that they have hit a bad shot. This is because they can tell by the feel
of the act and the result that will follow. Examples of verbs describing
„complex overt responses‰ are assemble, build, calibrate, construct,
dismantle, display, fasten, fix, grind, heat, manipulate, measure, mend, mix,
organise and sketch.
(i) Manoeuvres a car into a tight parallel parking spot;
(ii) Operates a computer quickly and accurately; and
(iii) Displays competence while playing the piano.

 Note that many of the verbs are the same as „mechanism‰, but will
have adverbs or adjectives that indicate that the performance is
quicker, better and more accurate.
(f) Adaptation
Skills are well-developed and the individual can modify movement patterns
to fit special requirements. Examples of verbs describing „adaptation‰ are
adapt, alter, change, rearrange, reorganise, revise and vary.
(i) Responds effectively to unexpected experiences;
(ii) Modifies instructions to meet the needs of learners;
(iii) Performs a task with a machine that it was originally not designed to
do (assuming that the machine is not damaged and there is no danger
in performing the new task).
(g) Origination
Origination is about creating new movements or patterns to fit a particular
situation or specific problem. Learning outcomes emphasise creativity based
upon highly developed skills. Examples of verbs describing „origination‰ are
arrange, build, combine, compose, construct, create, design, initiate, make
and originate.
(i) Constructs a new theory;
(ii) Develops a new technique for goalkeeping; and
(iii) Creates a new gymnastic routine.
SELF-CHECK 2.3
1. Explain the differences between adaptation and guided response
according the Psychomotor Taxonomy of Learning Outcomes.
2. „A student is operating at the origination level.‰
What does the statement mean? Explain.

As a guide, Table 2.5 shows the allotment of time for each type of question.
Table 2.5: Allotment of Time for Each Type of Question
Task Approximate Time Per Item

True-False 20ă30 seconds
Multiple-choice (factual) 40ă60 seconds
Multiple-choice (complex) 70ă90 seconds
Matching (5 stems or 6 choices) 2ă4 minutes
Short answers 2ă4 minutes
Multiple-choice (with calculations) 2ă5 minutes
Word problems (simple math) 5ă10 minutes
Short essays 15ă20 minutes
Data analysis or graphing 15ă25 minutes
Extended essays 35ă50 minutes
Once your questions are developed, make sure that you include clear instructions
for the learners. For the objective items, specify that they should select one answer
for each item and indicate the point value of each question, especially if you are
allocating different weightage to different sections of the test. For essay items,
indicate the point value and suggested time to be spent on the item. We will
discuss different types of questions in more detail in Topics 3 and 4. If you are
teaching a large class with close seating arrangements and are giving an objective
test, you may want to consider administering several versions of your test to
minimise the opportunities for cheating. This is done by creating versions of your
test with different numberings of the items.
 Assessment of cognitive outcomes has remained the focus of most assessment

systems all over the world because it is relatively easier to observe and
measure.
 Each domain of learning consists of subdivisions, starting from the simplest

behaviour to the most complex thus forming taxonomy of learning outcomes.
 When we evaluate or assess a human being, we are assessing or evaluating the

behaviour of a person.

 Every subject area has is unique repertoire of facts, concepts, principles,

generalisations, theories, laws, procedures and methods that are transmitted
to learners.
 There are six levels in BloomÊs taxonomy of cognitive learning outcomes with
the lowest level termed knowledge followed by five increasingly difficult
levels of mental abilities, which are comprehension, application, analysis,
synthesis and evaluation.
 The six levels in the revised version of BloomÊs taxonomy are remembering,
understanding, applying, analysing, evaluating and creating.
 Affective characteristics involve the feelings or emotions of a person. Attitudes,

values, self-esteem, locus of control, self-efficacy, interests, aspirations and
anxieties are examples of affective characteristics.
 The five major categories of the affective domain from the simplest behaviour
to the most complex behaviour are receiving, responding, valuing,
organisation and characterisation.
 The psychomotor domain includes physical movement, coordination, and use

of the motor-skill areas.
 The seven major categories of the psychomotor domain from the simplest
behaviour to the most complex are perception, set, guided response,
mechanism, complex overt response, adaptation and origination.
Affective learning outcomes Holistic assessment

BloomÊs taxonomy Psychomotor learning outcome
Cognitive learning outcomes The Helpful Hundred

Topic  Planning the
Classroom Test
3
LEARNING OUTCOMES
1. Plan a classroom test by deciding on its purpose;
2. Specify the learning objectives of a classroom test;
3. Develop test specifications of a classroom test;
4. Select best item types of a classroom test;
5. Decide on the rubrics or marking schemes of a classroom test;
6. Prepare the test items for a classroom test; and
7. Describe the considerations in preparing relevant test items.
 INTRODUCTION
In this topic we will focus on methods of planning classroom tests. Testing is part
of the teaching and learning process. The importance of planning and writing a
reliable, valid and fair test cannot be underestimated. Designing tests is an
important part of assessing learnersÊ understanding of course content and their
level of competency in applying what they have learned. Whether you use low-
stake quizzes or high-stake mid and final semester examinations, careful design of
the tests will help provide more calibrated results. Assessments should reveal how
well learners have learned based on what the teachers want them to learn while
the instructions facilitates their learning. Thus, solely conducting a summative
assessment at the end of a teaching programme is not sufficient. It is helpful to
think about assessing learners at every stage of the planning process. Identifying
ways in which to assess their learners help determine the most suitable learning
activities.

42  TOPIC 3 PLANNING THE CLASSROOM TEST
In this topic we will discuss the general guidelines applicable to most assessment
tools. Topics 4 and 5 will discuss in detail the objective and essay tests. Authentic
assessment tools such as projects and portfolios will be discussed in the respective
topics.
3.1 PURPOSE OF CLASSROOM TESTING

Tests can be referred to as:
(a) Traditional paper and pencil or computer-based tests in the form of multiple-
choice, short answer or essay tests; and
(b) Performance assessments such as projects, interviews, presentations or other

alternative assessments.
Tests provide teachers with an objective feedback as to how much learners have
learned and understand the subject taught. Commercially published achievement
tests can provide, to some extent, evaluation of the knowledge levels of individual
learners but only limited instructional guidance in assessing a wide range of skills
taught in any given classroom.
Teachers know their learners. Tests developed by the individual teachers for use
in their own class are the most instructionally relevant. Teachers can tailor tests to
emphasise the information they consider important and to match the ability levels
of their learners. If carefully constructed, classroom tests can provide teachers with
accurate and useful information about the knowledge retained by their learners.
The key to this process is the test questions that are used to elicit evidence of
learning. Test questions and tasks are not just a planning tool; they also form an
essential part of the teaching sequence. Incorporating the tasks into teaching and
using the evidence of the learnersÊ learning to determine what happens next in the
lesson is truly an embedded formative assessment.
„Sharing high quality questions may be the most significant thing we can do to
improve the quality of student learning,‰ (William, D., 2011).

TOPIC 3 PLANNING THE CLASSROOM TEST  43
3.2 PLANNING THE CLASSROOM TEST

A well-constructed test must have high quality items. A well-constructed test is an
instrument that is able to provide accurate measure of the test takerÊs ability within
a particular domain. It is worth spending time to write high quality items for tests.
In order to produce high quality questions, the test construction has to be properly
planned. The following are the six steps of test planning:
(a) Deciding its purpose;
(b) Specifying the learning objectives;
(c) Developing test specifications;
(d) Selecting best item types;
(e) Deciding its rubrics or marking scheme; and
(f) Preparing test items.
We will look at these in detail in the following subtopics.
3.2.1 Deciding its Purpose

The first step in test planning is to decide on the purpose of the test. Tests can be
used for many different purposes. If the test is to be used formatively, the test can
indicate precisely what the learner needs to study and to what level. The purpose
of formative tests is to assess progress and to direct the learning process. These
tests will have limited sample of learning outcomes. Teachers must prepare a
sufficient mix and difficulty levels of items. These items are used to make
corrective prescriptions such as a review for the whole group and practice
exercises for some learners. If the test is to be used in a summative manner, the test
will have broad coverage of objectives.
Another important purpose of testing is selection. The test is used to make

decisions about whether or not candidates will be admitted, pass or fail. Tests that
lead to an assessment and provide evidence that a given level of knowledge or
skill has been mastered (for example by awarding course credits) are referred to as
summative tests.

Tests can also serve a diagnostic purpose. In such cases, the test is used to provide
learners with insights into gaps in their current knowledge and skill sets.
Alternatively, tests can also be used to motivate learners to exhibit effective
studying behaviour.
3.2.2 Specifying the Learning Objectives

The next step is to consider the learning objectives and the relative importance of
the learning objectives. Teachers will have to select the appropriate knowledge and
skills to be assessed and include more questions with regard to the more important
learning objectives.
The learning objectives that the teachers would like to emphasise on will
determine not only what materials to include in the test but also the specific form
of the test. For example, if it is important for learners to be able to solve long
division problems rapidly, consider giving a speed test. If it is important for
learners to understand how historical events affect one another, short answer or
essay questions might be appropriate. If it is important that learners remember
dates, multiple-choice or fill-in-the-blank questions might be appropriate.
3.2.3 Developing Test Specifications

Making a test blueprint or Table of Specifications is the next important step that
teachers should do. The table describes the content, the behaviour of the learners
and the number of questions in the test corresponding to the number of hours
devoted to the learning objectives in class. In fact, exactly how many questions to
include in a test is based on the importance of the objectives, the type of questions,
the subject matter and the amount of time available for testing.

Table of Specifications is a plan prepared by the classroom teacher as a basis for

the test construction. It is a two-way table which describes the topics to be covered
in a test and the number of items associated with each topic. The table ensures that
a fair and representative sample of questions appear in the test. Teachers cannot
measure every topic or objective and cannot ask every question they may wish to
ask. A Table of Specifications allows the teacher to construct a test which focuses
on the key areas and weightage (in percentage) given in different sections of the
test based on their importance. A Table of Specifications provides the teacher with
evidence that a test has content validity and that it covers what should be covered.
This table also allows the teacher to view the test as a whole.
A sample of the Table of Specifications shown in Table 3.1 has content on one
column and cognitive levels across the top. However, teachers could also arrange
the content across the top and levels down the column. In this sample, the teacher
who prepared the table grouped „Remembering‰ and „Understanding‰ levels
together. It is very likely that he believed a straight recall was too simple to be
considered as real learning.
Table 3.1: A Sample of Table of Specifications
Levels
Content Remembering and Applying Analysing, Evaluating Total
Understanding (%) (%) and Creating (%) (%)
Topic 1 15 15 30 60
Topic 2 10 20 10 40
Total 25 35 40 100
For a Table of Specifications to be useful, it normally contains more details than

the sample shown in Table 3.1. The number of questions to be included in a test is
based on the importance of the objectives, the type of questions, the subject matter
and the amount of time available for testing. A more detailed Table of
Specifications is shown in Table 3.2.


In the example shown in Table 3.2, the vertical columns on the left of the 2-way
table show a list of the topics covered in class and the amount of time spent on
those topics. The topics can also be further subdivided into subtopics such as
„Subtract two numbers without regrouping 2-digit numbers from a 2-digit
number‰ under the topic „Subtraction within the range of 1000‰.
The amount of time spent in the topics as shown in the column „Hours of
Interaction‰ can be used as a basis to compute the weightage or percentage and
the number of questions or items for each topic. For example, the teacher has spent
20 hours teaching the three topics of which 6 hours are allotted to „Addition with
the highest total of 1000‰. Thus, 6 hours from a total of 20 hours amount to 30% or
9 items from the total 30 items as planned by the teacher.
The teacher might have a reason for allocating 25%, 35% and 40% for the levels
„Remembering‰, „Understanding‰ and „Applying‰ respectively. Perhaps he is
trying to train his Year 2 learners to pay more attention to the „thinking‰ questions.
The 25% of „Remembering‰ level is actually 7.5 questions, 35% of
„Understanding‰ level is 10.5 questions and 40% of „Applying‰ level is 12
questions. The total 30 is not affected as the number 7.5 and 10.5 are conveniently
rounded up to 8 and 10.
The cells in the # columns can be arbitrarily filled or computed using a simple
formula. In the first # column, the topic „Addition ...‰ under the level
„Remembering‰ should be 25%  9 = 2.25, the topic „Subtraction ...‰ under
„Remembering‰ is 25%  9 = 2.25 and the topic „Multiplication ...‰ under
„Remembering‰ is 25%  12 = 3. The teacher can either round up the numbers 2.25,
2.25, 3 to 3, 2, 3 or 2, 3, 3.
The teacher, especially one who is newly trained, is advised to have this Table of
Specifications together with the subject syllabus reviewed by the subject matter
expert or the subject Head of Department to confirm whether the test plan would
actually measure what it set out to measure. When the test items have been drafted
and assembled, it is advisable to once again submit the draft test paper and the
Table of Specifications to the Head of Department or the recognised subject matter
expert to evaluate whether the test items do, in actual fact, assess the defined
content. Content validity is different from face validity. Face validity assesses
whether the test „looks valid‰ to the examinees who sit for the test whereas content
validity requires recognised subject matter experts to evaluate whether the test
items assess the defined content.

It is obvious that the Table of Specifications benefits learners in two ways:
(a) Improves the validity of teacher-construct tests; and
(b) Able to improve learnerÊs learning.
The Table of Specifications helps to ensure that there is a match between what is
taught and what is tested. From the example, we can see that classroom assessment
is driven by classroom teaching which in turn is driven by learning objectives.
SELF-CHECK 3.1
1. What is a Table of Specifications?
2. How deos a Table of Specifications enhance the validity of a test?
3.2.4 Selecting Best Item Types

Before deciding on a particular type of test format, teachers should first establish
the following:
(a) Does the testing make sense?
(b) What is it that the teachers would like to assess?
The determination of what it is that the teachers would like to measure with the
test should precede the determination of how they are going to measure it.
Tests and examinations involve an investigation of the learnersÊ knowledge,

understanding and skills as well as an assessment of the outcomes of that
investigation. It appears that the test format used is one of the main driving factors
in learnersÊ learning behaviour. For instance, learners prepare differently for
multiple-choice tests than they do for oral examinations. Which test format is best
suited to the teachersÊ purposes depends on the learning objectives. When it comes
to learning objectives for academic training, professional skills and complex skills,
it is often difficult to select a suitable test format. Many opt for a combination of
different test formats to investigate and assess whether a learner has achieved all
of the learning objectives. Test selection usually involves a degree of compromise
as each test format has its pros and cons. Table 3.3 lists some item types and their
respective purposes.

Table 3.3: Item Types and their Respective Purposes
No. Item Type Purpose

1 Multiple-choice  Test for factual knowledge
 Assess a large number of items
 Score rapidly, accurately and objectively
2 Essay  Candidates are required to write an extended piece of
test based on a certain topic
 Assess higher-order thinking skills such as analysing,
evaluating and creating in BloomÊs taxonomy
3 Oral examination  The questions are primarily focused on knowledge,
understanding and application
 A test with closed-ended questions and open-ended
questions
3.2.5 Preparing Rubrics or Marking Schemes

Preparing a rubric or marking scheme well in advance of the testing date will give
teachers ample time to review their questions and make changes to answers when
necessary.
The teacher should make it a habit to write a model answer which can be easily
understood by others. This model answer can be used by other teachers who act
as external examiners, if need be. Besides, a rubric can also be an effective tool to
help the teacher or the external examiner.
Coordination should be done once the test scripts are collected. The teacher should
try to read some of the answers from the scripts and review the correct answers in
the marking scheme. The teacher may sometimes find that learners have
interpreted the test question in a way that is different from what is intended.
Learners may come up with excellent answers that may be slightly outside of what
was asked. Consider giving these learners marks accordingly.
The teacher should make a note in the marking scheme for any error made earlier
but carried through the answer. Marks should be deducted if the rest of the
response is sound.

A marking scheme may increase the efficiency of grading the test but they often
provide only limited information to promote learning. Besides marking scheme, a
rubric can be prepared to act as an additional tool to help the teacher or the external
examiner. Many of the limitations of a marking scheme can be overcome by having
rubrics which contain carefully considered and clearly stated descriptions of levels
of performance.
A rubric is a descriptive scoring scheme developed by teachers, and sometimes in

consultation with learners, to guide judgements with regard to the product or
process of learnersÊ assessment task.
In simpler terms, a rubric is an extended two dimensional checklist of what is

expected in assessment criteria matched against descriptions of how well the work
presents each criterion. It is particularly useful for evaluating assessment tasks
such as portfolios and projects.
General rubrics can help learners build up a concept of what it means to perform
a skill well. General rubrics do not „give away answers‰ to questions. Instead, they
contain descriptions such as „Explanation of reasoning is clear and supported by
appropriate details.‰ Descriptions like this help learners focus on what their
learning target is supposed to be. They provide clarification to learners on how to
approach the project. Rubrics will be discussed in greater detail in Topic 6 of this
module.
3.2.6 Preparing Test Items

While the different types of questions (multiple-choice, short answer, true-false,
matching and essay) are constructed differently, the following principles apply to
constructing questions and tests in general:
(a) Use simple and brief instructions for each type of question;
(b) Use simple and clear language in the questions;
(c) Write items that require specific understanding or ability developed in that
course, not just general intelligence or test-wiseness;
(d) Do not provide clues or suggest the answer to one question in the body of
another question;
(e) Avoid writing questions in the negative. If you must use negatives, highlight
them as they may mislead learners into answering incorrectly;

(f) Specify the units and precision of answers;
(g) Try, as far as possible, to construct your own questions. Check to make sure
the questions fit the learning objectives and requirements in the Table of
Specifications if you need to use questions from other sources; and
(h) If an item is revised, recheck its relevance.
Guidelines to prepare and write multiple-choice, short answer, true-false,

matching, essay, portfolio and project assignments will be discussed in Topics
4, 5, 6 and 7 respectively.
3.3 ASSESSING TEACHER’S OWN TEST

Regardless of the types of tests used, teachers can assess their effectiveness by
asking the following questions:
(a) Did I Test What I Thought I was Testing for?

If you had wanted to know whether learners could apply a concept to a new
situation but mostly asked questions requiring learners to label parts or
define terms, that means you have tested for recall instead of application.
(b) Did I Test What I Taught?

For example, your questions may have tested the learnersÊ understanding of
surface features or procedures. On the other hand, you have been lecturing
on causation or relation such as how the bones of the feet work together
when we walk.
(c) Did I Test for What I Emphasised on in Class?

Make sure that you have asked most of the questions with regard to the
material you feel is the most important, especially if you have emphasised it
in class. Avoid questions on obscure material that are weighted the same as
questions on crucial material.
(d) Is the Material I Tested for Really What I Wanted Learners to Learn?
For example, if you had wanted learners to use analytical skills such as the
ability to recognise patterns or draw inferences but only used true-false
questions requiring non-inferential recall, you might try constructing more
complex true-false, or multiple-choice questions.

Learners should know what is expected of them. They should be able to identify
the characteristics of a satisfactory answer and understand the relative importance
of those characteristics. This can be achieved in many ways. For example, you can
provide feedback on tests, describe your expectations in class or post model
solutions on a class blog. Teachers are encouraged to make notes on the test scripts.
When test scripts are returned to the learners, the notes will help them understand
their mistakes and correct them.
SELF-CHECK 3.2
1. What are some of the major steps in test planning? Elaborate.
2. Explain some of the advantages of the Table of Specifications.
3. Discuss the advantages of using rubrics and marking schemes.
 The first step in test planning is to decide on the purpose of the test. Tests can
be used for many different purposes.
 The next step is to consider the learning objectives and the relative importance
of the learning objectives. Teachers will have to select the appropriate
knowledge and skills to be assessed and include more questions for more
important learning objectives.
 The learning objectives that teachers would like to emphasise on will

determine not only what material to include in the test but also the specific
format of the test.
 Making a test blueprint or Table of Specifications is the next important step

that teachers should do.
 The Table of Specifications describes the content, the behaviour of the learners,
the number of questions in the test corresponding to the number of hours
devoted to the learning objectives in class.
 The Table of Specifications helps to ensure that there is a match between what
is taught and what is tested. Classroom assessment is driven by classroom
teaching which in turn is driven by learning objectives.

 The test format used is one of the main driving factors in the learnersÊ learning
behaviour.
 An objective test is a written test consisting of items or questions which

requires the respondent to select from a list of possible answers.
 Preparing a rubric or marking scheme well in advance of testing date will give
teachers ample time to review their questions and make changes to answers
when necessary.
Learning objectives Rubrics

Marking schemes Table of Specifications
Purpose of test Test items

Topic  Constructing
Objective Test
4 Items
LEARNING OUTCOMES
1. Define objective tests;
2. Differentiate between different types of objective tests;
3. Discuss the advantages of using different types of objective tests;
4. Identify the weaknesses of using the different types of objective tests;
and
5. Explain the techniques for constructing the different types of
objective test items.
 INTRODUCTION
In this topic we will focus on using objective tests to assess various kinds of
behaviour in the classroom. Firstly, the discussion will be limited to the simple
forms of objective test items, namely short-answer item, true-false item and
matching item. Three types of objective tests are examined and guidelines for the
construction of each type of the tests are discussed. The advantages and limitations
of each of these types of objective tests are explained. Secondly, we will discuss the
multiple-choice item, a more complex form of objective test items. The discussion
will focus on the characteristics and uses of multiple-choice items, their advantages
and limitations and some suggestions for the construction of such items.

TOPIC 4 CONSTRUCTING OBJECTIVE TEST ITEMS  55
4.1 WHAT IS AN OBJECTIVE TEST?

When objective tests were first used in 1845 by George Fisher in the United States,
it was not well received by the society. However, over the years it gained
acceptance and was widely used. Today it is perhaps the most popular format of
assessing various types of human abilities, competencies and socio-emotional
attributes. Objective tests are extensively used in schools, industries, businesses,
professional organisations, universities and colleges. Can you guess why?
What is an objective test? An objective test is a written test consisting of items or

questions which require the respondent to answer them by supplying a word,
phrase or symbol or by selecting from a list of possible answers. The former test
items are often referred to as supply-type test items while the latter is referred to
as selection-type test item. The word objective means „accurate‰. An objective item
or question is „accurate‰ because it cannot be influenced by the personal
preferences and prejudices of the marker. In other words, it is not „subjective‰ and
not open to varying interpretations. This is one of the reasons why the objective
test is popular in measuring human abilities, competencies and many other
psychological attributes such as personality, interests and attitudes. Among the
common objective test items are as shown in Figure 4.1.
Figure 4.1: Common objective test items

56  TOPIC 4 CONSTRUCTING OBJECTIVE TEST ITEMS
4.2 SHORT-ANSWER QUESTIONS

A short-answer question is basically a supply-type item. It exists in two formats,
namely direct question and completion question formats. Figure 4.2 shows
examples of short-answer questions:
Figure 4.2: Examples of short-answer questions
You can refer to Linn and Gronlund (1995) for more examples.
Now, let us look at the advantages and the limitations of this type of question.
The short-answer question is generally used to measure simple learning outcomes.

It is used almost exclusively to measure memorised information (except for the
problem-solving outcomes measured in Mathematics and Science). This has partly
made the short-answer question one of the easiest to construct.
Another advantage of the short-answer question is that the possibility of guessing

which often occur in the supply-type item can be reduced. In this case, learners
must supply the correct answer when they respond to the short-answer question.
They must either recall the information asked for or make the necessary
computations to obtain the answer. They cannot rely on their partial knowledge to
choose the correct answer from the list of alternatives.
Many short-answer questions can be set for a specific period of time. A test paper
consisting of short-answer questions is thus able to cover a fairly wide content of
the course to be assessed. A wide content coverage enhances the content validity
of the test.

One major limitation of the short-answer question is that it cannot be used to

measure complex learning outcomes such as organising ideas, presenting an
argument or evaluating information. What is required of learners is providing a
word, phrase, number or symbol.
Scoring of answers to the short-answer question can also pose a problem. Unless
the question is carefully phrased, learners can provide answers of varying degree
of correctness. For example, answer to a question such as „When was Malaysia
formed?‰ could either be „In 1963‰ or „On 16 September 1963‰. The teacher has to
decide whether learners who gave the partial answer have the same level of
knowledge as those who provided the complete answer. Besides, learnersÊ
answers can also be contaminated by spelling errors. If spelling is taken into
consideration, the test scores of learners will reflect their level of knowledge of the
content assessed as well as their spelling ability. If spelling is not considered in the
scoring, the teacher has to decide whether the misspelled word actually represents
the correct answer.
Although the construction of short-answer questions is comparatively easier than

other types of questions, they have variety defects which should be avoided to
ensure that they will function as intended. The following are some guidelines for
the construction of short-answer questions:
(a) Word the question so that the intended answer is brief and specific.
As far as possible, the question should be phrased in such a way that only
one answer is correct.
For Example:
Poor item: An animal that eats the flesh of other animals is

_____________.
(Possible answer: a wolf, a lion, hungry, ⁄)
Better item: An animal that eats the flesh of other animals is classified as
___________.
(One specific answer: carnivorous)

(b) Use direct questions instead of incomplete statements.

Direct questions are more natural and usually better structured.
For Example:
Poor item: Columbus discovered America in ________.
Better item: In what year did Columbus discover America? (1492)
Possible answers for 1st item: a boat, the 15th century, a search for India
(c) If the problem requires a numerical answer, indicate the units in which the
answer is to be expressed.
For Example:
(d) Do not include too many blanks for the completion item.
Blanks for answers should be equal in length. For the completion item, place
the blank near the end of the sentence.
For Example:
SELF-CHECK 4.1
1. What are some advantages of short-answer questions?
2. Describe some limitations of short-answer questions.
3. Suggest other advantages and weaknesses of using short-answer

questions.

4.3 TRUE-FALSE QUESTIONS

In the most basic format, true-false questions are those in which a statement is
presented and the learner indicates in some manner whether the statement is true
or false. In other words, there are only two possible responses for each item, and
the learner chooses between them. A true-false question is a specialised form of
the multiple-choice format in which there are only two possible alternatives. These
questions can be used when the test designer wishes to measure a learnerÊs ability
to identify whether statements of fact are accurate or not. True-false questions can
be used for testing knowledge and judgement in many subjects. When grouped
together, a series of true-false questions on a specific topic or scenario can test a
more complex understanding of an issue. They can be structured to lead a learner
through a logical pathway (Brown, 1997) and can reveal part of the thinking
process employed by the learner in order to solve a given problem.
For Example:
True False
A whale is a mammal because it gives birth to its young.
True-false questions can be quickly written and can cover a lot of content. True-
false questions are well suited for testing learner recall or comprehension. Learners
can generally respond to many questions, covering a lot of content in a fairly short
amount of time. From the teacherÊs perspective, these questions can be written
quickly and are easy to score. Because they can be objectively scored, the scores
are more reliable than for items that are at least partially dependent on the
teacherÊs judgment. Generally, they are easier to construct compared to multiple-
choice questions because there is no need to develop distractors. Hence, they are
less time consuming compared to constructing multiple-choice questions.
4.3.1 Limitations of True-false Questions

However, true-false questions do have a number of limitations:
(a) Guessing – A learner has a one in two chance of guessing the correct answer
of a question. Scores on true-false items tend to be high because of the ease
of guessing the correct answers when the answer is not known. With only
two choices (true or false) the learner could expect to guess correctly on half
of the items for which correct answers are not known. Thus, if a learner
knows the correct answers to 10 questions out of 20 and guesses on the other

10, the learner could expect a score of 15. The teacher can anticipate scores
ranging from approximately 50 per cent for a learner who did nothing but
guess on all items to 100 per cent for a learner who knows the material.
(b) Because these items are in the form of statements, there is sometimes a
tendency to take quotations from the text, expecting the learner to recognise
a correct quotation or note a change (sometimes minor) in wording. There
may also be a tendency to include trivial or inconsequential material from
the text. Both of these practices are discouraged.
(c) It can be difficult to write a statement which is unambiguously true or false,

particularly for complex materials
(d) True-false items provide little diagnostic information. Teachers can often get
useful information about learner errors and misconceptions by examining
learnersÊ incorrect answers but true-false items do not provide such
diagnostic information.
(e) True-false items may produce a negative suggestion effect. Some testing
experts feel that exposing false statements might promote learning false
information.
(f) False statements do not provide evidence that learners know the correct
answer.
4.3.2 Suggestions for Construction of True-false

Questions
The following are suggestions for construction of true-false questions:
(a) Include only one main idea in each question.
For Example:
Poor item: The study of biology helps us understand living organisms

and predict the weather.
Better item: The study of biology helps us understand living organisms.

(b) Use negative statements sparingly but avoid double negatives. Double
negatives tend to contribute to the ambiguity of the statement. Statements
with words like none, no and not should be avoided as far as possible.
For Example:
Poor item: None of the steps in the experiment were unnecessary.
Better item: All the steps in the experiment were necessary.
(c) Avoid broad, general statements. Most broad generalisations are false unless
qualified.
For Example:
Poor item: Short-answer questions are more favourable than essay

questions in testing.
Better item: Short-answer questions are more favourable than essay

questions in testing factual information.
(d) Avoid long complex sentences. Such sentences also test reading
comprehension besides the achievements to be measured.
For Example:
Poor item: Despite the theoretical and experimental difficulties of

determining the exact pH value of a solution, it is possible
to determine whether a solution is acidic by the red colour
formed on litmus paper when inserted into the solution.
Better item: Litmus paper turns red in an acidic solution.
(e) Try using in combination with other materials such as graphs, maps and
written material. This combination allows for the testing of more advanced
learning.
(f) Avoid lifting statements directly from assigned reading, notes or other
course materials so that recall alone will not permit a correct answer.

(g) In general, avoid the use of words which would signal the correct response
to the test-wise learner. Absolutes such as „none‰, „never‰, „always‰, „all‰
and „impossible‰ tend to be false while qualifiers such as „usually‰,
„generally‰, „sometimes‰ and „often‰ are likely to be true.
(h) A similar situation occurs with the use of "can" in a true-false statement. If
the learner knows of a single case in which something „can‰ be done, it
would be true.
(i) Ambiguous or vague statements and terms such as "large", "long time",
"regularly", "some" and "usually" are best avoided in the interest of clarity.
(j) Some terms have more than one meaning and may be interpreted differently
by individuals.
(k) True statements should be about the same length as false statements (there is
a tendency to add details in true statements to make them more precise).
(l) Word the statement so precisely that it can be judged unequivocally true or
false.
(m) Statements of opinion should be attributed to some source.
(n) Avoid verbal clues (specific determiners) that indicate the answer.
(o) Test important ideas rather than trivial information.
(p) Do not present items in easily learned pattern.
SELF-CHECK 4.2
1. What are some advantages of true-false questions?
2. Describe some limitations of true-false questions.
3. Suggest other advantages and weaknesses of using true-false

questions.

4.4 MATCHING QUESTIONS

What are matching questions?
Matching questions are used in measuring a learnerÊs ability to identify the

relationship between two lists of terms, phrases, statements, definitions, dates,
events, people and so forth.
For Example:
Directions: Column A contains statement describing selected Asian cities.
For each description find the appropriate city in Column B. Each city in Column
B can be used only once.
Column A Column B
1. The ancient capital of Thailand A. Ayuthia
2. The largest city in Sumatera B. Ho Chi Minh City
3. The capital of Myanmar C. Karachi
4. Formerly known as Saigon D. Medan
One matching question can replace several true-false questions. In developing

matching questions, you have to identify two columns of materials listed
vertically. The items in Column A (or I) are usually called premises and assigned
numbers (1, 2, 3, ⁄) while items in Column B (or II) are called responses and
designated capital letters (A, B, C, ....) Have more responses than premises or allow
responses to be used more than once.
The learner reads a premise (Column A) and finds the correct response from
among those in Column B. The learner then prints the letter of the correct response
in the blank beside the premise in Column A. An alternative is to have the learner
draw a line from the correct response to the premise but this is more time
consuming to score. One of the ways to reduce the possibility of guessing the
correct answers is to list a larger number of responses (Column B) than premises
(Column A) as is done in the example. Another way to decrease the possibility of
guessing is to allow responses to be used more than once. Instructions to the
learners should be very clear about the use of responses.

Some psychometricians suggest there should be no more than five to eight

premises (Column A) in one set. For each premise, the learner has to read through
the entire list of responses (or those still unused) to find the matching response.
For this reason, the shorter elements should be in Column B rather than Column
A to minimise the amount of reading needed for each item (Alabama Department
of Education, 2001). Responses (Column B) should be listed in logical order, if there
is one (for example, chronological order, by size). If there is no apparent order, the
responses should be listed alphabetically. Premises (Column A) should not be
listed in the same order as the responses. Care must be taken to ensure that the
association keyed as the correct response is unquestionably correct and that the
numbered item could not be rightly associated with any other choice.
4.4.1 Advantages of Matching Questions

Now, let us look at the advantages of matching questions:
(a) Matching questions are particularly good at assessing a learnerÊs

understanding of relationships. They can test recall by requiring a learner to
match the following elements (McBeath, 1992):
(i) Definitions – terms;
(ii) Historical events – dates;
(iii) Achievements – people;
(iv) Statements – postulates; and
(v) Descriptions – principles.
(b) They can also assess a learnerÊs ability to apply knowledge by requiring a
test-taker to match the following:
(i) Examples – terms;
(ii) Functions – parts;
(iii) Classifications – structures;
(iv) Applications – postulates; and
(v) Problems – principles.

(c) Matching questions are really a variation of the multiple-choice format. If

you find that you are writing multiple-choice questions which share the same
answer choices, you may consider grouping the questions into a matching
item.
(d) Matching questions are generally easy to write and score when the content
tested and objectives are suitable for matching question.
(e) Highly efficient as large amount of knowledge can be sampled in a short

period of time.
4.4.2 Limitations of Matching Questions

How about the limitations? Let us look at the following:
(a) Matching questions are limited to the materials that can be listed into two
columns and there may not be much material that lends itself to such a
format;
(b) If there are four items in a matching question and the learners know the
answers for three of them, the fourth item is a give-away through
elimination;
(c) Difficult to differentiate between effective and ineffective items;
(d) Often leads to testing of trivial facts or bits of information; and
(e) Often criticised for encouraging rote memorisation.
4.4.3 Suggestions for Writing Good Matching

Questions
Here are suggestions for writing good matching questions:
(a) Provide clear instructions. They should explain how many times responses
can be used;
(b) Keep the information in each column as homogeneous as possible;
(c) Include more responses than premises or allow the responses to be used
more than once;

(d) List the items with more words in Column A;
(e) Correct answers should not be obvious to those who do not know the content
being taught;
(f) There should not be keywords appearing in both a premise and response,
providing a clue to the correct answer; and
(g) All of the responses and premises for a matching item should appear on the
same page.
SELF-CHECK 4.3
1. What are some advantages of matching questions?
2. Describe some limitations of matching questions.
3. Suggest other advantages and weaknesses of using matching

questions.
ACTIVITY 4.1
1. Select five true-false questions in your subject area and analyse each
item using the guidelines discussed.
2. Select five matching questions in your subject area and analyse each
3. Suggest how you would improve the weak items for each type of
questions that you have identified.
Discuss your answers with your course mates.
4.5 MULTIPLE-CHOICE QUESTIONS

Multiple-choice questions or MCQs are widely used in many different settings
because it can be used to measure low level cognitive outcomes as well as more
complex cognitive outcomes. It is challenging to write test items that can tap into
higher-order thinking. Good item writing requires a well-trained test writer.
Above all, test writers need to have expertise in the subject area being tested so
that they can gauge the difficulty and content coverage of test items.

MCQs are the most difficult to prepare. These questions have two parts, namely a
stem which contains the question and four or five options which contains the
correct answer. The correct answer is called the keyed response and the incorrect
options are called distractors. The stem may be presented as a question, direction
or a statement while the options could be a word, phrase, numbers, symbols and
so forth. Cruel as it may seem, the role of the distractor is to attract the attention of
respondents who are not sure of the correct answer.
4.5.1 Parts of a Multiple-choice Question

A traditional multiple-choice question (MCQ) or item is one in which a learner
chooses one answer from a number of choices supplied (illustrated in Figure 3.4).
Figure 4.3: Example of a multiple-choice question (MCQ)
Now, let us look at what an MCQ consists of. It has the stem and also the options.
(a) Stem
The stem should:
(i) Be in the form of a question or a statement to be completed;
(ii) Be expressed clearly and concisely, avoiding poor grammar, complex

syntax, ambiguity and double negatives;
(iii) Generally present a positive question. (If a negative is used it should be

emphasised with italics or underlining);
(iv) Generally ask for one answer only (the correct or the best answer); and
(v) Include as many of the words common to all alternatives as possible.

(b) Options (or Alternatives)

The options should:
(i) Have either four or five alternatives, all of which should be mutually
exclusive and not too long;
(ii) Follow grammatically from the stem and be parallel in grammatical

form;
(iii) Be expressed simply enough to make clear the essential differences

between them and must be unambiguous; and
(iv) Contain the intended answer or the keyed response and it should
appear to be clearly correct to the informed but it should be definitely
incorrect but plausible to the distractors.
SELF-CHECK 4.4
1. Why is MCQ a popular form of objective test?
2. Describe some weaknesses of using MCQs.
4.5.2 Construction of Multiple-choice Questions

Test writing is a profession. By that we mean that good test writers are
professionally trained in designing test items. Test writers have knowledge of the
rules of constructing items and at the same time they have the creativity to
construct items that capture learnersÊ attention. Test items need to be succinct but
yet clear in meaning.
As stated earlier, MCQs are the most difficult to prepare. We need to focus on
writing the stem as well as providing the options or alternatives. All the options in
multiple-choice items need to be plausible but they also need to separate learners
of different ability levels. Table 4.1 shows some considerations that need to be
taken into account when constructing MCQs, particularly the stems.

Table 4.1: Considerations When Constructing MCQ Stems
Considerations When Example

Writing a Stem Poor Item Better Item
(a) Present a single, World War II was: In which of these time
definite statement A. The result of the failure of period was World War II
to be completed or the League of Nations. fought?
answered by one B. Horrible. A. 1914–1917
of the several given B. 1929–1934
C. Fought in Europe, Asia and
choices.
Africa. C. 1939–1945
D. Fought during the period of D. 1951–1955
1939–1945.
Note: In the Poor Item, there is no sense of what the question
is asking from the stem. The Better Item more clearly identifies
the question and offers the learner a set of homogeneous
choices.
(b) Avoid unnecessary For almost a century, the Rhine Which of the following
and irrelevant River has been used by would be the most dramatic
material. Europeans for a variety of result if, because of diesel
purposes. However, in recent pollution from ships, the
years, the increased river traffic Rhine River was closed to
has resulted in increased levels all shipping?
of diesel pollution in the A. Increased prices for
waterway. Which of the Ruhr products.
following would be the most B. Shortage of water for
dramatic result if, because of the Italian industries.
pollution, the Council of
C. Reduced
Ministers of the European
competitiveness of the
Community decided to close the
French Aerospace
Rhine River to all shipping?
Industry.
A. Closure of the busy Rhine
D. Closure of the busy
River ports of Rotterdam,
river Rhine ports of
Marseilles and Genoa.
Rotterdam, Marseilles
B. Increased prices for Ruhr and Genoa.
products.
C. Reduced competitiveness of
the French Aerospace
Industry.
D. Shortage of water for Italian
industries.
Note: The Poor Item was too wordy and contained unnecessary
material.
Source: McKenna and Bull (1999). Designing effective
objective questions. Loughborough University

(c) Use clear, straight As the level of fertility A major decline in fertility
forward language approaches its nadir, what is the in a developing nation is
in the stem of the most likely ramification for the likely to produce
item. Questions citizenry of a developing nation? A. a decrease in the labour
that are A. A decrease in the labour force participation rate
constructed using force participation rate of of women.
complex wordings women. B. a downward trend in
may become a test B. A downward trend in the the youth dependency
of reading youth dependency ratio. ratio.
comprehension
C. A broader base in the C. a broader base in the
rather than an
population pyramid. population pyramid.
assessment of
whether the D. An increased infant D. an increased infant
learner knows the mortality rate. mortality rate.
subject matter. Note: In the improved item, the word „nadir‰ is replaced with
„decline‰ and „ramification‰ is replaced with „produce‰
which are more straight forward words.
(d) Use negatives Which of the following is not a Which of the following is a
sparingly. If symptom of osteoporosis? symptom of osteoporosis?
negatives must be A. Decreased bone density. A. Hair loss.
used, capitalise, B. Frequent bone fractures. B. Painful joints.
underscore or
C. Raised body temperature. C. Raised body
bold.
D. Lower back pain. temperature.
D. Decreased bone density.
Note: The Better Item is stated in the positive so as to avoid
use of the negative „not‰.
(e) Put as much of the Theorists of pluralism have Theorists of pluralism have
question in the asserted which of the following? asserted that the
stem as possible, A. The maintenance of maintenance of democracy
rather than democracy requires a large requires
duplicating middle class. A. a large middle class.
material in each of B. The maintenance of B. the separation of
the options. democracy requires governmental powers.
autonomous centres of C. autonomous centres of
countervailing power. countervailing power.
C. The maintenance of D. the existence of a
democracy requires the multiplicity of religious
existence of a multiplicity of groups.
religious groups.
D. The maintenance of
democracy requires the
separation of governmental
powers.

Note: In the Better Item, the phrase „maintenance of

democracy‰ is included in the stem so as not to duplicate
material in each option.
Source: McKenna and Bull (1999). Designing effective
objective questions. Loughborough University
(f) Avoid giving away A fertile area in the desert in A fertile area in the desert in
the answer because which the water table reaches which the water table reaches
of grammatical the ground surface is called the ground surface is called
cues. an a/an
A. mirage. A. lake.
B. oasis. B. mirage.
C. water hole. C. oasis.
D. polder. D. polder.
Note: The Poor Item uses the article „an‰ which identifies
choice C as the correct response. Ending the stem with „a/an‰
improves the item.
(g) As far as possible Which of the following men
avoid asking an contributed most towards the
opinion. defeat of HitlerÊs Germany in
World War II?
A. F. D. Roosevelt
B. George Patton
C. Josef Stalin
D. Winston Churchill
(h) Avoid using ALWAYS and NEVER in the stem as test-wise learners are likely to
rule such universal statements out of consideration.
ACTIVITY 4.2
1. Select ten MCQs in your subject area and analyse the stem of each
2. Suggest how you would improve the poor item.
Discuss your answers with your course mates.

Now, let us look at Table 4.2 which shows some considerations when constructing
the distractors for MCQs.
Table 4.2: Considerations When Constructing Distractors for MCQs
Considerations When Example

Writing Distractors Poor Item Better Item
(a) For single response What is the main source of What is the main source of
MCQs, ensure that pollution of Malaysian rivers? pollution of Malaysian
there is only one A. Land clearing rivers?
correct response. A. Open burning
B. Open burning
C. Solid waste dumping B. Coastal erosion
D. Coastal erosion C. Solid waste dumping
D. Carbon dioxide
emission
Note: In the Poor Item, both options A and C could be
considered to be correct.
(b) Use only plausible Who was the third Prime Who was the third Prime
and attractive Minister of Malaysia? Minister of Malaysia?
alternatives as A. Hussein Onn A. Hussein Onn
distractors.
B. Ghafar Baba B. Abdullah Badawi
C. Mahathir Mohamad C. Mahathir Mohamad
D. Musa Hitam D. Abdul Razak Hussein
Note: In the Poor Item, B and D are not serious distractors.
(c) When writing distractors, if possible, avoid the choices „All of the above‰ and „None
of the above‰. If you do include them, make sure that they appear as correct answers
some of the time.
It is tempting to resort to these alternatives but their use can be flawed. To begin with,
they often appear as an alternative that is not the correct response. If you do use them,
be sure that they constitute the correct answer some of the time. An „all of the above‰
alternative could be exploited by a test-wise learner who will recognise it as the correct
choice by identifying only two correct alternatives. Similarly, a learner who can
identify one wrong alternative can then also rule out this response. Clearly, the
learnerÊs chance of guessing the correct answer improves as they employ these
techniques. Although a similar process of elimination is not possible with „none of the
above‰, it is the case that when this option is used as the correct answer, the question
is only testing the learnersÊ ability to rule out wrong answers but this does not
guarantee that they know the correct one. (Gronlund, 1988).

(d) Distractors based on common learner errors or misconceptions are very effective.
One technique for compiling distractors is to ask learners to respond to open-ended
short-answer questions, perhaps as formative assessments. Identify which incorrect
responses appear most frequently and use them as distractors for a multiple-choice
version of the question.
(e) Do not create distractors that are so close to the correct answer that they may confuse
learners who really know the answer to the question. „Distractors should differ from
the key in a substantial way, not just in some minor nuances of phrasing or
emphasis.‰ (Isaacs, 1994)
(f) Provide a sufficient number of distractors.
You will probably choose to use three, four or five alternatives in a MCQ. Until
recently, it was thought that three or four distractors were necessary for the item
to be suitably difficult. However a study by Owen and Freeman suggested that
three choices are sufficient (Brown, 1987). Clearly the higher the number of
distractors, the less likely it is for the correct answer to be chosen through guessing
(provided all alternatives are of equal difficulty).
SELF-CHECK 4.5
1. Do you agree that teachers should not use negatives in the stems of
MCQs? Justify your answer.
2. Do you agree that teachers should avoid using distractors such as
„All of the above‰ and „None of the above‰? Justify your answer.
3. Select ten MCQs in your subject area and analyse the distractors of
each item using the guidelines discussed. Suggest how you would
improve the weak items.
4.5.3 Advantages and Limitations of MCQs

MCQs are widely used to measure knowledge outcomes and various types of
learning outcomes. They are popular because of the following reasons:
(a) Learning outcomes from simple to complex can be measured;
(b) Highly structured and clear tasks are provided;
(c) A broad sample of achievement can be measured;

(d) Incorrect alternatives or options provide diagnostic information;
(e) Scores are less influenced by guessing than true-false items;
(f) Scores are more reliable than subjectively scored items (such as essays);
(g) Scoring is easy, objective and reliable;
(h) Item analysis can reveal how difficult each item was and how well it
discriminates between strong and weaker learners in the class;
(i) Performance can be compared from class to class and year to year;
(j) Can cover a lot of material very efficiently (about one item per minute of
testing time); and
(k) Items can be written so that learners must discriminate among options that
vary in degree of correctness.
While there are many advantages of using MCQs, there are also many limitations
in using such items, namely:
(a) Constructing good items is time consuming;
(b) It is frequently difficult to find plausible distractors;
(c) MCQs are not effective for measuring problem-solving skills as well as the
ability to organise and express ideas;
(d) Scores can be influenced by reading ability;
(e) There is a lack of feedback on individual thought processes. It is difficult to

determine why individual learners select incorrect responses;
(f) Learners can sometimes read more into the question than was intended;
(g) It often focuses on testing factual information and fails to test higher levels
of cognitive thinking;
(h) Sometimes there is more than one defensible „correct‰ answer;
(i) It places a high degree of independence on learnersÊ reading ability and the
constructorÊs writing ability;
(j) Does not provide a measure of writing ability; and
(k) May encourage guessing.

Now, let us look at Table 4.3 which summarises the procedural rules for the
construction of MCQs.
Table 4.3: Procedural Rules for Constructing Multiple-choice Questions (MCQs)
Some Procedural Rules for the Construction of Multiple-choice Questions
 Test for important or significant  Avoid giving clues through the use of
information. faulty grammatical construction.
 Avoid trick items.  Avoid the use of humour when
 Keep the vocabulary consistent developing options.
with the learnersÊ level of  Present practical or real-world
understanding. situations to learners.
 Avoid overly specific knowledge  Use pictorial materials that require
when constructing items. learners to apply principles and
 Avoid items based on opinions. concepts.
 Be sensitive to cultural, religious  Avoid textbook, verbatim phrasing
and gender issues. when developing items.
 Keep options or alternatives  Use charts, tables or figure that
independent and not overlapping. require interpretation.
 Avoid distractors that can provide
clue to test-wiseness.
4.5.4 Constructing MCQs to Measure Higher-Order

Thinking (HOT)
Although it is not easy to construct higher-order thinking (HOT) MCQs, learners
can try the following techniques to write HOT questions:
(a) Scenario-based Questions

One of the best ways to promote and assess HOT skill is to use scenario-
based questions, particularly those that simulate real work experiences. A
scenario is a snapshot of an event, generally providing a brief overall
description of a problem that you need to solve or give advice on.
For Example:
Read the following comment that a teacher made about testing and then
answer the question.
„Learners go to school to learn, not to take tests. In addition, tests

cannot be used to indicate a learnerÊs absolute learning. All tests can do
is to rank learners in order of achievement and this relative ranking is

influenced by guessing, bluffing and the subjective opinions of the

teacher who is scoring. The teaching and learning process would be
beneficial if we do away with tests and depend on learner self-
evaluation.‰
Which of the following types of test is this teacher primarily talking about?
A. Aptitude test
B. Diagnostic test
C. Formative test
D. Summative test*
Adapted from:
https://www.learningsolutionsmag.com/articles/804/writing-
multiple-choice-questions-for-higher-level-thinking
(b) Interpret and Analyse Visuals

You can also assess critical thinking skills by asking learners to analyse or
interpret information from visuals, which are provided as part of the
question stem or the answer choices. In many cases, visuals such as job aids,
diagrams and graphs simulate workplace tasks.
For Example:
Table 4.1: Changes in Weight of P#1 and P#2

Table 4.2: Changes in Glucose Level of P#1 and P#2
Two 60-year-old male patients (P#1 and P#2) have Type 2 diabetes. Each has
a BMI of 27. The primary treatment for each patient is a diet to reduce blood
glucose levels.
What is the most likely reason why P#2 did not show a decline in glucose
level after three months?
A. P#1 may have exercised more than P#2.
B. P#2 may have a more resistant form of diabetes.
C. P#2 probably leads a more sedentary life than P#1.*
D. P#1 lost more weight on the glucose reduction diet.
Adapted from:
http://digitalcommons.hsc.unt.edu/cgi/viewcontent.cgi?article=100
9&context=test_items
(c) Distractors and Key Context Reasons in Complex MCQs

Another approach to measuring critical or creative thinking is to ask learners
to synthesise what they have learned into an explanation. The possible
responses include the answer and a variety of reasons that support the
answer. Of course, only one reason is logical and correct, based on the
knowledge and skills being assessed. A context is also provided when this
technique is used.

For Example:
Both Su and Ramya want to lose weight. Su goes on a low carbohydrate diet
while Ramya goes on a vegan diet. After six months, Su lost 30 pounds and
Ramya lost 15 pounds.
In relation to losing weight, which of the following conclusions is supported?
A. A low carbohydrate diet is more effective than a vegan diet.*
B. Additional information is needed before making any conclusions.
C. A low carbohydrate diet is easier to maintain than a vegan diet.
D. A vegan diet contains more calories than a low carbohydrate diet.
Adapted from:
http://digitalcommons.hsc.unt.edu/cgi/viewcontent.cgi?article=100
9&context=test_items
Nitko (2004) refers to this type of questions as context-dependent items. The

context is presented in some form of introductory material which may
include graphs, charts, drawings, formulae, extracts from reading materials,
maps and list of words. In crafting this type of questions, it is important
that learners are required to think about and use the information in the
introductory material to answer the questions to solve the problems or
complete the assessment tasks. The examples given previously refer to
multiple-choice items. However, context-dependent items may be used with
any type of item format such as essay, true-false and matching items.
SELF-CHECK 4.6
1. What are some advantages of MCQs?
2. List some limitations of MCQs.
3. Suggest other advantages and weaknesses of using MCQs.

 An objective test is a written test consisting of items or questions which

requires the respondent to select the answer from a list of possible answers.
 An objective item or question is „accurate‰ because it cannot be influenced by

the personal preferences and prejudices of the marker.
 Objective tests vary depending on how the questions are presented. The four
common types of questions used in most objective tests are short-answer
questions, matching questions, true-false questions and multiple-choice
questions (MCQs).
 The two forms of short-answer questions are direct questions and completion
questions.
 True-false questions are those in which a statement is presented and the learner
indicates in some manner whether the statement is true or false.
 True-false questions can be written quickly and are easy to score. Because they
can be objectively scored, the scores are more reliable than for items that are at
least partially dependent on the teacherÊs judgment.
 Avoid lifting statements directly from assigned reading, notes or other course
materials so that recall alone will not permit a correct answer.
 Matching questions are used to measure a learnerÊs ability to identify the

relationship between two lists of terms, phrases, statements, definitions, dates,
events, people and so forth.
 To reduce the possibility of guessing correct answers in matching questions,

list a larger number of responses than premises and allow responses to be used
more than once.
 MCQs have two parts: a stem that contains the question and four or five
options that contains the correct answer called the keyed response and
incorrect options called distractors.

 MCQs are widely used because they can be used to measure learning
outcomes, from simple to complex. They are highly structured with clear tasks
provided and able to test a broad sample of achievement.
 MCQs are difficult to construct, tend to measure low level learning outcomes,
lend themselves to guessing and do not measure writing ability.
 It is not impossible to construct higher-order thinking skills (HOTS) MCQs.
Alternatives Objective tests

Content coverage Options
Distractors Premises
Guessing Responses
Higher-Order Thinking Skills (HOTS) Short-answer questions
Matching questions Stem
Multiple-choice questions (MCQs) True-false questions

Topic  Constructing
Essay
5 Questions
LEARNING OUTCOMES
1. Identify the attributes of an essay question;
2. Explain the purpose of using the essay test;
3. Describe the advantages and limitations of essay questions;
4. Identify learning outcomes that are appropriately assessed using
essay questions; and
5. Construct well-written essay questions which assess learning
outcomes.
 INTRODUCTION
In Topic 4, we discussed in detail the use of objective tests in assessing learners. In
this topic, we will examine the essay test. The essay test is a popular technique for
assessing learning and is used extensively at all levels of education. It is also widely
used in assessing learning outcomes in business and professional examinations.
Essay questions are used because they challenge learners to create their own
responses rather than merely selecting a response. Essay questions have the
potential to reveal learnersÊ ability to reason, create, analyse and synthesise. These
skills may not be effectively assessed using objective tests.

82  TOPIC 5 CONSTRUCTING ESSAY QUESTIONS
5.1 WHAT IS AN ESSAY QUESTION?

According to Stalnaker (1951), an essay question is „a test item which requires a
response composed by the examinee usually in the form of one or more sentences
of a nature that no single response or pattern of responses can be listed as correct,
and the accuracy and quality of which can be judged subjectively only by one
skilled or informed in the subject‰. Although the definition was provided a long
time ago, it is still a comprehensive definition.
Elaborating on this definition, Reiner, Bothell, Sudweeks and Wood (2002), argued
that to qualify as an essay question it should meet the following four criteria:
(a) The learner has to compose rather than select his response or answer. In
essay questions, learners have to construct their own answer and decide on
what material to include in their response. Objective test questions (MCQ,
true-false, matching) require learners to select the answer from a list of
possibilities.
(b) The response or answer provided by the learner consists of one or more
sentences. Learners do not respond with a „yes‰ or „no‰ response but instead
have to respond in the form of sentences. In theory there is no limit to the
length of the answer. However, in most cases its length is predetermined by
the demands of the question and the time limit allotted for the question.
(c) There is no one single correct response or answer. In other words, the
question should be composed so that it does not ask for one single correct
response. For example, the question, „Who killed J. W. W. Birch?‰ assesses
verbatim recall or memory and not the ability to think. Hence, it cannot
qualify as an essay question. You could modify the question to „Identify who
killed J.W.W. Birch and explain the factors that led to the killing.‰ Now, it is
an essay question that assesses learnersÊ ability to think and give reasons for
the killing supported with relevant evidence.
(d) The accuracy and quality of learnersÊ responses or answers to essays must be
judged subjectively by a specialist in the subject matter. The nature of essay
questions is such that only specialists in the subject matter can judge to what
degree responses (or answers) to an essay is complete, accurate and relevant.
Good essay questions encourage learners to think deeply about their answers
which can only be judged by someone with the appropriate experience and
expertise in the content area. Thus, content expertise is essential for both
writing and grading essay questions. For example, the question „List three
reasons for the opening of Penang by the British in 1789‰ requires learners to
recall a set list of items. The person marking or grading the essay does not

TOPIC 5 CONSTRUCTING ESSAY QUESTIONS  83
have to be a subject matter expert to know if the learner has listed the three
reasons correctly as long as the list of three reasons is available as an answer
key. For the question „To what extent is commerce the main reason for the
opening of Penang by the British in 1789?‰ a subject-matter expert is needed
to grade or mark an answer to this essay question.
5.2 TYPES OF ESSAYS

Generally, there are two types of essays used in educational institutions, namely
coursework essay and examination essay.
(a) Coursework Essay:

A coursework essay may consist of the following:
(i) Essay Outlines

These is a short summary or outline of a particular topic (perhaps,
about 500 words). It assesses the ability to organise material, to
construct coherent arguments and to select relevant information from
a wide field of study.
(ii) Standard Essays

These are full papers of any topic from 1000 to 2500 words in length.
They assess the ability to describe and analyse the relationship between
ideas and events, to give a coherent account of a topic, to select and
weigh evidence in support of an argument, to diagnose and suggest
solutions to problems, to solve familiar types of problems, to express
critical judgements and to make comparisons.
(iii) Extended Essays

These are full papers of between 2500 to 5000 words in length. They
assess the ability to solve less familiar problems and to analyse or
critically evaluate less familiar materials, hence, requiring more
extensive preparation or research.
Coursework essays are intended to assess learnersÊ ability to research a topic

thoroughly and meticulously, and to handle a mass of material. It evaluates
their ability to answer at length when „given enough rope‰ on a selected
topic. Of particular concerns are the points the learner makes, how he makes
sense of them, the way in which his essay is organised and the value he
attaches to different aspects of the topic. Coursework essays are written
responses and learners are expected, for the most part, to use their own
words and communicate their ideas clearly and persuasively.

(b) Examination Essays

These are short essays written as part of a formal examination. Surely you
have had experience with this format of assessment! For example, in a 2 to 3-
hour examination, learners may be required to answer about three essay
questions allotting about 35 to 45 minutes per question. In practice, there is
much variation in the number of questions asked and the duration of the
examination. In some situations there is a choice of questions while in other
situations there is no choice. The reason for controlling the choice is to ensure
that learners are examined over a comparable range. The discussion that
follows focuses on the examination essay particularly the use of essay
questions in examinations which is commonly a „closed-book‰ setting. We
will discuss further about coursework essays in Topic 5 under Projects.
ACTIVITY 5.1
Select a few essay questions that have been used in tests or examinations.
To what extent do these questions meet the criteria of an essay question
as defined by Stalnaker (1951) and elaborated by Reiner, Bothell,
Sudweeks and Wood (2002)?
Share your answer with your course mates.
5.3 WHY ARE ESSAY QUESTIONS USED?

Essay questions are used to assess learning because of the following reasons:
(a) Essay questions provide an effective way of assessing complex learning

outcomes. They enable the assessment of learnersÊ ability to synthesise ideas,
to organise and express ideas, and to evaluate the worth of ideas. These
abilities cannot be effectively assessed directly with other paper and pencil
test items.
(b) Essay questions allow learners to demonstrate their reasoning. Essay

questions not only allow learners to present an answer to a question but also
to explain how they arrived at their conclusion. This allows teachers to gain
insights into a learnerÊs way of viewing and solving problems. With such
insights, teachers are able to detect problems which learners may have with
their reasoning process and help them overcome those problems.

(c) Essay questions provide authentic experiences. „Constructing responses‰

are closer to real life than selecting responses (as in the case of objective tests).
Problem-solving and decision-making are vital life competencies which
require the ability to construct a solution or decision rather than selecting
a solution or decision from a limited set of possibilities. In the work
environment it is unlikely that an employer will give a list of „four options‰
for the worker to choose from when asked to solve a problem! In most cases,
the worker will be required to construct a response. Hence, essay items are
closer to real life than objective test questions because in real life learners
typically construct responses, not select them.
5.4 DECIDING WHETHER TO USE ESSAY OR

OBJECTIVE QUESTIONS
Essay questions should strive for higher-order thinking skills.
The decision whether to use essay questions or objective questions in examinations

can be problematic for some educators. In such a situation one has to go back to
the objective of the assessment. What learning outcomes do you intend to achieve?
Essay questions are generally suitable for the following purposes:
(a) To assess learnersÊ understanding of subject-matter content; and
(b) To assess thinking skills that require more than simple verbatim recall of
information by challenging learners to reason with their knowledge.
It is challenging to write test items to tap into higher-order thinking. However,

learnersÊ understanding of subject-matter content and many of the other higher-
order thinking skills could also be assessed through objective items. When in
doubt as to whether to use an essay question or an objective question, just
remember that essay questions are used to assess learnersÊ ability to construct
rather than select answers.
To determine what type of test (essay or objective) to use, it is helpful that you
examine the verb(s) that best describes the desired ability to be assessed. We have
discussed these verbs in Topic 2. These verbs indicate what learners are expected
to do and how they should respond. These verbs serve to channel and focus
learnersÊ responses towards the performance of specific tasks. Some verbs clearly
indicate that learners need to construct rather than select their answer (for
example, to explain). Other verbs indicate that the intended learning outcome is
focused on learnersÊ ability to recall information (for example, to list). Perhaps,
recall is best assessed through objectively scored items. Verbs that test for

understanding of subject-matter content or other forms of higher-order thinking

but do not specify whether the learner is to construct or to select the response (for
example, to interpret) can be assessed either by essay questions or by objective
items.
SELF-CHECK 5.1
Compare, explain, arrange, apply, state, classify, design, illustrate,
describe, name, complete, choose, defend, name.
Which of the above list of verbs is best assessed by essay questions or

objective tests or both?
5.5 LIMITATIONS OF ESSAY QUESTIONS

While essay questions are popular because they enable the assessment of higher-
order learning outcomes, this format of evaluating learners in examinations has a
number of limitations which should be kept in mind. The limitations are:
(a) One purpose of testing is to assess a learnerÊs mastery of the subject matter.
In most cases it is not possible to assess the learnerÊs mastery of the complete
subject-matter domain with just a few questions. Because of the time it takes
for learners to respond to essay questions and for markers to mark learnersÊ
responses, the number of essay questions that can be included in a test is
limited. Therefore, using essay questions will limit the degree to which the
test is representative of the subject-matter domain, thereby reducing content
validity. For instance, a test of 80 multiple-choice questions will most likely
cover more of the content domain than a test of 3 or 4 essay questions.
(b) Essay questions have reliability limitations. While essay questions allow
learners some flexibility in formulating their responses, the reliability of
marking or grading is questionable. Different markers or graders may vary
in their marking or grading of the same or similar responses (inter-scorer
reliability) and one marker can vary significantly in his marking or grading
consistency across questions depending on many factors (intra-scorer
reliability). Therefore, essay answers of similar quality may receive notably
different scores. Characteristics of the learner, length and legibility of
responses and personal preferences of the marker or grader with regard to
the content and structure of the responses are some of the factors that may
lead to unreliable marking or grading.

(c) Essay questions require more time for marking learner responses. Teachers
need to invest large amounts of time to read and mark learnersÊ responses to
essay questions. On the other hand, relatively little or no time is required of
teachers for scoring objective test items such as the multiple-choice items and
matching exercises.
(d) As mentioned earlier, one of the strengths of essay questions is that they
provide learners with authentic experiences because learners are challenged
to construct rather than to select their responses. To what extent does the
short time given affect learner response? Learners have relatively little time
to construct their responses and it does not allow them to pay appropriate
attention to the complex process of organising, writing, and reviewing their
responses. In fact, in responding to essay questions, learners use a writing
process that is quite different from the typical process that produces excellent
writing (such as draft, review, revise and evaluate). In addition learners
usually have no resources to aid their writing when answering essay
questions (such as dictionary and thesaurus). These limitations may offset
whatever advantage accrued from the fact that responses to essay questions
are more authentic than responses to multiple-choice items.
5.6 SOME MISCONCEPTIONS ABOUT ESSAY

QUESTIONS
Besides the limitations of essay questions, there are also some misconceptions
about this format of assessment such as:
(a) By their Very Nature Essay Questions Assess Higher-order Thinking (HOT)
Whether or not an essay item assesses HOT depends on the design of the
question and how learnersÊ responses are scored. An essay question does not
automatically assess higher-order thinking skills (HOTS). It is possible to
write essay questions that simply assess recall. Also, if a teacher designs an
essay question meant to assess HOT but then scores learnersÊ responses in a
way that only rewards recall ability, this means the teacher is not assessing
HOT. Teachers must be well-trained to design and write HOTS items.
(b) Essay Questions are Easy to Construct

Essay questions are easier to construct than multiple-choice items because
teachers do not have to create effective distractors. However, that does not
mean that good essay questions are easy to construct. They may be easier to
construct in a relative sense, but they still require a lot of effort and time.
Essay questions that are hastily constructed without much thought and
review usually function poorly.

(c) The Use of Essay Questions Eliminates the Problem of Guessing

One of the drawbacks of objective test items is that learners sometimes get
the right answer by guessing which of the presented options is correct. This
problem does not exist with essay questions because learners need to
generate the answer rather than identifying it from a set of options provided.
However, the use of essay questions introduces bluffing which is another
form of guessing. Some learners are „good‰ at using various methods of
bluffing (such as vague generalities, padding and name-dropping) to add
credibility to an otherwise weak answer. Thus, the use of essay questions
changes the nature of the guessing that occurs but does not eliminate it.
(d) Essay Questions Benefit All Learners by Placing Emphasis on the Importance
of Written Communication Skills
Written communication is a life competency that is required for effective and
successful performance in many vocations. Essay questions challenge
learners to organise and express the subject matter and problem solutions in
their own words, thereby giving them a chance to practice written
communication skills that will be helpful to them in future vocational
responsibilities. At the same time, the focus on written communication skills
is also a serious disadvantage for learners who have marginal writing skills
but know the subject matter being assessed. For learners who are
knowledgeable in the subject matter but obtain low scores because of their
inability to write well, the validity of the test scores is compromised.
(e) Essay Questions Encourage Learners To Prepare More Thoroughly

Some research seems to indicate that learners are more thorough in their
preparation for essay questions than in their preparation for objective tests
such as multiple-choice questions. However, after an extensive review of
existing literature and research on the matter, Crook (1988) concluded that
learners are more prepared based on the expectations teachers place upon
them (such as HOT, breadth and depth of content) than by the type of test
question items they expect to be given.
SELF-CHECK 5.2
1. What are some limitations in the use of essay questions?
2. Suggest other weaknesses of using essay questions in examinations.
3. List some of the misconceptions of essay questions.

ACTIVITY 5.2
Compare the following two essay questions and decide which one
assesses higher-order thinking skills. Discuss your answer with your
course mates.
(a) What are the major advantages and limitations of solar energy?
(b) Given its advantages and limitations, should governments spend

money developing solar energy?
5.7 SOME GUIDELINES FOR CONSTRUCTING

ESSAY QUESTIONS
When constructing essay questions, whether it is for coursework assessment or
examination, the most important thing is to ensure that learners have a clear idea
on what they are expected to do after they have read the question or problem
presented.
The following are specific guidelines that can help you improve existing essay
questions or create new ones:
(a) Clearly Define the Intended Learning Outcome to be Assessed by the

Question
Knowing the intended learning outcome is crucial for designing essay
questions. In specifying the intended learning outcome, teachers clarify the
performance that learners should be able to demonstrate as a result of what
they have learned. The intended learning outcome typically begins with a
verb that describes an observable behaviour, action or outcome that learners
should demonstrate. The focus is on what learners should be able to do and
not on the learning or teaching process. Reviewing a list of verbs can help to
clarify what ability learners should demonstrate and to clearly define the
intended learning outcome to be assessed (refer to subtopic 5.8).

(b) Avoid Using Essay Questions for Intended Learning Outcomes that Can be
Better Assessed with Other Types of Assessment
Some types of learning outcomes can be more efficiently and more reliably
assessed with objective tests than with essay questions. Since essay questions
sample a limited range of subject-matter content, are more time consuming
to score and involve greater subjectivity in scoring, the use of essay questions
should be reserved for learning outcomes that cannot be better assessed by
some other means.
For Example:
Learning outcome: To be able to differentiate the reproductive habits of

birds and amphibians
Essay question: What are the differences in egg laying characteristics

between birds and amphibians?
Note: This learning outcome can be better assessed by using an objective

test.
Objective Test Item:

Which of the following differences between birds and amphibians is
correct?
Birds Amphibians
A Lay a few eggs at a time Lay many eggs at a time
B Lay eggs Gives birth
C Do not incubate eggs Incubate eggs
D Lay eggs in nest Lay eggs on land
(c) Clarity about the Task and Scope

Essay questions have two variable elements, namely the degree to which the
task is structured and the degree to which the scope of the content is focused.
There is still confusion among educators whether more structure of the task
and more focus on the content are better than less structure and less focus.
When the task is more structured and the scope of content is more focused,
two problems are reduced:
(i) The problem of learner responses containing ideas that were not meant
to be assessed; and

(ii) The problem of extreme subjectivity when scoring learner answers or

responses.
Although more structure helps to avoid these problems, how much and what
kind of structure and focus to provide is dependent on the intended learning
outcome that is to be assessed by the essay question. The process of writing
effective essay questions involves defining the task and delimiting the scope
of the content in an effort to create an effective question that is aligned with
the intended learning outcome to be assessed (illustrated in Figure 5.1).
Figure 5.1: Alignment between content, learning activities and assessment tasks
Source: Phillips, Ansary Ahmed and Kuldip Kaur (2005)
The alignment shown in Figure 5.1 is absolutely necessary in order to obtain the
learnerÊs response that can be accepted as evidence that the learner has achieved
the intended learning outcome. Hence, the essay question must be carefully and
thoughtfully written in such a way that it elicits learner responses which provide
the teacher with valid and reliable evidence about the learnersÊ achievement of the
intended learning outcome. Failure to establish adequate and effective limits for
learnersÊ answers to the essay question may result in learners setting their own
boundaries for their responses. This means that learners might provide answers
that are outside the intended task or address only a part of the intended task. When
this happens the teacher is left with unreliable and invalid information about the
learnersÊ achievement of the intended learning outcome. Moreover, there is no
basis for marking or grading the learnersÊ answers. Therefore, it is the
responsibility of the teacher to write essay questions in such a way that they
provide learners with clear boundaries for their answers or responses.

For Example:
Improving Clarity of Task and Scope of Essay Questions
Weak Essay Question:

Evaluate the impact of the Industrial Revolution on England.
The verb is „evaluate‰ which is the task the learner is supposed to do. The scope
of the question is the impact of the Industrial Revolution on England. Very little
guidance is given to learners about the task of evaluating and the scope of the
task. A learner reading the question may ask:
(a) The impact on what in England? The economy? Foreign trade? A

particular group of people? [The scope is not clear].
(b) Evaluate based on what criteria? The significance of the revolution? The
quality of life in England? Progress in technological advancement? [The
task is not clear].
(c) What exactly do you want me to do in my evaluation? [The task is not

clear]
Improved Essay Question:

Evaluate the impact of the Industrial Revolution on the quality of family life in
England. Explain whether families were able to provide for the education of
their children.
The improved question delimits the task for learners by specifying a particular
unit of society in England affected by the Industrial Revolution (family). The
task is also delimited by giving learners a criterion for evaluating the impact of
the Industrial Revolution (whether or not families were able to provide for their
childrenÊs education). Learners are clearer as to what must be done to
„evaluate.‰ They need to explain how the family has changed and judge
whether the changes are an improvement for their children.
SELF-CHECK 5.3
1. When would you decide to use an objective item rather than an

essay question to assess learning?
2. What is the difference between the task and scope of an essay

question?

(d) Questions that are Fair

One of the challenges that teachers face when composing essay questions is
that they may be tempted to demand unreasonable content expertise on the
part of the learners because of their extensive experience with the subject
matter. Hence, teachers need to make sure that their learners can „be
expected to have adequate material with which to answer the question.‰
(Stalnaker, 1952). In addition, teachers should ask themselves if learners can
be reasonably expected to adequately perform the thought processes which
are required of them in the task. For assessment to be fair, teachers need to
provide their learners with sufficient instruction and practice in the subject
matter and the thought processes to be assessed.
Another important element is to avoid using indeterminate questions. A

question is indeterminate if it is so unstructured that learners can redefine
the problem and focus on some aspects which they are thoroughly familiar
with or if experts in the subject matter cannot agree that one answer is better
than another. One way to avoid indeterminate questions is to stay away from
vocabulary that is ambiguous. For example, teachers should avoid using the
verb „discuss‰ in an essay question. This verb is simply too broad and vague.
Moreover, teachers should also avoid including vocabulary that is too
advanced for learners.
(e) Specify the Approximate Time Limit and Marks Allotted for each Question
Specifying the approximate time limit helps learners allocate their time in
answering several essay questions. Without such guidelines learners may be
at a loss as to how much time to spend on a question. When deciding the
guidelines for how much time should be spent on a question, keep the slower
learners and learners with certain disabilities in mind. In addition, make sure
that learners can be realistically expected to provide an adequate answer in
the given or suggested time. Similarly, state the marks allotted for each
question so that learners can decide how much they should write for the
question.
(f) Use Several Relatively Short Essay Questions Rather than One Long Essay
Question
Only a very limited number of essay questions can be included in a test
because of the time it takes for learners to respond to them and the time it
takes for teachers to grade the learnersÊ responses. This creates a challenge
with regard to designing valid essay questions. Shorter essay questions are
better suited to assess the depth of learnersÊ learning within a subject whereas
longer essay questions are better suited to assess the breadth of learnersÊ
learning within a subject. As such, there is a trade-off when choosing
between several short essay questions and one long essay question. Focus on

assessing the depth of learnersÊ learning limits the assessment of the breadth
of learnersÊ learning within the same subject and vice versa. When choosing
between using several short essay questions or one long one, also keep in
mind that short essays are generally easier to mark than long essay questions.
(g) Avoid the Use of Optional Questions

Learners should not be permitted to choose one essay question from two or
more optional questions. The use of optional questions should be avoided
for the following reasons:
(i) Learners may waste time deciding on an option; and
(ii) Some questions are likely to be harder which could make the
comparative assessment of learnersÊ abilities unfair.
The issue of the use of optional questions is debatable. It is often practiced

especially in higher education and learners often demand that they be given
choices. The practice is acceptable if it can be assured that the questions have
equivalent difficulty levels and that the tasks as well as the scope required
by the questions are equivalent.
(h) Constructing Higher-order Thinking Questions

Higher-order thinking (HOT) is thinking on a level that is higher than
memorising facts. Any question which merely assesses learnersÊ recall is
when the answer has already been previously discussed in the classroom.
One way to write HOT essay questions is to present learners with a situation
they have not previously encountered so that they must reason with their
knowledge and provide an authentic assessment of complex thinking. For
example, present something in the form of introductory text, visuals,
scenarios, resource material or problems of some sort for learners to think
about.
An advantage of presenting a specific problem situation which has not been

encountered before is that learners learn to apply their thinking to real-life
context.
Teachers should not demand unreasonable content expertise on the part of

the learners.

Improving Essay Questions through Preview and Review

The following steps can help you improve the essay item before and
after you administer it to your learners.
PREVIEW (before handing out the essay question to the learners)
 Predict LearnersÊ Responses

Try to respond to the question from the perspective of a typical learner.
Evaluate whether learners have the content knowledge and the skills
necessary to adequately respond to the question. Weaknesses identified
in the essay question should be amended before handing out the exam.
 Write A Model Answer

Before administering the question, write the model answer or at least an
outline of major points that should be included in the answer. Writing
the model answer allows the teacher to reflect on the clarity of the essay
question. Furthermore, the model answer serves as a basis for grading
learnersÊ responses. Once the model answer has been written, compare
its alignment with the essay question and the intended learning
outcome. Make changes as needed so that the intended learning
outcome, the essay question and the model answer are aligned with each
other.
 Ask a Knowledgeable Colleague to Critically Review the Essay
Question, the Model Answer and the Intended Learning Outcome
to Check for Alignment
Before administering the essay question in a test, ask a person
knowledgeable in the subject to critically review the essay question, the
model answer and the intended learning outcome to determine how
well they are aligned with each other.
REVIEW (after receiving the learner responses)
 Review LearnersÊ Responses to the Essay Question

After learners have completed the essay question, carefully review the
range of answers given and the manner in which learners have
interpreted the question. Make revisions based on the findings. Writing
good essay questions is a process that requires time and practice.
Carefully studying the learnersÊ responses can help to evaluate learners'
understanding of the question as well as the effectiveness of the question
in assessing the intended learning outcomes.

SELF-CHECK 5.4
1. Why should you specify the time to be allotted for each question?
2. Why should you avoid optional questions?
3. What do you mean by questions should be „fair‰?
4. What should you do before and after administering a test?
5. What are HOT essay questions?
5.8 VERBS DESCRIBING VARIOUS KINDS OF

MENTAL TASKS
Using the lists suggested by Moss and Holder (1988) and Anderson and Krathwohl
(2001), Reiner, Bothell, Sudweeks and Wood (2002) proposed the following list of
verbs that describe mental tasks to be performed (Table 5.1). The definitions
specify thought processes that a person must perform in order to complete the
mental task. Note that this list is not exhaustive and local examples have been
introduced to illustrate the mental tasks required for each essay question.
Table 5.1: Verb Definitions and Examples of Mental Tasks to be Performed
Verb Definition and Example

Analyse Break material into its constituent parts and determine how the parts
relate to one another and to an overall structure or purpose.
Example: Analyse the meaning of the line „He saw a dead crow, in a
drain, near the post office‰ in the poem The Dead Crow.
Apply Decide which abstractions (concepts, principles, rules, laws, theories and
generalisations) are relevant in a problem situation.
Example: Apply the principles of supply and demand to explain why
the consumer price index (CPI) in Malaysia has increased over
the last three months.
Attribute Determine a point of view, bias, value or intent underlying the presented
material.
Example: Determine the point of view of the author in the article about
her political perspective.

Classify Determine which category belongs to something.

Example: Classify organisms into vertebrates and invertebrates.
Compare Identify and describe points of similarity.
Example: Compare the role of the Dewan Rakyat and Dewan Negara.
Compose Make or form by combining things, parts or elements.
Example: Compose an effective plan for solving the flood problem in
Kuala Lumpur.
Contrast Bring out the points of difference.
Example: Contrast the contribution of Tun Hussein Onn and Tun Abdul
Razak Hussein to the political stability of Malaysia.
Create Put elements together to form a coherent or functional whole, reorganise
elements into a new pattern or structure.
Example: Create a solution for the traffic problem in Kuala Lumpur.
Critique Detect consistencies and inconsistencies between a product and relevant
external criteria; detect the appropriateness of a procedure for a given
problem.
Example: Base on your judgement, which of the two methods is the best
way for solving the water shortage problem?
Defend Develop and present an argument to support a recommendation, to
maintain or revise a policy, programme or propose a course of action.
Example: Defend the decision by the government to raise fuel price.
Define Give the meaning of a word or concept; place it in the class in which it
belongs to; distinguish it from other items in the same class.
Example: Define the term „chemical weathering‰.
Describe Give an account of; tell or depict in words; represent or delineate by a
word picture.
Example: Describe the contributions of ZaÊba to the development of
Bahasa Melayu.
Design Devise a procedure for accomplishing some tasks.
Example: Design an experiment to prove that 21 per cent of air is
composed of oxygen.
Develop Bring to a more advanced, effective or usable state; produce.
Example: Develop an essay question by improving upon a less effective
essay question.

Differentiate Distinguish relevant from irrelevant parts or important from

unimportant parts of the presented material.
Example: Distinguish between supply and demand in determining
price.
Explain Make clear the cause or reason of something; construct a cause-and-
effect model of a system; tell „how‰ to do; tell the meaning of.
Example: Explain the causes of the First World War.
Evaluate Make judgments based on criteria and standards; determine the
significance, value, quality or relevance of; give the good points and the
bad ones; identify and describe advantages and limitations.
Example: Evaluate the contribution of the microchip in
telecommunications.
Generate Introduce alternative hypotheses, examples, solutions and proposals
based on criteria.
Example: Generate hypotheses to account for an observed
phenomenon.
Identify Recognise as being a particular person or thing.
Example: Identify the characteristics of the Mediterranean climate.
Illustrate Use a word picture, a diagram, a chart or a concrete example to clarify a
point.
Example: Illustrate the use of catapults in the amphibious warfare of
Alexander.
Infer Draw a logical conclusion from the presented information.
Example: What can you infer from the results of the experiment?
Interpret Give the meaning of; change from one form of representation (such as
numerical) to another (such as verbal).
Example: Interpret the poetic line, „The sound of a cobweb snapping is
the noise of my life.‰
Justify Show good reasons for; give your evidence; present facts to support your
position.
Example: Justify the American entry into World War II.

List Create a series of names or other items.

Example: List the major functions of the human heart.
Predict Know or tell beforehand with precision of calculation, knowledge or
shrewd inference from facts or experience about what will happen.
Example: Predict the outcome of a chemical reaction.
Propose Offer for consideration, acceptance, or action; suggest.
Example: Propose a solution for landslides along the North-South
Highway.
Recognise Locate knowledge in long-term memory that is consistent with the
presented material.
Example: Recognise the important events that took place on the journey
to independence in Malaysia.
Recall Retrieve relevant knowledge from long-term memory.
Example: Recall the dates of important events in Islamic history.
Summarise Sum up; give the main points briefly.
Example: Summarise the ways in which man preserves food.
Trace Follow the course of; follow the trail of; give a description of progress.
Example: Trace the development of television in providing school
instruction.
SELF-CHECK 5.5
1. Select some essay questions in your subject area and examine
whether the verbs used are similar to the list in Table 5.1. Do you
think the tasks required by the verbs used are appropriate?
2. Do you think learners are able to differentiate between the tasks
required in the verbs listed?
3. Are teachers able to describe to learners the tasks required by these
verbs?

The following shows a checklist for writing essay questions.
Checklist for Writing Essay Questions

Could the item be better assessed with a different kind of assessment?
Is the essay question aligned with the intended learning outcome?
Is the length of the essay question appropriate or should it be split
into short essay questions?
Does the essay question contain a clear task and a specified scope?
Is the question worded and structured clearly?
Do learners know the recommended time for each question?
Have you written a model answer or an outline for the answer?
Do you have a person knowledgeable in the subject matter to review
the question?
5.9 MARKING ESSAY QUESTIONS

Marking or grading of essays is a notoriously unreliable activity. If we read an
essay at two separate times, chances are good that we will give the essay a different
grade each time. If two or more teachers read the essay, the grades will likely differ,
often dramatically so. We all like to think that we are the exceptions but study after
study of well-meaning and conscientious teachers show that essay grading is
unreliable (Ebel, 1972; Hills, 1976; McKeachie, 1986; White, 1985). Eliminating the
problem is unlikely but we can take steps to improve grading reliability. Using a
scoring guide or marking scheme helps control the shifting of standards that
inevitably takes place as we read a collection of essays and papers.
The two most common forms of scoring guides used are the analytic and holistic
methods.
(a) Analytic Method

Analytical marking is the system most frequently used in large-scale public
examinations (such as the SPM and STPM) as well as by teachers in the
classrooms. Its basic characteristic is the marking scheme and the marks
allocated (as shown in Figure 5.2). The marking scheme consists of a list of
the major elements the teacher believes learners should include in the ideal
answer. In the example, marks are allocated for describing each of the five

factors (three marks for each factor) and one mark is given for providing
a relevant example. Marks are also allotted for the „introduction‰ and
„conclusion‰ which are important elements in an essay.
For Example:
Explain the five factors that determine the location of a manufacturing plant.
Marking scheme (maximum 25 points)
1. Introduction Points
 Sets the organisation of the essay 2
2. Site or land
 Description (how site influences location) 3
 Example 1
3. Labour
 Description (role of workers) 3
 Example 1
4. Capital (Loan from banks)

 Description (obtaining a loan) 3
 Example 1
5. Transport system
 Description (access to port, population) 3
 Example 1
6. Entrepreneurial skills
 Description (kind of skill) 3
 Example 1
7. Conclusion (interplay of the five factors) 3

TOTAL 25
Figure 5.2: Sample of a marking scheme using the analytical method
The marker reads and compares the learnerÊs answer with the model answer.
If all the necessary elements are present, the learner receives the maximum
number of points. Partial credit is given based on the elements included in
the answer. In order to arrive at the overall exam score, the instructor adds
the points earned on the separate questions.

Identify in advance what will be worth a point and how many points are
available for each question. Make sure that learners are aware of this so that
they do not give more (or less) than necessary and they know precisely what
you are looking for. If learners produce an unexpected but correct example,
give them the point immediately and add that point to your answer key so
that the next learner will also get the point.
(b) Holistic Method (Global or Impressionistic Marking)

The holistic approach to scoring essay questions involves reading an entire
response and assigning it to a category identified by a score or grade. This
method involves considering the learnerÊs answer as a whole and judging
the total quality of the answer relative to other learnersÊ responses or the total
quality of the answer based on certain criteria that you have developed.
Think of it as sorting into bins. You read the answer to a particular question
and assign it to the appropriate bins. The best answers go into the
„exemplary‰ bin, the good goes into the „good‰ bin and the weak answer
goes into the „poor‰ bin. Then points are written on each paper appropriate
to the bin it is in. It is based on overall impression, not point by point using
the rubric or marking scheme, but based on a holistic approach (as shown in
Table 5.2). You cannot be much more refined than five divisions, six at most.
You have to re-read everything to ensure that all the papers in the five piles
are roughly the same. And all those in the middle pile are roughly the same
and less good than those in the top pile and so on.
You can develop a description of the type of response that would illustrate
each category before you start and try out this draft version using several
actual papers. After reading and categorising all of the papers, it is a good
idea to re-examine the papers within a category to see if they are similar
enough in quality to receive the same points or grade. It may be faster to read
essays holistically and provide only an overall score or grade but learners do
not receive much feedback about their strengths and weaknesses. Some
instructors who use holistic scoring also write brief comments on each paper
to point out one or two strengths and/or weaknesses so that learners will
have a better idea as to why their responses received such scores.

Table 5.2: Sample of a Marking Scheme Using the Holistic Method
Level of
General Presentation Reasoning, Argumentation
Achievement
Exemplary  Addresses the question.  Demonstrates accurate and
(10 pts)  States a relevant complete understanding of the
argument. question.
 Presents arguments in a  Uses several arguments and

logical order. backs arguments with examples
and data that support the
 Uses acceptable style and conclusion.
grammar (no errors).
Good  Combination of above  Same as above but less thorough
(8 points) traits but less consistently even though still accurate.
represented (1-2 errors).  Uses only one argument and
example that supports the
conclusion.
Adequate  Does not address the  Demonstrates minimal
(6 pts) question explicitly, though understanding of the question,
does so tangentially. still accurate.
 States a somewhat  Uses a small subset of possible
relevant argument. ideas for support of the
 Presents some arguments argument.
in a logical order.
 Uses adequate style and
grammar (more than 2
errors).
Poor  Does not address the  Does not demonstrate
(4 pts) question. understanding of the question,
 States no relevant inaccurate.
arguments.  Does not provide evidence to
 Is not clearly or logically support response to the
organised. question.
 Fails to use acceptable

style and grammar.
No answer
(0 pts)

SELF-CHECK 5.6
1. Compare and contrast the analytical method and holistic method of
marking essays.
2. Which method is widely practised in your institution? Why?
Lastly before we end the topic, let us look at some suggestions for marking or
scoring essays:
(a) Grade papers anonymously. This will help control the influence of our
expectations regarding the learner being evaluated.
(b) Read and score the answers to one question before going on to the next
question. In other words, score all the learnersÊ responses to Question 1
before looking at Question 2. This helps to keep one frame of reference and
one set of criteria in mind throughout all the papers, which will also result in
more consistent grading. It also prevents an impression that we form in
reading one question from being carried over to our reading of the learnerÊs
next answer. If a learner has not done a good job on the first question, for
example, we might allow this impression to influence our evaluation of the
learnerÊs second answer. However, if other learnersÊ papers come in
between, we are less likely to be influenced by the original impression.
(c) If possible, try to grade all the answers to one particular question without
interruption. Our standards might vary from morning to night or from one
day to the next.
(d) Shuffle the papers after each item is scored throughout all the papers.
Changing the order reduces the context effect and the possibility that a
learnerÊs score is the result of the location of the paper in relation to other
papers. For example, if RakeshÊs „B‰ work is always followed by JamalÊs „A‰
work, then it might look more like „C‰ work and his grade would be lower
than if his paper were somewhere else in the stack.
(e) Decide in advance how you are going to handle extraneous factors and be
consistent in applying the rule. Learners should be informed about how you
treat matters such as misspelled words, neatness, handwriting, grammar and
so on.

(f) Be on the alert for bluffing. Some learners who do not know the answer may
write a well-organised coherent essay which may contain material that is
irrelevant to the question. Decide how to treat irrelevant or inaccurate
information contained in learnersÊ answers. We should not give credit for
irrelevant material. It is not fair to other learners who may also have
preferred to write on another topic but instead wrote on the required
question.
(g) Write comments on learnersÊ test scripts. TeacherÊs comments make essay
tests a good learning experience for learners. They also serve to refresh your
memory of your evaluation should the learner question the grade he
receives.
(h) Be aware of the order in which papers are marked as it can have an impact
on the grades that are awarded. A marker may grow more critical (or more
lenient) after having read several papers, thus the early papers receive lower
(or higher) marks than papers of similar quality that are scored later.
(i) When learners are directed to take a stand on a controversial issue, the
marker must be careful to insure that the evidence and the way it is presented
is evaluated, NOT the position taken by the learner. If the learner takes a
position contrary to that of the marker, the marker must be aware of his or
her own possible bias in marking the essay because the learnerÊs position
differs from that of the marker.
 An essay is a test item which requires a response composed by the examinee

usually in the form of one or more sentences of a nature that no single response
or pattern of responses can be listed as correct, and the accuracy and quality of
which can be judged subjectively only by one who is skilled or informed in the
subject matter.
 There are two types of essays based on their function, namely, coursework
essay and examination essay.
 Essay questions provide an effective way of assessing complex learning

outcomes.

 Essay questions provide authentic experiences because constructing responses

are closer to real life than selecting responses.
 It is not possible with the use of just a few essay questions to assess the learnerÊs
mastery of the complete subject-matter domain.
 Essay questions have two variable elements, which are the degree to which the
task is structured and the degree to which the scope of the content is focused.
 Whether or not an essay item assesses higher-order thinking (HOT) depends

on the design of the question and how learnersÊ responses are scored.
 Specifying the approximate time limit helps learners allocate their time in
answering several essay questions during a test.
 Avoid using essay questions for intended learning outcomes that can be better
assessed with other types of assessment.
 Analytical marking is the system most frequently used in large-scale public

examinations and also by teachers in the classrooms. Its basic characteristic is
the marking scheme and the marks allocated for elements in the answer.
 The holistic approach to scoring essay questions involves reading an entire

response and assigning it to a category identified by a score or grade.
Analytical method Intra-scorer reliability

Complex learning outcomes Marking scheme
Constructed responses Mental tasks
Essay questions Model answer
Grading Rubric
Holistic method Task and scope
Inter-scorer reliability Time consuming

Topic  Authentic
Assessments
6
LEARNING OUTCOMES
1. Define authentic assessment;
2. Discuss authentic assessments in the classroom;
3. Differentiate the advantages and criticisms of authentic assessment;
and
4. Compare authentic and traditional assessments.
 INTRODUCTION
Many teachers use traditional assessment tools such as multiple-choice and essay
tests to assess their learners. How well do these multiple-choice or essay tests
really evaluate learner understanding and achievement? These traditional
assessment tools do serve a role in the assessment of learner outcomes. However,
assessment does not always have to involve paper and pencil tests. It can also be a
project, an observation or a task as long as it is able to demonstrate a learner has
learned the material. Are these alternative assessment tools more effective than
traditional ones?
Some classroom teachers are using testing strategies that do not focus entirely on
recalling facts. Instead, they ask learners to demonstrate skills and concepts they
have learned. Teachers may want to ask the learners to apply their skills to
authentic tasks and projects or to have learners demonstrate their knowledge in
ways that are much more applicable to life outside of the classroom. Learners must

108  TOPIC 6 AUTHENTIC ASSESSMENTS
then be trained to perform meaningful tasks that replicate real world challenges.
In other words, these teachers are trying to assess studentsÊ abilities in „real-
world‰ contexts. In order to do this, learners are asked to perform a task such as
to explain the historical events or solve math problems rather than select an answer
from a ready-made list.
This strategy of asking learners to perform real-world tasks that demonstrate

meaningful application of essential knowledge and skills is called authentic
assessment. An authentic assessment is to have learners demonstrate their
knowledge in ways that are much more applicable to life outside of the classroom.
6.1 WHAT IS AUTHENTIC ASSESSMENT?

According to Wiggins (1998), teachers become most effective when they seek
feedback from learners and their peers and use that feedback to adjust approaches
to design and teaching. Effective curriculum development reflects a three-stage
design process called backward design that delays the planning of classroom
activities until goals have been clarified and assessments designed. This process
helps to avoid the problems of textbook coverage and activity-oriented teaching in
which no clear priorities and purposes are apparent.
Learner and school performance gains are achieved through regular reviews of
results followed by targeted adjustments to curriculum and instruction.
Authentic assessment, in contrast to more traditional assessment, encourages the

integration of teaching, learning and assessment. In the traditional assessment
model, teaching and learning are often separated from assessment whereby a test
is administered after knowledge or skills have been acquired.
An authentic assessment usually includes a task for learners to perform and a

rubric by which their performance of the task will be assessed. Thus performing
science experiments, writing stories and reports, solving mathematics problems
that have real-world applications can be considered as authentic assessments.
Useful achievement data can be obtained via authentic assessments.

TOPIC 6 AUTHENTIC ASSESSMENTS  109
Teachers can teach learners how to do mathematics, do history and do science, not
just know them. Subsequently, to assess what the learners had learned, teachers
can ask learners to perform tasks that replicate the challenges faced by using
mathematical principles, conducting historical or scientific investigations.
Well-designed traditional classroom assessments such as tests and quizzes can

effectively determine whether or not learners have acquired a body of knowledge.
In contrast, authentic assessments require learners to demonstrate understanding
by performing a more complex task usually representative of a more meaningful
application. These tasks require learners to analyse, synthesise and apply what
they have learned in a substantial manner and learners create new meaning in the
process as well. In short, authentic assessment helps answer the question „How
well can you use what you know?‰ whereas traditional testing helps answer the
question „Do you know it?‰
The usual or traditional classroom assessments such as multiple-choice tests and

short-answer tests are just as important as authentic assessment. Authentic
assessments complement traditional assessments. Authentic assessments have
been gaining acceptance among early childhood and primary school teachers
where traditional assessments may not be appropriate.
Examples of authentic assessment strategies (Brady, 2012) include:
(a) Exhibiting an athletic skill;
(b) Constructing a short musical, dance or drama;
(c) Publishing a class brochure;
(d) Creative arts: role plays, oral presentations, performances;
(e) Planning: mind or concept maps, flow charts;
(f) ICT Tools: webpages, videos, photos, depiction of freeze frames;
(g) Creating: model building, creative writing;
(h) Evaluating: teacher-learner feedback, peer feedback, peer teaching; and
(i) Unstructured tasks: problem-solving tasks, open-ended questions, formal

and informal observations.

6.1.1 Types of Authentic Assessments

Authentic assessment is sometimes referred to as performance assessment,
alternative assessment or direct assessment.
(a) Performance Assessment

Authentic assessment is sometimes called performance assessment or
performance-based assessment because learners are asked to perform
meaningful tasks. Performance assessment is „a test in which the test taker
actually demonstrates the skills the test is intended to measure by doing real-
world tasks that require those skills, rather than by answering questions
asking how to do them‰ (Ark, 2013). Project-based learning (PBL) and
portfolio assignments are examples of performance assessments. With
performance assessment, teachers observe learners while they are
performing in the classroom and judge the level of proficiency demonstrated.
Authentic tasks are rooted in curriculum and teachers can develop tasks
based on what already works for them. Through this process, more evidence-
based assignments such as portfolios become more authentic and more
meaningful to learners.
(b) Alternative Assessment

Authentic assessment is an alternative to traditional assessment, hence the
term used. Learners participate actively in evaluating themselves and one
another through the use of checklists and rubrics in self and peer evaluations.
Alternative assessments measure performance in ways other than the
traditional paper and pencil or short-answer tests. For example, the Klang
Valley science teacher may have the learners identify the different pollutants
in the Klang River and report the information collected and analysed to the
local environmental council.
(c) Direct Assessment

Authentic assessment provides more direct evidence of meaningful
application of knowledge and skills. If a learner does well on a multiple-
choice test we might infer indirectly that the learner could apply that
knowledge in real-world contexts. However, we would be more comfortable
making that inference from a direct demonstration of the application such as
in the river pollutants example stated earlier. We do not just want learners to
know the content of the disciplines when they leave school. We want them
to apply other knowledge and skills that they have learned as well. Direct
evidence of learner learning is tangible, visible and measureable. It is a more
compelling evidence of knowing exactly what learners have or have not
learned. Teachers can directly look at learnersÊ work or performances to
determine what they have learned.

6.1.2 Characteristics of Authentic Assessments

Reeves et al. (2002) have much to say about the characteristics of authentic
learning. Their list of ten characteristics summarise the features of authentic
assessment as explained in the following:
(a) Authentic activities have real-world relevance: The assessment is meant to

focus on the impact of oneÊs work in real or realistic contexts.
(b) Authentic activities require learners to define the tasks and sub-tasks needed
to complete the activity: Problems inherent in the activities are open to
multiple interpretations rather than easily solved by the application of
existing algorithms.
(c) Authentic activities comprise complex tasks to be investigated by learners

over a sustained period of time: Activities are completed in days, weeks and
months rather than minutes or hours. They require significant investment of
time and intellectual resources.
(d) Authentic activities provide the opportunity for learners to examine the task
from different perspectives, using a variety of resources: The use of a variety
of resources rather than a limited number of preselected references requires
learners to detect relevant from irrelevant information.
(e) Authentic activities provide the opportunity to collaborate: Collaboration is

integral to the task, both within the course and the real world, rather than
achievable by the individual learner.
(f) Authentic activities provide the opportunity to reflect: Activities need to

enable learners to make choices and reflect on their learning, both individual
and socially.
(g) Authentic activities can be integrated and applied across different subject
areas and lead beyond domain-specific outcomes: Activities encourage
interdisciplinary perspectives and enable learners to play diverse roles, thus
building robust expertise rather than knowledge limited to a single well-
defined field or domain.
(h) Authentic activities are seamlessly integrated with assessment: Assessment

of activities is seamlessly integrated with the major task in a manner that
reflects real-world assessment rather than separate artificial assessment
which is removed from the nature of the task.
(i) Authentic activities create polished products valuable in their own right
rather than as preparation for something else.

(j) Authentic activities allow for competing solutions and diversity of outcomes:
Activities allow for a range and diversity of outcomes open to multiple
solutions of an original nature rather than a single correct response obtained
by the application of rules and procedures.
SELF-CHECK 6.1
1. What are the other names used to refer to authentic assessment?

Why?
2. To what extent has authentic assessment been used as an

assessment strategy in Malaysian schools?
6.2 HOW TO CREATE AUTHENTIC

ASSESSMENTS?
Authentic assessments can seem overwhelming at first but it need not be the case.
For a start, take small steps such as collecting data one day a week, analysing it
and using the results to group learners in the following week. Follow these helpful
steps to create your own authentic assessment:
(a) Identify which standards you want the learners to achieve through this
assessment. Standards, like goals, are statements of what learners should
know and be able to meet. Standards must be observable and measurable.
Teachers can observe a performance but not an understanding. Thus a
statement such as „Learners will understand how to add two-digit numbers‰
is not observable whereas a statement such as „Learners will correctly add
two-digit numbers‰ is observable and measurable.
(b) Choose a relevant task for the standard or set of standards so that learners
can demonstrate how they have or have not met the standards. In this step,
teachers may want to find a way in which learners can demonstrate how they
are fully capable of meeting the standard. For the standard such as
„Understand how to add two-digit numbers‰, the task may be to ask learners
to describe a real-life situation, story or problem. Teachers may elicit
strategies from the learners, asking them to demonstrate and explain their
reasoning to their classmates. That might take the form of a multimedia
presentation which learners develop (individually or collaboratively),
utilising Ten-Frames with some counters.

(c) Define the characteristics of good performance for the task. This will provide
useful information regarding how well learners have met the standards.
For this step, teachers identify the criteria for good performance of this task.
They may write down a few characteristics for successful completion of the
task.
(d) Create a rubric or set of guidelines for learners to follow so that they are able
to assess their work as they perform the assigned task.
SELF-CHECK 6.2
1. How do teachers use authentic assessments in the classroom?
2. Give examples of authentic tasks for your subject area.
6.2.1 Advantages of Authentic Assessments

According to Wiggins (1998), while standardised, multiple-choice tests can be
valid indicators of academic performance, they often mislead learners into
believing that learning is cramming and mislead teachers into believing that tests
are after-the-fact, contrived and irrelevant. The move towards more authentic
tasks and outcomes improves teaching and learning. More educators favour the
use of more authentic assessments as a means of measuring learning outcomes not
easily measured by standardised tests.
Authentic assessment helps learners view themselves as active participants who

are working on a relevant task rather than passive recipients of obscure facts. It
helps teachers by encouraging them to reflect on the relevance of what they teach
and provides results that are useful for improving instruction.
Authentic assessments focus on the learning process, sound instructional practices

and high-level thinking skills and proficiencies needed for success in the real
world. As a result, they offer a huge advantage for learners who have been exposed
to them over those who have not.
Authentic assessments have many benefits, the main benefit being that it ensures
learner success. Authentic assessments focus on progress rather than identifying
weaknesses.

Other important benefits of authentic assessment include:
(a) It has the advantage of providing parents and community members with
directly observable products and understandable evidence concerning their
learnersÊ performance. The quality of learnersÊ work is more discernible to
laypersons compared to the reliance on abstract statistical figures;
(b) Uses tasks that reflect normal classroom activities or real-life learning. The
tasks are a means for improving instruction, allowing teachers to plan a
comprehensive, developmentally oriented curriculum based on their
knowledge of each child;
(c) Focuses on higher-order thinking skills such as applying, analysing,

evaluating and creating in BloomÊs taxonomy;
(d) Embeds assessment in the classroom context;
(e) Requires active performance to demonstrate understanding. Kinaesthetic

learners prefer being involved in activities. They need to apply the
information and make it their own by constructing something or practising
a technique or skill;
(f) Promotes a wide range of assessment strategies; and
(g) Involves the teacher and learner collaboratively in determining the

assessment.
6.2.2 Criticisms of Authentic Assessments

Authentic assessment provides solutions but it also creates problems. Criticisms
of authentic assessments generally involve both the informal development of the
assessments and difficulty in ensuring test validity and reliability given the
subjective nature of human scoring rubrics as compared to computers scoring
multiple-choice test items. Many teachers shy away from authentic assessments
because this methodology is time intensive to manage, grade, monitor and
coordinate. Teachers find it hard to provide a consistent grading scheme. The
subjective method of grading may lead to biasness. Teachers also find that this
method is not practical for large groups of learners.

Rubrics are an essential component of authentic assessments. They are a

wonderful tool to ensure a more authentic assessment of learnersÊ work. However,
creating a rubric is time consuming in the initial stage but is worth the investment.
Kohn (2006) observed that a rubric which was used to standardise the way we
think about learner assignments has many faults. Learners who presumably grew
accustomed to rubrics in other classrooms now seemed unable to function unless
every required item is spelled out for them. However, this does not mean that the
rubric is defective. As long as the rubric is only one of several sources and does not
drive the instruction, it could conceivably play a constructive role.
Based on the value of authentic assessments to learner outcomes, the advantages

of authentic assessments outweigh these concerns. For example, once the
assessment guidelines and grading rubrics are created, they can be kept in file and
used year after year.
There is nothing new about this authentic assessment methodology. It is not some
kind of radical invention recently fabricated by the opponents of traditional tests
to challenge the testing industry. Rather it is a proven method of evaluating human
characteristics and has been in use for decades (Linquist, 1951).
SELF-CHECK 6.3
1. What are some criticisms of authentic assessments?
2. What are the benefits of authentic assessments?
6.3 DIFFERENCES BETWEEN TRADITIONAL

ASSESSMENT AND AUTHENTIC
ASSESSMENT
Assessment is authentic when we directly examine learner performance on worthy
intellectual tasks. Traditional assessment, by contrast, relies on indirect or proxy
„items‰ (which are efficient, simplistic substitutes) from which we think valid
inferences can be made about the learnerÊs performance at those valued challenges
(Wiggins, 1990).

Table 6.1 summarises the major differences between the authentic and traditional
assessments.
Table 6.1: Major Differences between Authentic and Traditional Assessments
Attributes Authentic Assessment Traditional Assessment

Reasoning and  Schools must help learners  Schools must teach this
Practice become proficient at body of knowledge and
performing the tasks they skills.
will encounter when they  To determine if teaching is
leave schools. successful, the school must
 To determine if teaching is then test learners to see if
successful, the school must they have acquired the
then ask learners to knowledge and skills.
perform meaningful tasks
that replicate real world
challenges to see if learners
are capable of doing so.
Assessment and  Assessment drives the  Curriculum drives the
Curriculum curriculum. That is, assessment. The body of
teachers first determine the knowledge is determined
tasks that learners will first. That knowledge
perform to demonstrate becomes the curriculum
their mastery, and then a that is to be delivered.
curriculum is developed  Subsequently, assessments
that will enable learners to are developed and
perform those tasks well, administered to determine
which would include the if acquisition of the
acquisition of essential curriculum has occurred.
knowledge and skills. This
has been referred to as
planning backwards.
Selecting a  Learners are required to  Learners are typically given
Response to demonstrate understanding several choices and asked
Performing a Task by performing a more to select the correct answer.
complex task usually
representative of more
meaningful applications.

Contrived to Real-  Learners are asked to  Tests offer contrived means

life demonstrate proficiency by (Example: select one out
doing something. four options in MCQ) of
assessment to increase the
number of times you can be
asked to demonstrate
proficiency in a short
period of time.
Recall or  Authentic assessments  Traditional tests tend to
Recognition of require learners to be reveal only whether the
Knowledge to effective performers of the learner can recognise and
Construction or acquired knowledge. recall what was learned out
Application of  Teachers using this of context. Learners are
Knowledge methodology often ask often asked to recall or
learners to analyse, recognise facts.
synthesise and apply what
they have learned in a
substantial manner and
learners create new
meaning in the process as
well.
Teacher-structured  Authentic assessments  What a learner can and will
to Learner- allow more learner choice demonstrate has been
structured and construction in carefully structured by the
determining what is person(s) who developed
presented as evidence of the test.
proficiency.  A learnerÊs attention will
 Even when learners cannot understandably be focused
choose their own topics or on, and limited to, what is
formats, there are usually in the test.
multiple acceptable routes
towards constructing a
product or performance.

Indirect Evidence  Authentic assessments offer  The evidence is indirect

to Direct Evidence more direct evidence of particularly for claims of
application and meaningful application in
construction of knowledge. complex, real-world
 Example: Asking a learner situations.
to write a critique should  Example: In MCQ, a learner
provide more direct effectively cannot critique
evidence of that skill than the arguments someone
asking the learner a series else has presented (an
of multiple-choice and important skill often
analytical questions about a required in the real world).
passage.
Reliability and  Authentic assessment  Traditional testing
Validity achieves validity and standardises objective items
reliability by emphasising and, hence, the right
and standardising the answer(s) for each problem.
appropriate criteria for  Validity on most multiple-
scoring such products. choice tests is determined
 Test validity should merely by matching items
depend in part upon to the curriculum content.
whether the test simulates
real-world tests of ability.
Source: Adapted from Authentic Assessment Toolbox by Jon Mueller at

http://jfmueller.faculty.noctrl.edu/toolbox/
6.4 ASSESSMENT TOOLS

Common assessment tools used are rubrics, rating scales and checklists. Rubrics,
checklists and rating scales are authentic assessment tools that state specific criteria
and allow teachers and learners to gather information and to make judgments
about what learners know and can do in relation to the outcomes.
The quality of information acquired through the use of checklists, rating scales
and rubrics is highly dependent on the quality of the descriptors chosen for the
assessment. Their benefit is also dependent on learnersÊ direct involvement in the
assessment and understanding of the feedback provided.
The purposes of checklists, rating scales and rubrics are to:
(a) Provide tools for systematic recording of observations;
(b) Provide tools for self-assessment;

(c) Provide samples of criteria for learners prior to collecting and evaluating
data on their work recording the development of specific skills, strategies,
attitudes and behaviours necessary for demonstrating learning; and
(d) Clarify learnersÊ instructional needs by presenting a record of current

accomplishments.
6.4.1 Scoring Rubrics

Authentic assessments are typically criterion-referenced measures. That is, an
assessment designed to measure learner performance against a fixed set of
predetermined criteria. The criteria or the learning standards are concise, written
descriptions of what learners are expected to know and be able to do at a specific
stage of their learning. To measure learner performance against a predetermined
set of criteria, a rubric or scoring scale which contains the essential criteria for the
task and appropriate levels of performance for each criterion is created. Rubrics
are usually used in scoring or grading written assignments or oral presentations.
However, they may be used to score any form of learner performance.
Scoring rubrics have become a common method for assessing learners. Scoring
rubrics are descriptive scoring schemes that are developed by teachers or other
evaluators to guide the analysis of the products or processes of learnersÊ efforts
(Brookhart, 1999). As scoring tools, rubrics are a way of describing evaluation
criteria based on the expected outcomes and performances of learners. Each rubric
consists of a set of scoring criteria and point values associated with these criteria.
In most rubrics the criteria are grouped into categories so that the teacher and the
learner can discriminate among the categories by level of performance.
Rubrics have been introduced into todayÊs classrooms in order to give learners a
better understanding of what is being assessed, what criteria the grades are based
upon as well as what clear and compelling product standards are being addressed.
The focus of rubrics and scoring guides is to monitor and adjust progress rather
than to only assess the end result.
As a guide for planning, rubrics and scoring guides give learners clear targets of
proficiency. With these assessments in hand, they know what quality looks like
before they start working. When learners use such assessments regularly to judge
their own work, they begin to accept more responsibility for the end product.
Rubrics and scoring guides offer several advantages for assessment:

Learner performance is improved by clearly showing them how their work is

assessed and what is expected;
(a) Learners become better judges of the quality of their own work;
(b) Learners have more informative feedback about their strengths and areas in
need of improvement;
(c) Learners become aware of the criteria to use in providing peer feedback;
(d) Criteria are determined in specific terms;
(e) Assessment is more objective and consistent;
(f) Amount of time spent assessing learner work is reduced;
(g) Effectiveness of instruction is examined using multiple methods; and
(h) Progress is measured and documented against benchmarks.
The following rubric shown in Table 6.2 covers the research portion of a project.
Table 6.2: Example of a Research Rubric
Criteria 1 2 3 Wt. Score

Number of 1ă4 5ă9 10ă12 x1
sources
Historical Lots of Few No apparent x3
accuracy historical inaccuracies. inaccuracies.
inaccuracies.
Sources of Cannot tell Can tell with Can easily x1
information from which difficulty tell which
source the where the sources the
information information information
came from. came from. was drawn
from.
Bibliography Bibliography Bibliography All relevant x1
contains very contains information
little most of the is included.
information. relevant
information.
Total
Source: Adapted from Research Rubric

http://jfmueller.faculty.noctrl.edu/toolbox/rubrics.htm

A rubric comprised two components, namely the criteria and the levels of
performance. Each rubric has at least two criteria and at least two levels of
performance. The criteria, which are the characteristics of good performance on a
task in this example, are listed on the left-hand column in the rubric (number of
sources, historical accuracy, sources of information and bibliography).
The rubric also contains a mechanism for assigning a score to each project. In the
second last column a weight (Wt.) is assigned for each criterion. Learners can
receive 1, 2 or 3 points for number of sources criterion. However, the historical
accuracy criterion which is considered more important in the teacherÊs opinion, is
weighted three times (3) more heavily. This means, learners can receive 3, 6 or 9
points (that is 1  3, 2  3 or 3  3) for the level of accuracy in their projects.
In the example, „lots of historical inaccuracies‰, „can tell with difficulty where the
information came from‰ and „all relevant information is included‰ are descriptors.
The descriptors help the teacher to be more precise and able to consistently
distinguish between learnersÊ works. However, it is not easy to write good
descriptors for each level and each criterion.
6.4.2 Rating Scales

Rating scales allow teachers to indicate the degree or frequency of the behaviour,
skills and strategies displayed by the learner. Rating scales state the criteria and
provide three or four response selections to describe the quality or frequency of
learnersÊ work.
Teachers can use rating scales to record observations and learners can use them as
self-assessment tools. Teaching learners to use descriptive words such as always,
usually, sometimes and never, helps them pinpoint specific strengths and needs.
Rating scales also provide learners with information for setting goals and
improving performance.
Rating scales list performance statements in one column and the range of
accomplishments in descriptive words, with or without numbers, in other
columns.

The descriptive word is more important than the corresponding number. The more
precise and descriptive the words for each scale point, the more reliable is the tool.
Effective rating scales use descriptors with clearly understood measures such as
frequency. Scales that rely on subjective descriptors for quality such as fair, good
or excellent, are less effective because the single adjective does not contain enough
information on what criteria are indicated at each of these points on the scale.
For example, the performance statement which describes a behaviour or quality as

1 = poor through to 5 = excellent is better than 1 = lowest through to 5 = highest or
simply 1 through 5. The range of numbers should be the same for all rows within
a section (such as all being from 1 to 5).
The range of numbers should always increase or always decrease. For example, if
the last number is the highest achievement in one section, the last number should
also be the highest achievement in all the other sections as well.
Figure 6.1 is an example of the rating scale used for interpersonal skills assessment.
Figure 6.1: Interpersonal Skills Assessment ă Rating Scale

Source: http://www.northernc.on.ca/leid/docs/ja_developchecklists.pdf

6.4.3 Checklists
Checklists usually offer a „yes‰ or „no‰ format in relation to learners
demonstrating specific criteria. An assessment checklist takes each achievement
objective and turns it into one or more „learner can do‰ statements. Checklists may
be used to record observations of an individual, a group or a whole class.
The following are some of the characteristics of checklists. Checklists should:

(a) Have criteria for success based on expected outcomes;
(b) Be short enough for practicality (for example, one sheet of paper);
(c) Highlight critical tasks;
(d) Have sign-off points that prevent learners from proceeding without
approval, if needed;
(e) Be written in clear, detailed wording to minimise the risk of
misinterpretation; and
(f) Have space for the inclusion of other information such as the learnerÊs name,
date, course, examiner and overall result.
Table 6.4 shows an example of a checklist.
Table 6.4: Example of a Checklist
Name:
Date:
Class
Achievement
Items Yes No Comments
Objective
Communicate Can understand the numbers
about numbers 1ă100 through listening
Can say the numbers 1ă100
Can count 1ă100
Can write the numbers 1ă100
Can understand numbers 1ă100
when written in words
Can write numbers 1ă100 in
words

SELF-CHECK 6.4
1. Define authentic assessment.
2. Compare authentic and traditional assessments.
3. What are the similarities and differences between rubrics, checklists

and rating scales?
4. Is authentic assessment a good replacement for traditional

assessment?
5. Give an example of authentic assessment for the subject you are

teaching.
 The strategy of asking learners to perform real-world tasks that demonstrate

meaningful application of essential knowledge and skills is called authentic
assessment.
 Authentic assessment is sometimes called performance assessment, alternative

assessment or direct assessment.
Alternative assessment Alternative assessment

Backward design Backward design
Checklists Checklists
Contrived to real-life Contrived to real-life
Direct evidences Direct evidences
Direct assessment Direct assessment

Topic  Project and
Portfolio
7 Assessments
LEARNING OUTCOMES
1. Describe the characteristics of project work;
2. Explain the procedures of using project as an assessment technique;
3. Explain the role of portfolio assessment as an alternative assessment
strategy;
4. Describe the characteristics of portfolios; and
5. Explain how a portfolio is developed.
 INTRODUCTION
Besides objective and essay tests, there are other types of assessments which can
be collectively categorised as performance-based assessment. According to the
definition provided by the Standards (AERA et al., cited in Reynold, Livingston &
Willson, 2006), performance assessments require test takers to complete a task in a
context or setting that closely resembles real-life situation. For example, to assess
oral communication skills, the assessment might require test takers to participate
in a group dialogue session. Likewise, to assess the teaching ability of teacher
trainees, the trainees might be required to conduct a lesson with a group of
learners. The emphasis of performance-based assessment is thus on doing, not
merely knowing, on process as well as product. In this context, an essay test that
is used to assess writing skills in language learning can be considered as a type of
performance assessment. Since essay questions have been discussed earlier, we

126  TOPIC 7 PROJECT AND PORTFOLIO ASSESSMENTS
will focus on other types of such assessments in this topic, namely projects and
portfolios. Project assessments which require learners to perform practical tasks to
organise or create something are fast becoming a common practice in school-based
assessments. Portfolios, which are considered a specific form of performance
assessment that is useful in assessing learner learning over time involve the
systematic collection of a learnerÊs work produced over a specified period of time
according to a specific set of guidelines (Reynold, Livingston & Willson, 2006).
7.1 PROJECT ASSESSMENT

Most of us have done some form of project work at school or university and know
what a project is. However, when asked to define it, one will see varying
interpretations of the project and its purpose. Projects can represent a range of
tasks that can be done at home or in the classroom, by parents or groups of
learners, quickly or over time.
Technically there are differences between project and project-based learning (PBL).
While PBL also features projects, the focus is more on the process of learning and
learner-peer-content interaction than the end-product itself. PBL closely resembles
work done in the real world. The scenario or simulation is real whereas projects
are usually based on events that have already been resolved.
ACTIVITY 7.1
Go to this website http://www.friedtechnology.com/stuff
Look at the charts prepared by Amy Mayer for differences between

projects and project-based learning. Discuss your findings with your
coursemates.
A project is an activity in which time constraints have been largely removed and
can be undertaken individually or by a group and usually involves a significant
element of work being done at home or out of school (Firth & Mackintosh, 1987).
Project work has its roots in the constructivist approach which evolved from the
work of psychologists and educators such as Lev Vygotsky, Jerome Bruner, Jean
Piaget and John Dewey. Constructivism views learning as the result of mental
construction whereby learners learn by constructing new ideas or concepts based
on their current and previous knowledge.

TOPIC 7 PROJECT AND PORTFOLIO ASSESSMENTS  127
Most projects have certain common defining features (Katz & Chard, 1989) such
as:
(a) Learner-centred;
(b) A definite beginning, middle and end;
(c) Content is meaningful to students and directly observable in their

environment;
(d) Real-world problems;
(e) First-hand investigations;
(f) Sensitivity to local culture and culturally appropriate;
(g) Specific goals related to curriculum;
(h) A tangible product that can be shared with the intended audience;
(i) Connections among school, life and work skills;
(j) Opportunities for reflective thinking and learner self-assessment; and
(k) Multiple types and authentic assessments (portfolios, journals, rubrics and
others).
In project work, the whole work process is as important as the final result or
product. Work process refers to learners choosing a knowledge area, delimiting it
and formulating a problem or putting forward questions. It also involves learners
investigating and describing what is required to solve a given problem or answer
a specific question through further work, collection of materials and knowledge.
Project work is planned so that it can be carried out within the available time.
Preferably, the task should be drawn from knowledge areas in the current
curriculum. Project work is an integrated learning experience that encourages
learners to break away from the compartmentalisation of knowledge and instead
involves drawing upon different aspects of knowledge. For example, making an
object not only requires handicraft skills but also knowledge of materials, working
methods and uses of the object. Technological support will also enhance learnersÊ
learning. Thinking skills are integral to project work.
Similarly, writing the project report requires writing skills learned in the language
classroom and applying it when analysing and drawing conclusions for a science
project. Generally, there are two types of projects, namely research-based and
product-based projects.

(a) Research-based project is more theoretical in nature and may consist of

putting forth a question, formulating a problem or setting up some
hypotheses. In order to answer the question, solve the problem or confirm
the assumptions, information must be found, evaluated and used. This
information can either be the result of learnersÊ own investigations or
obtained from public sources without being a pure reproduction. Such
project work is usually presented as a research report.
(b) Product-based project would be the production of a concrete object, a service,

a dance performance, a film, an exhibition, a play, a computer programme
and so forth.
In project work, learners actively involve themselves as problem solvers, decision

makers, investigators, documenters and researchers. Project work provides an
opportunity for learners to explore different approaches to solving problems. In
project work a teacher follows, discusses and assesses the work in all of the
different phases. The teacher is the learner's supervisor. If the work involves a
group, the input from different learners should come from within their respective
programme goals. Most importantly, learners should find projects fun, motivating
and challenging because they play an active role in the entire process. The selection
of a project may be determined by the teacher or the choice may be left to the
learners, probably with the approval of the teacher (Katz, 1994). What is significant
is that learners take ownership of their project.
There are many types of effective projects. The following are just some project
ideas:
(a) Survey of historical buildings in the learnerÊs community;
(b) Study of the economic activities of people in the local community;
(c) Study of the transportation system in the district;
(d) Recreate a historical event;
(e) Develop a newsletter, blog or website on a specific issue relevant to the

school or community (such as school safety, recycling and how businesses
can save energy and reduce waste);
(f) Compile oral histories of the local area by interviewing community elders;
(g) Produce a website as a "virtual tour" of the history of the community;
(h) Create a document on CD of learners graduating from local primary or

secondary school;

(i) Create a wildlife or botanical guide for a local wildlife area;
(j) Create an exhibition on local products, local history or local personalities

using audiotapes, videotapes and photographs; and
(k) Investigate pollution of local rivers, lakes and ponds.
The possibilities for projects are endless. The key ingredient for any project idea is
that it is learner-driven, challenging and meaningful. It is important to realise that
project-based instruction complements the structured curriculum. Project-based
instruction builds on and enhances what learners learn through systematic
instruction. Teachers do not let learners become the sole decision makers about
what project to work on, nor do teachers sit back and wait for the learners to figure
out how to go about the process which can be very challenging (Bryson, 1994). This
is where the teacherÊs ability to facilitate and act as a coach plays an important part
in the success of a project. The teacher would have brainstormed ideas with the
learner to generate project possibilities, discuss possibilities and options, help the
learner form a guiding question and be ready to help the learner throughout the
implementation process such as setting guidelines, due dates, resource selection
and so forth (Bryson, 1994; Rankin, 1993).
SELF-CHECK 7.1
1. What are the main differences between a project and project-based
learning?
2. State the differences between a research-based project and a

product-based project.
3. Give examples of the two types of projects in your subject area.
7.1.1 Assessing Learners with Projects

Project-oriented work is becoming increasingly common in work life. Project
competence, the ability to work together with others, taking personal initiatives
and having entrepreneurship skills are often required by employers. These
competences can be developed during project work and thus prepare learners for
work life. Project work makes schooling more like the real world. In real life, we
seldom spend several hours listening to authorities who know more than we do
telling us exactly what to do and how to do it. We ask questions of the person we
are learning from. We try to link what the person is telling us with what we already
know. We bring our experiences and what we already know that is relevant to the
issue or task and say or do something about it.
You can see this with a class of young learners. When the teacher tells a story, little
kindergarten children raise their hands eager to share their experiences about
something related to the story. They want to be able to apply their natural
tendencies to the learning process. This is how life is much of the time! By giving
project work, we open up areas in schooling where learners can speak about what
they already know.
Project work is a learning experience which enables the development of certain

knowledge, skills and attitudes which prepares learners for lifelong learning and
the challenges ahead (see Table 7.1). These objectives may not be achieved by
current instructional strategies.
Table 7.1: Knowledge, Skills and Attitudes Achieved with Projects
Domain Learning Outcome

Knowledge application Learners will be able to make connections across different
(Apply creative and areas of knowledge and to generate, develop and evaluate
critical thinking skills) ideas and information so as to apply these skills to the
project task.
 Be able to choose a knowledge area and delimit a task or
a problem.
 Be able to choose relevant materials, methods as well as
relevant tools.
 Be held accountable to draw up a project plan and where
necessary, to revise it.
Communication Learners will acquire the skills to communicate effectively
(Improve oral and by presenting their ideas clearly and coherently to specific
written communication audiences in written and oral forms.
skills)  Be able to discuss with their supervising teacher how
their work is developing.
 Be able to provide a written report of the project
describing the progress of the work from initial idea to
final product.
 Be able to produce a final product which means an
independent solution to the task or problem chosen.

Collaboration Learners will acquire collaborative skills through working

(Foster collaborative in a team to achieve common goals.
learning skills)
Independent learning Learners will be able to learn on their own, reflect on their
(Develop self-directed learning and take appropriate actions to improve.
inquiry and life-long  Be able to use a logbook to document the progress of
learning skills) their work and regularly report the process.
 Be able to assess either in writing or verbally their work
process and results.
Source: Blank and Harwell (1997)
Project work helps learners by:
(a) Developing the skills required of planning, structuring and taking

responsibility for a larger piece of work and providing experience for
working in „project form‰;
(b) Deepening the knowledge within a subject or between subjects;
(c) Providing learners with opportunities to explore the interrelationships and

interconnectedness of topics within a subject and between subjects; and
(d) Encouraging learners to synthesise knowledge from various areas of

learning while critically and creatively applying it to real-life situations.
Hence, it is important that learners are assigned to carry our authentic
projects in which learners plan, implement and report on projects that have
real-world applications beyond the classroom (Blank & Harwell, 1997).
SELF-CHECK 7.2
1. What knowledge, skills and attitudes are evaluated when learners

are given a project to work on?
2. To what extent has project work been used as an assessment

strategy in Malaysian schools?

7.1.2 Designing Effective Projects

There are many types of projects and there is no one correct way to design and
implement a project. However, there are some questions and issues to consider
when designing effective projects. It is very important for everyone involved to be
clear about the goals of the project. You will be surprised that many teachers are
not sure why they use projects to assess their learners. Teachers should develop an
outline to explain the projectÊs essential elements and expectations for each project.
Although the outline can take various forms, it should contain the following
elements (Bottoms & Webb, 1998):
(a) Situation or Problem

The outline should contain a sentence or two describing the issue or problem
that the project is trying to address. For example, the pollution levels in
rivers, transportation problems in urban centres, the increasing price of
essential items, crime rate in squatter areas, youths loitering in shopping
complexes and students in Internet cafes during school hours.
(b) Project Description and Purpose

This refers to a concise explanation of the projectÊs ultimate purpose and how
it addresses the situation or problem. For example, learners will research,
conduct surveys and make recommendations on how to reduce the pollution
of rivers. Results will be presented in a newsletter, information brochure,
exhibition or website.
(c) Performance Specifications

A list of criteria or quality standards the project must meet.
(d) Rules
Guidelines for carrying out the project include a timeline and short-term
goals such as having interviews completed by a certain date and specifying
the completion date of the project.
(e) List of Project Participants with Roles Assigned

Identify the roles of team members and members of the community if they
are involved.
(f) Assessment
How the learnerÊs performance will be evaluated. In project work, the
learning process is being evaluated as well as the final product.

Before designing the project, identify the learning goals and objectives. What
specific skills or concepts will learners learn? Herman, Aschbacher and Winters
(1992) have identified five questions to consider when determining learning goals:
(a) What important cognitive skills do I want my learners to develop? (For

example, to use algebra to solve everyday problems, to write persuasively);
(b) What social and affective skills do I want my learners to develop? (For
example, to develop teamwork skills);
(c) What metacognitive skills do I want my learners to develop? (For example,

to reflect on the research process, to evaluate its effectiveness, to determine
methods of improvement);
(d) What types of problems do I want my learners to be able to solve? (For

example, to know how to conduct research, to apply the scientific method);
and
(e) What concepts and principles do I want my learners to be able to apply? (For
example, to apply basic principles of biology and geography in their lives, to
understand cause-and-effect in relationships)
Steinberg (1998) provides a checklist for the design of effective projects (see
Table 7.2). The checklist can be used throughout the process to help both the
teacher and learner to plan and develop a project as well to assess whether the
project was successful in meeting instructional goals.
Table 7.2: The Six AÊs of Effective Project Checklist
Six AÊs Questions Checklist

Authenticity  Does the project stem from a problem or question that is meaningful
to the learner?
 Is the project similar to one undertaken by an adult in the community
or workplace?
 Does the project give the learner the opportunity to produce
something that has value or meaning to the learner beyond the school
setting?
Academic  Does the project enable the learner to acquire and apply knowledge
rigor central to one or more discipline areas?
 Does the project challenge the learner to use methods of inquiry from
one or more disciplines (for example, to think like a scientist)?
 Does the learner develop higher-order thinking skills (for example,
searching for evidences or using different perspectives)?

Applied  Does the learner solve a problem that is grounded in real life and/or
learning work (for example, design a project or organise an event)?
 Does the learner need to acquire and use skills expected in high-
performance work environments (for example, teamwork, problem-
solving, communication or technology)?
 Does the project require the learner to develop organisational and
self-management skills?
Active  Does the learner spend significant amounts of time doing field work,
exploration outside school?
 Does the project require the learner to engage in real investigative
work, using a variety of methods, media and sources?
 Is the learner expected to explain what he or she has learned through
a presentation or performance?
Adult  Does the learner meet and observe adults with relevant experience
relationships and expertise?
 Is the learner able to work closely with at least one adult?
 Do adults and the learner collaborate on the design and assessment
of the project?
Assessment  Does the learner reflect regularly on his or her learning, using clear
practices project criteria that he or she has helped to set?
 Do adults from outside the community help the learner develop a
sense of the real world standards from this type of work?
 Is the learnerÊs work regularly assessed through a variety of methods
including portfolios and exhibitions?
Source: Adaptation of Steinberg (1998) Real learning, real work: School-to-work as high
school reform
Besides those mentioned, it is also important to ensure the following:
(a) Do the learners have easy access to the resources they need? This is especially
important if a learner is using specific technology or subject-matter expertise
from the community;
(b) Do the learners know how to use the resources? Learners who have minimal
experience with computers, for example, may need extra assistance in
utilising them;

(c) Do the learners have mentors or coaches to support them in their work? This
can be in-school or out-of-school mentors; and
(d) Are learners clear on the roles and responsibilities of each person in the
group?
SELF-CHECK 7.3
1. What are some of the factors you should consider when designing
project work for learners in your subject area?
2. Give examples of projects you have included or can include in the

teaching and evaluation of your subject area.
7.1.3 Possible Problems with Project Work

Teachers intending to use projects both as an instructional and assessment tool
should be aware of certain problem areas. They have to be as specific as possible
in determining the outcomes so that both the learner and the teacher understand
exactly what is to be learned. Thomas (1998) identified the following problem areas
when undertaking project-based instruction:
(a) Aligning project goals with curriculum goals can be difficult. To make
matters worse, parents are not always supportive of projects when they
cannot see how it relates to the overall assessment of learning;
(b) Projects can often take longer than expected and teachers need a lot of time
to prepare good authentic projects;
(c) Learners are not clear as to what is required. There is need for adequate
structure, guidelines and guidance on how to carry out projects;
(d) Intensive staff development is required. This is because teachers are not
traditionally prepared to integrate content into real-world activities;
(e) The resources needed for project work may not be readily available and there
might be a lack of administrative support; and
(f) Some teachers may not be familiar with how they should assess the projects.

7.1.4 Group Work in Projects

Working in groups has become an accepted part of learning due to the widely
recognised benefits of collaborative group work for learner learning. When groups
work well, learners learn more and produce higher quality learning outcomes.
What are some benefits of group work in projects? Let us read the following:
(a) Peer Learning Can Improve the Overall Quality of Learner Learning
Group work enhances learner understanding. Learners learn from each other
and benefit from activities that require them to articulate and test their
knowledge. Group work provides an opportunity for learners to clarify and
refine their understanding of concepts through discussions and rehearsals
with peers. Many, but not all, learners recognise the value to their personal
development in group work and of being assessed as a member of the group.
Working with a group and for the benefit of the group also motivates some
learners. Group assessment helps some learners develop a sense of
responsibility: „I felt that because one is working in a group, it is not possible
to slack off or to put things off. I have to keep working otherwise I would be
letting other people down‰.
(b) Group Work Can Help Develop Specific Generic Skills Sought by Employers
As a direct response to the objective of preparing graduates with the capacity
to function successfully as team members in the workplace, there has been a
trend in recent years to incorporate generic skills alongside traditional
subject-specific knowledge in the expected learning outcomes in higher
education. Group work can facilitate the development of skills which
include:
(i) Teamwork skills (skills in working within team dynamics; leadership

skills);
(ii) Analytical and cognitive skills (analysing task requirements;

questioning; critically interpreting materials; evaluating the work of
others);

(iii) Collaborative skills (conflict management and resolution; accepting

intellectual criticism; flexibility; negotiation and compromise); and
(iv) Organisational and time management skills: „Having to do group

work has changed the way I work. I could not do it all the night before.
I had to be more organised and efficient‰.
(c) Group Work May Reduce the Workload Involved in Assessing, Grading and
Providing Feedback to Learners
Group work and group assessment in particular, is sometimes implemented
in the hope of streamlining assessment and grading tasks. In simple terms, if
learners submit group assignments the number of pieces of work to be
assessed can be vastly reduced. This prospect might be particularly attractive
for staff teaching large classes.
SELF-CHECK 7.4
1. What are some project problems in the implementation of project
work and how would you solve them?
2. What are the benefits of group work in projects?
7.1.5 Assessing Project Work

Assessing learner performance on project work is quite different from an
examination using objective tests and essay questions. It is possible that learners
might be working on different projects, for instance, some may be working in
groups while others may be working individually. This makes the task of assessing
learner progress even more complex compared to a paper and pencil test where
everyone is evaluated using one marking scheme.

Table 7.3 could give you some ideas on how to assess and give marks for a project
work.
Table 7.3: General Marking Scheme for Projects
Marks Criteria
90ă Exceptional and distinguished work of a professional standard.
100% Outstanding technical and expressive skills.
Work demonstrating exceptional creativity and imagination.
Work displaying great flair and originality.
80ă89% Excellent and highly developed work of a professional standard.
Extremely good technical and expressive skills.
Work demonstrating a high level of creativity and imagination.
Work displaying flair and originality.
70ă79% Very good work which approaches professional standard.
Very good technical and expressive skills.
Work demonstrating good creativity and imagination.
Work displaying originality.
60ă69% A good standard of work.
Good technical and expressive skills.
Work displaying creativity and imagination.
Work displaying some originality.
50ă59% A reasonable standard of work.
Adequate technical and expressive skills.
Work displaying competence in the criteria assessed but which may be
lacking some creativity or originality.
40ă49% Limited but adequate standard of work.
Limited technical and expressive skills.
Work displaying some weaknesses in the criteria assessed and lacking
creativity or originality.
30ă39% Limited work which fails to meet the required standard.
Weak technical and expressive skills.
Work displaying significant weaknesses in the criteria assessed.
20ă29% Poor work. Unsatisfactory technical or expressive skills.
Work displaying significant or fundamental weaknesses in the criteria
assessed.

10ă19% Very poor work or work where very little attempt has been made.
A lack of technical or expressive skills.
Work displaying fundamental weaknesses in the criteria assessed.
1ă9% Extremely poor work or work where no serious attempt has been made.
Source: Chard (1992)
When assessing a project work, you need to be clear of what to assess. Is it the
product, the process or both? According, to Bonthron and Gordon (1999), from the
onset you should be clear:
(a) Whether you are going to assess the product of the group work or both
product and process.
(b) If you intend to assess the process, what proportion are you going to allocate
for process and what criteria to use and how are you going to assess the
process?
(c) What criteria are you planning to use to assess the project work and how will
the marks be distributed?
Some educators believe there is a need to assess the processes within groups as
well as the products or outcomes. What exactly does „process‰ mean? Both
teachers and learners must have a clear understanding of what the process means.
For example, if you want to assess „'the level of interaction‰ among learners in the
group, they should know what „high‰ or „low‰ interaction means. Should the
teacher involve himself in the workings of each group or rely on self or peer
assessment? Obviously, being involved in many groups would be physically
impossible for the teacher. As a result, some educators may say, „I don't care what
they do in their groups. All I'm interested in is the final product and how they
arrive at their results is their business‰. However, to provide a more balanced
assessment, there is growing interest in both the process and product of group
work and the issue that arises is „What proportion of assessment should focus on
product and what proportion should focus on process?‰
The criteria for the evaluation of group work can be determined by the teacher
alone or both by the teacher and learners through consultation between the two
parties. Group members can be consulted on what should be assessed in a project
through consultation with the teacher. Obviously, you have to be clear about the
intended learning outcomes of the project in your subject area. It is a useful starting
point for determining criteria for assessment of the project. Once these broader
learning outcomes are understood, you can establish the criteria for marking the
project. Generally, it is easier to establish criteria for measuring the „product‰ of
project work and much more difficult to measure the „processes‰ involved in

project work. However, it is suggested that evaluation of product and process be

done separately rather than attempting to do both simultaneously. We will discuss
how the processes in project work may be evaluated later.
Another important point to note is that we need to be clear who gets the marks ă
individuals or the group as a whole? Most projects involve more than one learner
and the benefits of group work have been discussed earlier. A major problem of
evaluating projects involving group work is how to allocate marks fairly among
group members. The following questions are those mentioned by learners: „I
would like my teacher to tell me what amount of work and effort will get enable
me to obtain a certain mark‰, „Do all learners obtain the same mark even though
not all learners put in the same effort?‰ and „Are marks given based on individual
contributions of team members?‰
These are questions that trouble teachers especially when it is common to find
freeloaders or sleeping partners in group projects. The following are some
suggestions how group work may be assessed:
(a) Method 1: Shared Group Mark

All group members receive the same mark for the work submitted regardless
of individual contribution. It is a straightforward method that encourages
group work where group members sink or swim together. However, it may
be perceived as unfair by better learners who may complain that they are
unfairly disadvantaged by weaker learners and the likelihood of „sleeping
partners‰ is very high.
(b) Method 2: Share-out Marks

The learners in the group decide how the total marks should be shared
between them. For example, a score of 40 is given by the teacher for the
project submitted. There are 5 members in the group and so the total score
possible is 5 x 40 = 200. The learners then share the 200 marks based on the
contribution of each of the 4 learners, which may be 35, 45, 42, 38 and 40. This
is an effective method if group members are fair, honest and do not have ill-
feelings towards each other. However, there is the likelihood for the marks
to be equally distributed to avoid ill-feelings among group members.
(c) Method 3: Individual Mark

Each learner in the group submits an individual report based on the task
allocated or on the whole project.
(i) Allocated Task

From the beginning, the project is divided into different parts or tasks
and each learner in the group completes his allocated task which
contributes to the final group product and gets the marks for that task.

This method is a relatively objective way of ensuring individual

participation and may motivate learners to work hard on their task or
part. The problem is breaking up the project into tasks that are exactly
equal in size or complexity. Moreover, the method may not encourage
group collaboration and some members may slow down the progress.
(ii) Individual Report

Each learner writes and submits an individual report based on the
whole project. The method ensures individual effort and may be
perceived as fair by learners. However, it is difficult to determine how
the individual reports should differ and learners may unintentionally
plagiarise.
(d) Method 4: Individual Mark (Examination)

Examination questions specifically target the group projects and can only be
answered by learners who have been thoroughly involved in the project. This
method may motivate learners to learn more from the group project
including learning from the other members of the group. However, it may
not be effective because learners may be able to answer the questions by
reading the group report. In the Malaysian context, national examinations
may not be able to include such questions as it involves hundreds of
thousands of learners.
(e) Method 5: Combination of Group Average and Individual Mark

The group mark is awarded to each member with a mechanism for adjusting
for individual contributions. This method may be perceived to be fairer than
shared group mark. But it means additional challenge for teachers when
trying to establish individual contributions.
SELF-CHECK 7.5
Which of the five methods of assessing group work would you use in
assessing project work in your subject area? Give reasons for your choice.
7.1.6 Assessing Process in a Project

The assessment of the groupÊs end product is rarely the only assessment taking
place in group projects. The process of group work is increasingly recognised as
an important element in the assessment of group work. And where group work is
marked solely on the basis of product, and not process, there can be differences in
individual grading that are deemed unfair and unacceptable.

Let us look at other aspects of assessing a project.
(a) Peer or Self-evaluation of Roles

Learners rate themselves as well as other group members on specific criteria
such as responsibilities, contributing ideas and finishing tasks. This can be
done through various grading forms (see Figure 7.1) or having learners write
a brief essay on the groupÊs or membersÊ strengths and weaknesses.
(b) Individual Journals

Learners keep a journal of events that occur in each group meeting. These
include who attended the meeting, what was discussed and plans for future
meetings. These can be collected and periodically read by the instructor who
comments on the progress. The instructor can provide guidance for the
group without directing them.
(c) Minutes of Group Meetings

Similar to journals are minutes of the group meetings which are periodically
read by the instructor. These include who attended, tasks completed, tasks
planned and contributors to various tasks. This provides the instructor with
a method of monitoring individual contributions within the group.
(d) Group and Individual Contribution Grades

Instructors can divide the project grade into percentages of individual and
group contributions. This is especially beneficial if peer and self-evaluations
are used (see Figure 7.1).
Having a logbook can potentially provide plenty of information to form the basis
of assessment while keeping minutes helps members to focus on the process which
is a learning experience in itself. These techniques may be perceived as a fair way
to deal with freeloaders and outstanding contributors. However, reviewing logs
can be time consuming for teachers or instructors and learners may need a lot of
training and experience in order to keep the records. In addition, emphasis on
second hand evidence may not be reliable.


Figure 7.1: Checklist for evaluating processes involved in project work

Source: Developed by Maggie Sutherland for the Biotechnology Academy at Andrew P.
Hill High School © 2003 East Side Union High School District, San Jose, California
Self-assessment in Project Work

Self-assessment is a process by which learners learn about themselves, for
example, what they have learned about the project, how they have learned and
how they reacted in certain situations when carrying out the project. Involving
learners in the assessment process is an essential part of the balanced assessment.
When learners become partners in the learning process they gain a better sense of
themselves as readers, writers and thinkers. Some teachers may be uncomfortable
with self-assessment because traditionally, teachers are responsible for all forms of
assessments in the classroom and here we are asking learners to assess themselves.

Self-assessment can take many forms:
(a) Discussions involving the whole class or small groups;
(b) Reflection logs;
(c) Self-assessment checklist or inventories; and
(d) Teacher-learner interviews.
These types of self-assessments share a common theme, which requires learners to

review their work to determine what they have learned and areas of confusion that
may still exist. Although each method may differ slightly, they all include enough
time for learners to consider thoughtfully and evaluate their own progress.
Because project learning is learner-driven, assessment should be learner-driven as

well. Learners can keep journals and logs to continually assess their progress. A
final reflective essay or log can allow learners and teachers to understand the
thinking processes, reasons behind the decisions, ability to arrive at conclusions
and communicate what they have learned.
According to Edwards (2000), the following are some questions that a learner can
ask himself or herself while conducting self-assessment:
(a) What were the projectÊs successes?
(b) What could I have done to improve the project?
(c) How well did I meet my learning goals? What was most difficult about
meeting the goals?
(d) What surprised me most about working on the project?
(e) What was my groupÊs best team effort? Worst team effort?
(f) How do I think other people involved with the project felt about the progress
and end product of the project?
(g) What were the skills which I used during this project? How can I engage
these skills in the future?

SELF-CHECK 7.6
1. Explain how process can be measured in group project work.
2. List some of the problems with assessment of process.
3. Do you think process should be assessed? Explain.
7.2 PORTFOLIO ASSESSMENT

Increasingly, portfolio assessment is gaining importance as an assessment strategy
seeking to present a more holistic view of the learner. Portfolios tend to be
associated with art, where the learner keeps his pieces of work in a type of folder
to be presented for evaluation. Some people may associate portfolios with the stock
market where a person or organisation keeps a portfolio of stocks and shares
owned. Hart (1994) defines a portfolio as a container that holds evidence of an
individualÊs skills, ideas, interests and accomplishments (Figure 7.2).
Figure 7.2: Examples of Portfolios
Learner portfolios may take many forms. It is not easy to describe them. A portfolio
is not the pile of learner work that accumulates over a semester or year. Rather a
portfolio is a purposeful collection of the works produced by learners which
reflects their efforts, progress and achievements in different areas of the
curriculum. According to Paulson, Paulson and Meyer (1991), „portfolios offer a
way of assessing learner learning that is different from traditional methods.
Portfolio assessment provides the teachers an opportunity to observe learners in a
broader context which involves taking risks, developing creative solutions and
learning to make judgements about their performances‰.

7.2.1 What is Portfolio Assessment?

The collection of works by learners are assessed and hence the term portfolio
assessment. The portfolio provides for continuous and ongoing assessment
(formative assessment) as well as assessment at the end of a semester or a year
(summative assessment). Emphasis is more on monitoring learnersÊ progress
towards achieving the learning outcomes of a particular subject, course or
programme.
Portfolio assessment has been described as multidimensional because it allows

learners to include different aspects of their work such as essays, project reports,
performance on objective tests, objects or artefacts they have produced, poems,
laboratory reports and so forth. In other words, the portfolio contains samples of
work over an entire semester, term or year, rather than single points in time (such
as during examination week). Teachers should convey to learners the purpose of
the portfolio, what constitutes quality work and how the portfolio is graded.
Portfolio assessment represents a significant shift in thinking about the role of

assessment in education. Teachers who use this strategy in the classroom have
shifted their philosophy of assessment from merely comparing achievements
based on grades, test scores and percentile rankings to improving learner
achievements through feedback and self-reflection.
7.2.2 Rationale for Portfolio Assessment

In has been frequently suggested that paper and pencil tests (objective and essay
tests) are not able to assess all the learning outcomes in a particular subject area.
For example, many higher-level cognitive skills and the affective domain (feelings,
emotion, attitudes and values) are not adequately assessed using traditional
assessment methods. Portfolio assessment allows for the evaluation of a wider
range of skills and understanding and most importantly, it provides an
opportunity for the teacher to track or monitor change and growth over a period
of time. Since portfolio assessment is an ongoing process, it provides an
opportunity for learners to reflect on their own learning and thinking. They have
an opportunity to monitor their understanding and approaches to problem-
solving and decision-making (Paulson, Paulson & Meyer, 1991). Upon reflection,
learners can identify where they went wrong or how they can further improve
themselves.

Epstein (2006) in Introduction to Portfolios, Synapse Learning Design, cited in

Teachervision.com, argues that portfolio assessment:
(a) Allows the teacher to view the learner as an individual, each with his own
unique characteristics, needs and strengths;
(b) Emphasises improving learner achievement rather than ranking learners

according to their performance based on tests;
(c) Helps learners to be more accountable for their work;
(d) Allows the adaptation of instruction to the learning styles of learners;
(e) Involves learners in the assessment process; and
(f) Invites learners to reflect upon their growth and performance as learners.
However, Epstein (2006) also mentioned some of the problems with portfolio
assessments. Portfolio assessments may be less reliable because they tend to be
more qualitative rather than quantitative in nature. Society is still strongly oriented
towards grades and test scores. In addition, most universities and colleges still use
test scores and grades as the main admission criteria. Moreover, portfolio
assessment may be time consuming for teachers and data from portfolio
assessments can be difficult to analyse.
7.2.3 Types of Portfolios

There are two main types of portfolios, namely process-oriented and product-
oriented portfolios.
(a) Process-oriented Portfolios

These portfolios tell a story about the learner and how the learner has grown.
It includes earlier drafts and how these drafts have been improved upon. For
example, the first draft of a poem written by a Year Three learner reworked
based on the comments by the teacher and the learner reflecting on his or her
work. All the drafts and changes made are kept in the portfolio. In this
manner, learnerÊs works can be compared to and it provides evidence with
regard to the how the learnerÊs skills have improved.

(b) Product-oriented Portfolios

These portfolios contain the works of a learner which he considers the best.
The aim is to document and reflect on the quality of the final products rather
than the process that produced them. The learner is required to collect all his
work at the end of the semester, at which time he must select those works
which is of the highest quality. Learners could be left to make the decision
themselves or the teacher can set the criteria on what a portfolio must contain
and the quality of the works to be included.
7.2.4 How to Develop a Portfolio

The design and development of a portfolio involves four main steps (as stated in
Epstein (2006), Introduction to Portfolios, Synapse Learning Design, cited in
Teachervision.com).
(a) Collection
This step simply requires learners to collect and store all of the work.
Learners have to get used to the idea of documenting and saving their work,
something which they may not have done before.
(i) How should the works be organised? According to subjects or themes?
(ii) How should the works be recorded and stored?
(iii) How to get learners to form the habit of documenting the evidence?
(b) Selection
This will depend on whether it is a process or product portfolio and the
criteria set by the teacher. Learners will go through the work collected and
select certain works for their portfolio. This might include examination
papers and quizzes, audio and video recordings, project reports, journals,
computer work, essays, poems, artwork and so forth. In short,
(i) How does one select? What is the basis of selection?
(ii) Who should be involved in the selection process?
(iii) What are the consequences of not completing the portfolio?

(c) Reflection
This is the most important step in the portfolio process. It is the reflection
involved that differentiates the portfolio as being a mere collection of
learnerÊs works. Reflection is often done in writing but it can also be done
orally. Learners are asked why they have chosen a particular product or
work (for example, an essay) and how it compares with other works, what
particular skills and knowledge were used to produce it and how it can be
further improved. In addition,
(i) Learners should reflect on how or why they chose certain works.
(ii) How should learners go about the reflection process?
(d) Connection
As a result of „reflection‰, learners will begin to ask themselves, „Why are
we doing this?‰ They are encouraged to make connections between their
school work and the value of what they are learning. They are also
encouraged to make connections between the works included in their
portfolio with the world outside the classroom. They learn to exhibit what
they have done in school to the happenings and situations in the community.
Issues to consider include:
(i) How is the cumulative effect of the portfolio evaluated?
(ii) Should learners exhibit their works?
7.2.5 Advantages and Disadvantages of Portfolio

Assessment
As a formative assessment tool, learner portfolios can be used by the teachers as
informal diagnostic techniques or feedback. The feedback enables the learners to
reflect on what they are learning and why. Assessment portfolios require learners
to continuously reflect and perform self-evaluations of their work. The advantages
and disadvantages of portfolio assessment can be summarised as follows:
(a) Advantages of Portfolio Assessment
(i) Ongoing assessment is the main benefit of portfolio assessment;
(ii) Promoting learner self-evaluation, reflection and critical thinking;

(iii) Holding learners accountable for mastering content standards in a

subject area;
(iv) Measuring performance based on genuine samples of learnerÊs work;
(v) Providing flexibility in measuring how learners accomplish their

learning goals;
(vi) Promoting communication between teachers and learners. Enabling

teachers and learners to share the responsibility for setting learning
goals and for evaluating progress towards meeting those goals;
(vii) Giving learners the opportunity to have extensive input into the
learning process; and
(viii) Facilitating cooperative learning activities including peer evaluation

and tutoring, cooperative learning groups and peer conferencing.
(b) Disadvantages of Portfolio Assessment

There are people who think that portfolios cannot provide trustworthy
information needed for sound assessment of learner learning. To them the
faults of the portfolio assessments are:
(i) Not standardised;
(ii) Not feasible for large-scale assessment; and
(iii) Potentially biased.
In addition, other disadvantages according to Venn (2000) are:
(i) Requiring extra time to plan an assessment system and conduct the
assessment especially for large groups of learners;
(ii) Gathering all of the necessary data and work samples can make
portfolios bulky and difficult to manage;
(iii) Scoring portfolios involves extensive use of subjective evaluation

procedures such as rating scales and professional judgment which
limits reliability; and
(iv) Scheduling individual portfolio conference is difficult and the length of

each conference may interfere with other instructional activities.

7.2.6 How and When Should a Portfolio be Assessed?

If the purpose of the assessment is to demonstrate progress, the teacher could make
judgments about the evidence of progress and provide those judgments as
feedback to the learner. The learner could also self-assess his progress to check
whether his goals have been met or not.
The portfolio is more than just a collection of learnerÊs work. The teacher may
assess and assign grades to the process of assembling and reflecting upon the
portfolio of a learner's work. The learner might have also included reflections on
growth, strengths and weaknesses, on goals that were or are to be set, on why
certain samples tell certain stories about them or on why the contents reflect
sufficient progress to indicate completion of designated standards. Some of the
process skills may also be part of the teacher's, school's or district's standards. As
such, the portfolio provides some evidence of attainment of those standards. Any
or all of these elements can be evaluated and/or graded.
Portfolio assignments can also be assessed or graded with a rubric. Rubric is useful
in avoiding personal judgment which goes into assessing a complex product such
as a portfolio. Rubric can provide some clarity and consistency in assessing and
judging the quality of the content and the elements that make up that content.
Moreover, application of a rubric increases the likelihood of consistency among
teachers who are assessing the portfolios.

The following portfolio rubric (see Figure 7.3) may be used for self-assessment and
peer feedback.

Figure 7.3: Sample rubric for self-assessment and peer feedback

Source: Adapted from eportfolio (digital portfolio rubric at
https://www2.uwstout.edu/content/profdev/rubrics/eportfoliorubric.html)

SELF-CHECK 7.7
1. To what extent is portfolio assessment used in Malaysian
classrooms?
2. Do you think portfolio assessment can be used as an assessment

technique in your subject area? Explain.
3. What are the strengths and weaknesses of portfolio assessment?
4. What are advantages and disadvantages of portfolio assessment?
 Performance-based assessments require test takers to complete a task in a

context or setting that closely resembles real-life situation.
 A project is an activity in which time constraints has been largely removed and
can be undertaken individually or by a group, and usually involves a
significant element of work being done at home or out of school.
 A research-based project is more theoretical in nature and may consist of

asking a question, formulating a problem or setting up some hypotheses.
 A product-based project would be the production of a concrete object, a

service, a dance performance, a film, an exhibition, a play, a computer
programme and so forth.
 Project work is a learning experience which enables the development of certain

knowledge, skills and attitudes which prepares learners for lifelong learning
and the challenges ahead. It involves knowledge application, collaboration,
communication and independent learning.
 An effective project should contain elements such as a situation or problem,

project description and purpose, performance specifications, rules, roles of
members and assessment.

 The Six AÊs of a project are authenticity, academic rigour, applied learning,
active exploration, adult relationships and assessment practices.
 Working in groups has become an accepted part of learning due to the widely
recognised benefits of collaborative group work for learner learning.
 Various ways for allocating marks to a project work include shared group
marks, shared-out marks, individual mark, individual mark (examination) and
combination of group average and individual mark.
 Self-assessment is a process by which learners learn about themselves, for

example, what they have learned about the project, how they have learned and
how they reacted in certain situations when carrying out the project.
 A portfolio is a purposeful collection of the works produced by learners which

reflects their efforts, progress and achievements in different areas of the
curriculum.
 Teachers need to know the benefits and weaknesses of portfolios in order to

use them to help in learnersÊ learning.
Artefacts Project design

Group work Project idea
Marks allocation Projects
Peer evaluation Research-based project
Performance-based assessments Self-evaluation
Portfolios Shared group mark
Product-based project Shared-out marks

Topic  Test Reliability
and Validity
8
LEARNING OUTCOMES
1. Explain the concept of true score;
2. Compare the different techniques of estimating the reliability of a
test;
3. Compare the different techniques of establishing the validity of a
test; and
4. Discuss the relationship between reliability and validity.
 INTRODUCTION
In this topic we will address two important issues, namely the reliability and
validity of an assessment. How do we ensure that the techniques we use for
assessing the knowledge, skills and values of learners are reliable and valid? We
are making important decisions about the abilities and capabilities of our future
generation, so obviously we want to ensure that we are making the right decisions.
8.1 WHAT IS RELIABILITY?

You have given a geometry test to a group of Form Four learners and one of your
learners, Swee Leong, obtained a score of 66 per cent in the test. How sure are you
that the score is what Swee Leong should actually receive? In other words, is that
his true score? When you develop a test and administer it to your learners, you are
attempting to measure, as far as possible, the true score of your learners. The true
score is a hypothetical concept of the actual ability, competency and capacity of an

158  TOPIC 8 TEST RELIABILITY AND VALIDITY
individual. A test attempts to measure the true score of a person. When measuring
human abilities, it is practically impossible to develop an error-free test. However,
just because there is error, it does not mean that the test is not good. The more
important factor is the size of the error.
Formally, an observed test score, X, is conceived as the sum of a true score, T, and
an error term, E. The true score is defined as the average of test scores if a test is
repeatedly administered to a learner (and the learner can be made to forget the
content of the test in-between repeated administrations). Given that the true score
is defined as the average of the observed scores, in each administration of a test,
the observed score departs from the true score and the difference is called
measurement error. This departure is not caused by blatant mistakes made by test
writers but it is caused by some chance elements in learnersÊ performance during
the test.
Observed Score = True Score + Error
Measurement error mostly comes from the fact that we have only sampled a small
portion of a learnerÊs capabilities. Ambiguous questions and incorrect markings
can contribute to measurement error but it is only a small part of measurement
error. Imagine if there are 10,000 items and a learner can obtain 60 per cent of all
10,000 items administered (which is not practically feasible). Then 60 per cent is
the true score. Now when you sample only 40 items in a test, the expected score
for the learner is 24 items. But the learner may get 20, 26, 30 and so forth depending
on which items are in the test. In this example, this is the main source of
measurement error. That is to say, measurement error is due to the sampling of
items rather than poorly written items.
Error may come from various sources such as within the test takers (the learners),
within the test (questions are not clear), in the administration of the test or even
during scoring (or marking). For example, fatigue, illness, copying or even the
unintentional noticing of another learnerÊs answer all contribute to error from
within the test taker.
Generally, the smaller the error, the greater the likelihood you are closer to
measuring the true score of the learners. If you are confident that your geometry
test (observed score) has a small error, then you can confidently infer that Swee
LeongÊs score of 66 per cent is close to his true score or his actual ability in solving
geometry problems, in other words, what he actually knows. To reduce the error
in a test, you must ensure that your test is reliable and valid. The higher the
reliability and validity of your test, the greater the likelihood you will be
measuring the true score of your learners.

TOPIC 8 TEST RELIABILITY AND VALIDITY  159
We will first examine the reliability of a test. What is reliability? Reliability is the
consistency of the measurement. Would your learners get the same scores if they
took your test on two different occasions? Would they get approximately the same
score if they took two different forms of your test? These questions have to do with
the consistency of your classroom tests in measuring learnersÊ abilities, skills and
attitudes or values. The generic name for consistency is reliability. Reliability is an
essential characteristic of a good test because if a test does not measure consistently
(reliably), then you could not count on the scores resulting from the administration
of the test (Jacobs, 1991).
8.2 THE RELIABILITY COEFFICIENT

Reliability is quantified as a reliability coefficient. The symbol used to denote a
reliability coefficient is r with two identical subscripts (rxx). The reliability
coefficient is generally defined as the variance of the true score divided by the
variance of the observed score. The following is the equation.
Variance of the True Score  2 True Score

rxx   2
Variance of the Observed Score  Observed Score
If there is relatively little error, the ratio of the true score variance to the observed
score variance approaches a reliability coefficient of 1.00 which is a perfect
reliability. If there is relatively large amount of errors, the ratio of the true score
variance to the observed score variance approaches 0.00 which is total
unreliability.
Test with no reliability Test with perfect reliability

0.00 1.00
High reliability means that the questions of a test tended to „pull together‰.
Learners who answered a given question correctly were more likely to answer
other questions correctly. If an equivalent or parallel test were developed by using
similar items, the relative scores of learners would show little change. Low
reliability means that the questions tended to be unrelated to each other in terms
of who answered them correctly. Low reliability means that the questions tended
to be unrelated to each other in terms of who answered them correctly. The
resulting test scores reflect that something is wrong with the items or the testing
situation rather than learnersÊ knowledge of the subject matter. The following
guidelines may be used to interpret reliability coefficients for classroom tests as
shown in Table 8.1.
Table 8.1: Interpretation of Reliability Coefficients
Reliability Interpretation
0.90 and above Excellent reliability (comparable to the best standardised tests).
0.80ă0.90 Very good for a classroom test.
0.70ă0.80 Good for a classroom test but there are probably a few items which
could be improved.
0.60ă0.70 Somewhat low. There are probably some items which could be
removed or improved.
0.50ă0.60 The test needs to be revised.
0.50 and below Questionable reliability and the test should be replaced or in need of a
major revision.
If you know the reliability coefficient of a test, can you estimate the true score of a
learner on a test? In testing, we use the Standard Error of Measurement to estimate
the true score.
The Standard Error of Measurement  Standard Deviation 1  r
Note: „r‰ is the reliability of the test.
Using the normal curve, you can estimate a learnerÊs true score with some degree
of certainty based on the observed score and Standard Error of Measurement.
For Example:
You gave a history test to group of 40 learners. Khairul obtained a score of 75 in
the test, which is his observed score. The standard deviation of your test is 2.0.
Earlier you had established that your history test had a reliability coefficient of 0.7.
You are interested to find out KhairulÊs true score.
The Standard Error of Measurement  Standard Deviation 1  r
 2.0 1  0.7  2.0  0.55  1.1

Therefore, based on the normal distribution curve (refer to Figure 8.1), KhairulÊs
true score should be:
(a) Between 75 ă 1.1 and 75 + 1.1 or between 73.9 and 76.1 for 68% of the time.
(b) Between 75 ă 2.2 and 75 + 2.2 or between 72.8 and 77.2 for 95% of the time.
(c) Between 75 ă 3.3 and 75 + 3.3 or between 71.7 and 78.3 for 99% of the time.
Figure 8.1: Determining KhairulÊs true score based on a normal distribution
SELF-CHECK 8.1
1. Shalin obtains a score of 70 in a biology test. The reliability of the
test is 0.65 and the standard deviation of the test is 1.5. Compute the
true score of Shalin for the biology test.
(int: Use 1 standard error of measurement)
2. Define the reliability of a test.
3. What does the reliability coefficient indicate?
4. Explain the concept of true score.

8.3 METHODS TO ESTIMATE THE RELIABILITY

OF A TEST
Let us now discuss how we estimate the reliability of a test. Figure 8.2 lists three
common methods of estimating the reliability of a test. It is not possible to calculate
reliability exactly and so we have to estimate reliability.
Figure 8.2: Methods for estimating reliability
Let us now look at each method in detail.
(a) Test-retest
Using the Test-retest technique, the same test is administered again to the
same group of learners. The scores obtained in the first administration of the
test are correlated to the scores obtained in the second administration of
the test. If the correlations between the two scores are high, then the test can
be considered to have high reliability. However, a test-retest situation is
somewhat difficult to conduct as it is unlikely that learners will be prepared
to take the same test twice.
There is also the effect of practice and memory that may influence the
correlation. The shorter the time gap, the higher the correlation; the longer
the time gap, the lower the correlation. This is because the two observations
are related over time. Since this correlation is the test-retest estimate of
reliability, you can obtain considerably different estimates depending on the
interval.

(b) Parallel or Equivalent Forms

For this technique, two equivalent tests (or forms) are administered to the
same group of learners. The two tests are not similar but are equivalent. In
other words, they may have different questions but they are measuring the
same knowledge, skills or attitudes. Therefore, you have two sets of scores
which are correlated and reliability can be established. Unlike the test-retest
technique, the parallel or equivalent forms reliability measure is not affected
by the influence of memory. One major problem with this approach is
that you have to be able to generate lots of items that reflect the same
construct. This is often not an easy feat.
(c) Internal Consistency

Internal consistency is determined using only one test administered once to
learners. Internal consistency refers to how the individual items or questions
behave in relation to each other and to the overall test. In effect we judge the
reliability of the instrument by estimating how well the items that reflect the
same construct yield similar results. We are looking at how consistent the
results are for different items for the same construct within the measure.
The following are two common internal consistency measures that can be
used.
(i) Split-half
To solve the problem of having to administer the same test twice,
the split-half technique is used. In the split-half technique, a test is
administered once to groups of learners. The test is divided into two
equal halves after the learners have completed the test. This technique
is most appropriate for tests which include multiple-choice items, true-
false items and perhaps short-answer essays. The items are selected
based on odd-even method whereby one half of the test consists of odd
numbered items while the other half consists of even numbered items.
Then, the scores obtained for the two halves are correlated to determine
the reliability of the whole test using the Spearman-Brown correlation
coefficient.
2rxy
rsb 
1  rxy 

In this formula, rsb is the split-half reliability coefficient and rxy

represents the correlation between the two halves. Say for example, you
have established that the correlation coefficient between the two halves
is 0.65. What is the reliability of the whole test?
2rxy 2  0.65  1.3

rsb     0.78
1  rxy  1  0.65 1.65
(ii) CronbachÊs Alpha

CronbachÊs coefficient alpha can be used for both binary-type
(1 = correct, 0 = incorrect or 1 = true and 0 = false) and scale items
(1 = strongly agree, 2 = agree, 3 = disagree, 4 = strongly disagree).
Reliability is estimated by computing the correlation between the
individual questions and the extent to which individual questions
correlate with the total test. This is meant by internal consistency. The
key is „internal‰. It is unlike the test-retest and parallel or equivalent
form which require another test as an external reference. The stronger
the items are inter-related, the more likely the test is consistent. The
higher the alpha, the more reliable is the test. There is generally no
agreed cut-off. Usually 0.7 and above is acceptable (Nunnally, 1978).
The formula for CronbachÊs alpha is as follows:
 k 

k  i 1
 pi 1  pi  
Cronbachs alpha    1 
k 1  2x 
 
 
 k is the number of items in the test;

 pi refers to item difficulty which is the proportion of learners who
answered the item i correctly; and
  2 x is the sample variance for the total score.

For Example:
Suppose that in a multiple-choice test consisting of five items or
questions, the following difficulty index for each item was observed:
p 1 = 0.4, p 2 ă 0.5, p 3 = 0.6, p4 = 0.75 and p 5 = 0.85. Sample variance
(  2 x ) = 1.84. CronbachÊs alpha would be calculated as follows:
5  1.045 
 1    0.54
5  1  1.840 
Professionally developed standardised tests should have internal

consistency coefficient of at least 0.85. High reliability coefficients are
required for standardised tests because they are administered only
once and the score on that one test is used to draw conclusions about
each learnerÊs level on the constructs measured. Perhaps, the closest to
a standardised test in the Malaysian context would be the tests for
different subjects conducted at the national level in the PT3 and SPM
examinations. According to Wells and Wollack (2003), it is acceptable
for classroom tests to have reliability coefficients of 0.70 and higher
because a learnerÊs score on any one test does not determine the
learnerÊs entire grade in the subject or course. Usually, grades are based
on several other measures such as project work, oral presentations,
practical tests, class participation and so forth. To what extent in this
true in the Malaysian classroom?
A Word of Caution!
When you obtain a low alpha, you should be careful not to immediately conclude
that the test is a bad test. You should check to determine if the test measures several
attributes or dimensions rather than one attribute or dimension. If it does, there is
the likelihood for the Cronbach alpha to be deflated. For example, an Aptitude Test
may measure three attributes or dimensions such as quantitative ability, language
ability and analytical ability. Hence, it is not surprising that the Cronbach alpha
for the whole test may be low as the questions may not correlate with each other.
Why? This is because the items are measuring three different types of human
abilities. The solution is to compute three different Cronbach alphas; one for
quantitative ability, one for language ability and one for analytical ability which
tells you more about the internal consistency of the items in the test.

SELF-CHECK 8.2
1. What is the main advantage of the split-half technique over the test-
retest technique in determining the reliability of a test?
2. Explain the parallel or equivalent forms technique in determining
the reliability of a test.
3. Explain the concept of internal consistency reliability of a test.
8.4 INTER-RATER AND INTRA-RATER

RELIABILITY
Whenever you use humans as part of your measurement procedure, you have to
worry about whether the results you get are reliable or consistent. People are
notorious for their inconsistency. We are easily distracted. We get tired of doing
repetitive tasks. We daydream. We misinterpret. So, how do we determine
whether:
(a) Two observers are being consistent in their observations?
(b) Two examiners are being consistent in their marking of an essay?
(c) Two examiners are being consistent in their marking of a project?
In order to find the answers to these questions, let us read further on inter-rater
reliability and intra-rater reliability.
(a) Inter-rater Reliability

When two or more persons mark essay questions, the extent to which there
is agreement in the marks allotted is called inter-rater reliability (refer to
Figure 8.3). The greater the agreement, the higher is the inter-rater reliability.
Figure 8.3: Examiner A versus Examiner B

Inter-rater reliability can be low because of the following reasons:
(i) Examiners are subconsciously being influenced by knowledge of the

learners whose scripts are being marked;
(ii) Consistency in marking is affected after marking a set of either very

good or very weak scripts;
(iii) When there is an interruption during the marking of a batch of scripts,

different standards may be applied after the break; and
(iv) The marking scheme is poorly developed resulting in examiners

making their own interpretations of the answers.
According to Frith and Macintosh (1987), inter-rater reliability can be

enhanced if the criteria for marking or marking scheme:
(i) Contains suggested answers related to the question;
(ii) Has made provision for acceptable alternative answers;
(iii) Ensures that the time allotted is appropriate for the work required;
(iv) Is sufficiently broken down to allow the marking to be as objective as

possible and the totalling of marks is correct; and
(v) Allocates marks according to the degree of difficulty of the question.
(b) Intra-rater Reliability

While inter-rater reliability involves two or more individuals, intra-rater
reliability is the consistency of grading by a single rater. Scores on a test are
rated by a single rater at different times. When we grade tests at different
times, we may become inconsistent in our grading for various reasons. For
example, some papers that are graded during the day may get our full
attention while others that are graded towards the end of the day may be
very quickly glossed over. Similarly, change in our mood may affect the
grading of papers. In these situations, the lack of consistency can affect intra-
reliability in the grading of learnersÊ answers.

SELF-CHECK 8.3
1. List the steps that may be taken to enhance inter-rater reliability in

the grading of essay answer scripts.
2. Suggest other steps you would take to enhance intra-rater
reliability in the grading of learnersÊ assignments.
8.5 TYPES OF VALIDITY

What is validity? Validity is often defined as the extent to which a test measures
what is was designed to measure (Nutall, 1987). While reliability relates to the
consistency of the test, validity relates to the relevancy of the test. If it does not
measure what it sets out to measure, then its use is misleading and the
interpretation based on the test in not valid or relevant. For example, a test that is
supposed to measure the „spelling ability of 8-year old children‰ but does not do
so is not a valid test. It would be disastrous if you make claims about what a learner
can or cannot do based on a test that is actually measuring something else. It is for
this reason that many educators argue that validity is the most important aspect of
a test. However, validity will vary from test to test depending on what it is used
for. For example, a test may have high validity in testing the recall of facts in
economics but that same test may be low in validity with regards to testing the
application of concepts in economics.
Messick (1989) was most concerned about the inferences a teacher draws from the
test score, the interpretation the teacher makes about his learners and the
consequences from such inferences and interpretation. You can imagine the power
an educator holds in his hand when designing a test. Your test could determine
the future of thousands of learners. Inferences based on test of low validity could
give a completely different picture of the actual abilities and competencies of
learners. Three types of validity have been identified: construct validity, content
validity and criterion-related validity which is made up of predictive and
concurrent validity (refer to Figure 8.4).

Figure 8.4: Types of validity
Now let us find out more on each type of validity.
(a) Construct Validity

Construct validity relates to whether the test is an adequate measure of the
underlying construct. A construct could be any phenomena such as
mathematics achievement, map skills, reading comprehension, attitude
towards school, inductive reasoning, environmental awareness, spelling
ability and so forth. You might think of construct validity as a „labelling‰
issue. For example, when you measure what you term as „critical thinking‰,
is that what you are really measuring?
Thus, to ensure high construct validity, you must be clear about the
definition of the construct you intend to measure. For example, a construct
such as reading comprehension would include vocabulary development,
reading for literal meaning and reading for inferential meaning. Some
experts in educational measurement have argued that construct validity is is
the most critical type of validity. You could establish the construct validity
of an instrument by correlating it with another test that measures the same
construct. For example, you could compare the scores obtained on your
reading comprehension test with the scores obtained on another well-known
reading comprehension test administered to the same sample of learners. If
the scores for the two tests are highly correlated, then you may conclude that
your reading comprehension test has high construct validity.

A construct is determined by referring to theory. For example, if you are

interested in measuring the construct „self-esteem‰, you need to be clear
what self-esteem is. Perhaps, you need to refer to literature in the field
describing the attributes of self-esteem. You will find that theoretically, self-
esteem is made of the following attributes ă physical self-esteem, academic
self-esteem and social self-esteem. Based on this theoretical perspective, you
can build items or questions to measure self-esteem covering these three
types of self-esteem. Through such a process you will be more certain to
ensure high construct validity.
(b) Content Validity

Content Validity is more straightforward and likely to be related to construct
validity. It concerns the coverage of appropriate and necessary content, for
example, does the test cover the skills necessary for good performance or all
the aspects of the subject taught? It is concerned with sample-population
representativeness. That is to say, the facts, concepts and principles covered
by the test items should be representative of the larger domain (such as the
syllabus) of facts, concepts and principles.
For example, the Science unit on „Energy and Forces‰ may include facts,
concepts, principles and skills on light, sound, heat, magnetism and
electricity. However, it is difficult, if not impossible, to administer a two to
three-hour paper to test all aspects of the syllabus on „Energy and Forces‰
(refer to Figure 8.5). Therefore, only selected facts, concepts, principles and
skills from the syllabus (or domain) are sampled. The content selected will
be determined by content experts who will judge the relevance of the content
in the test to the content in the syllabus or particular domain.
Figure 8.5: Sample of content tested for the unit on „Energy and Forces‰

Content validity will be low if the questions in the test include testing content
not included in the domain or syllabus. To ensure content validity and
coverage, most teachers use the Table of Specifications (as discussed in
Topic 3). Table 8.2 is an example of a Table of Specifications which specifies
the knowledge and skills to be measured and the topics covered for the unit
on „Energy and Forces‰. You cannot measure all the content of a topic,
therefore, you will have to focus on the key areas and give due weightage to
those areas that are important. For example, the teacher has decided that 64
per cent of questions will emphasise on the understanding of concepts while
36 per cent will focus on the application of concepts for the five topics. A
Table of Specifications provides teachers with evidence that a test has high
content validity and that it covers what should be covered.
Table 8.2: Table of Specifications for the Unit on „Energy and Forces‰
Understanding of Application of
Topics Total
Concept Concepts
Light 7 4 11 (22%)
Sound 7 4 11 (22%)
Heat 7 4 11 (22%)
Magnetism 3 3 6 (11%)
Electricity 8 3 11 (22%)
TOTAL 32 (64%) 18 (36%) 50
Content validity is different from face validity which refers to what the test
superficially appears to measure. Face validity assesses whether the test
„looks valid‰ to the examinees, the administrative personnel who decide on
its use and other technically untrained observers. The face validity is a weak
measure of validity but that does not mean that it is incorrect, only that
caution is necessary. Its importance cannot be underestimated.
(c) Criterion-related Validity

Criterion-related validity of a test is established by relating the scores
obtained to the scores of some other criterion or other test.

There are two types of criterion-related validity:
(i) Predictive validity relates to whether the test predicts accurately some
future performance or ability. Is the STPM examination a good
predictor of performance in university? One difficulty in calculating the
predictive validity of STPM is because only those who pass the
examination proceed to university (generally speaking) and we do not
know how well learners who did not pass the examination might have
done (Wood, 1991). Moreover, only a small proportion of the
population sit for the STPM examination. As such, the correlation
between STPM grades and performance at the degree level would be
quite high.
(ii) Concurrent validity is concerned about whether the test correlates

with, or gives substantially the same results as, another test of the same
skill. For example, does your end-of-year language test correlate with
the MUET examination? In other words, if your language test correlates
highly with MUET, then your language test has high concurrent
validity.
8.6 FACTORS AFFECTING RELIABILITY AND

VALIDITY
Deale (1975) suggests that to prepare tests which are acceptably valid and reliable,
these factors should be taken into account:
(a) Length of the Test

Generally, the longer the test, the more reliable and valid is the test. A short
test would not adequately cover a yearÊs work. The syllabus needs to be
sampled. The test should consist of enough questions that are representative
of the knowledge, skills and competencies in the syllabus. However, there is
also a problem with tests that are too long. A long test may be valid but it
will take too much time and fatigue may set in, which may affect
performance and the reliability of the test.
(b) Selection of Topics

The topics selected and the test questions prepared should reflect the way
the topics were treated during teaching and learning. It is necessary to be
clear about the learning outcomes and to design items that measure these

learning outcomes. For example, in your teaching, learners were not given
the opportunity of think critically and solve problems. However, your test
consists of items requiring learners to think critically and solve problems. In
such situation, the reliability and validity of the test will be affected.
(c) Choice of Testing Techniques

The testing techniques selected will also affect reliability and validity. For
example, if you choose to use essay questions, validity may be high but
reliability may be low. Essay questions tend to be less reliable than short-
answer questions. Structured essays are usually more reliable than open-
ended essays.
(d) Method of Test Administration

Test administration is also an important step in the measurement process.
This includes the arrangement of items in a test, the monitoring of test taking
and the preparation of data files from the test booklets. Poor test
administration procedures can lead to problems in the data collected and
threaten the validity of test results.
Adequate time must be allowed for the majority of learners to complete the
test. This would reduce wild guessing and instead encourage learners to
think carefully about the answers. Instructions need to be clear to reduce the
effects of confusion on reliability and validity. The physical conditions under
which the test is taken must be favourable for learners. There must be
adequate space, lighting and appropriate temperature. Learners must be able
to work independently and the possibility of distractions in the form of
movement and noise must be guarded against.
(e) Method of Marking

The marking should be as objective as possible. Marking which depends on
the exercise of human judgement such as in essays, observations of classroom
activities and hands-on practices is subject to the variations of human
fallibility. (Refer to inter-rater reliability discussed in the earlier subtopic.) It
is quite easy to mark objective items quickly, but it is also surprisingly easy
to make careless errors. This is especially true where large numbers of scripts
are being marked. A system of checks is strongly advised. One method is
through the comments of the learners themselves when their marked papers
are returned to them.

8.7 RELATIONSHIP BETWEEN RELIABILITY

AND VALIDITY
Some people may think of reliability and validity as two separate concepts. In
reality, reliability and validity are related. Trochim (2005) provided the following
analogy (refer to Figure 8.6).
Figure 8.6: Graphical representation of the relationship between reliability and validity
The centre or the bullseye is the concept that we are trying to measure. Say for
example, in trying to measure the concept of „inductive reasoning‰, you are likely
to hit the centre (or the bullseye) if your inductive reasoning test is both reliable
and valid, which is what all test developers aim to achieve (see Figure 8.7(d)).
On the other hand, your inductive reasoning test could be reliable but not valid.
How is that possible? Your test may not measure inductive reasoning but the score
you obtain each time you administer the test is approximately the same (see Figure
8.7(b)). In other words, the test is consistently and systematically measuring the
wrong construct (that is not inductive reasoning). Imagine the consequences of
making judgement about the inductive reasoning of learners using such a test!
In the context of psychological testing, if an instrument does not have satisfactory

reliability, one typically cannot claim validity. Validity requires that instruments
are sufficiently reliable. The diagram in Figure 8.7(c) does not have validity
because the reliability is low. In other words, you are not getting a valid estimate
of the inductive reasoning ability of your learners because they are inconsistent.
The worst case scenario is when the test is neither reliable nor valid (see
Figure 8.7(a)). In this scenario the scores obtained by learners tend to concentrate
at the top half of the target and they are consistently missing the centre. Your
measure in this case is neither reliable nor valid and the test should be rejected or
improved.

 The true score is a hypothetical concept as to the actual ability, competency and
capacity of an individual.
 The higher the reliability and validity of your test, the greater the likelihood
you will be measuring the true score of your learners.
 Reliability refers to the consistency of a measure. A test is considered reliable

if we get the same result repeatedly.
 Validity requires that instruments are sufficiently reliable.
 Face validity is a weak measure of validity.
 Using the test-retest technique, the same test is administered again to the same
group of learners.
 For the parallel or equivalent forms technique, two equivalent tests (or forms)
are administered to the same group of learners.
 Internal consistency is determined using only one test administered once to

learners.
 When two or more persons mark essay questions, the extent to which there is
agreement in the marks allotted is called inter-rater reliability.
 While inter-rater reliability involves two or more individuals, intra-rater

reliability is the consistency of grading by a single rater.
 Validity is the extent to which a test measures what it claims to measure. It is

vital for a test to be valid in order for the results to be accurately applied and
interpreted.
 Construct validity relates to whether the test is an adequate measure of the

underlying construct.
 Content validity is more straightforward and likely to be related to construct

validity. It relates to the coverage of appropriate and necessary content.
 Some people may think of reliability and validity as two separate concepts. In
reality, reliability and validity are related.

Construct validity Intra-rater reliability

Concurrent validity Predictive validity
Content validity Reliability
Criterion-related validity Test-retest
Face validity True score
Internal consistency Validity
Inter-rater reliability

Topic  Appraising
Classroom
9 Tests and Item
Analysis
LEARNING OUTCOMES
1. Define item analysis;
2. Compute the difficulty index;
3. Compute the discrimination index;
4. Analyse the effectiveness of distractors in a question;
5. Discuss the relationship between the difficulty index and
discrimination index of an item; and
6. Explain the role of an item bank.
 INTRODUCTION
When you develop a test it is important to identify the strengths and weaknesses
of each item. In other words, to determine how well items in a test perform, some
statistical procedures need to be used.
In this topic, we will discuss item analysis which involves the use of three
procedures, namely item difficulty, item discrimination and distractor analysis to
help the test developer decide whether the items in a test can be accepted, modified
or rejected. These procedures are quite straightforward and easy to use but
educators need to understand the logic underlying the learner analyses in order to
use them properly and effectively.

178  TOPIC 9 APPRAISING CLASSROOM TESTS AND ITEM ANALYSIS
9.1 WHAT IS ITEM ANALYSIS?

Having administered a test and marked it, most teachers would discuss the
answers with learners. Discussion would usually focus on the right answers and
common errors made by learners. Some teachers may focus on the questions most
learners performed poorly on and the questions learners did very well. However,
there is much more information available about a test that is often ignored by
teachers. This information will only be available if item analysis is conducted.
What is item analysis? Item analysis is a process which examines the responses to
individual test items or questions in order to assess the quality of those items and
the test as a whole. Item analysis is especially valuable in improving items or
questions that will be used again in later tests. Moreover, it can also be used to
eliminate ambiguous or misleading items in a single test administration.
Specifically in Classical Test Theory (CTT) the statistics produced from analysing
the test results based on test scores include measures of difficulty index and
discrimination index. Analysing the effectiveness of distractors also becomes part
of the process. We will discuss each of these components of item analysis in detail
later.
The quality of a test is determined by the quality of each item or question in the
test. The teacher who constructs a test can only roughly estimate the quality of a
test. This estimate is based on the fact that the teacher has followed all the rules
and conditions of test construction. However, it is possible that this estimation may
not be accurate and certain important aspects have been ignored. Hence, it is
suggested that to obtain a more comprehensive understanding of the test, item
analysis should be conducted on the responses of learners. Item analysis is
conducted to obtain information about individual items or questions in a test and
how the test can be improved. It also facilitates the development of an item or
question bank which can be used in the construction of a test.
9.2 STEPS IN ITEM ANALYSIS

Both the Classical Test Theory (CTT) and modern test theories such as Item
Response Theory (IRT) provide useful statistics to help us analyse the test data.
For many item analyses CTT is sufficient to provide the information we need. As
such, CTT will be used in this module.

TOPIC 9 APPRAISING CLASSROOM TESTS AND ITEM ANALYSIS  179
Let us take an example of a teacher who has administered a 30-item multiple-

choice objective test in geography to 45 learners in a secondary school classroom.
(a) Step 1
Obviously, upon receiving the answer sheets, the first step would be to mark
each of the answer sheets.
(b) Step 2
Arrange the 45 answer sheets from the highest score obtained to the lowest
score obtained. The paper with the highest score is on top and the paper with
the lowest score is at the bottom.
(c) Step 3
Multiply 45 (the number of answer sheets) with 0.27 (or 27 per cent) which is
12.15 and round up to 12. The use of the value 0.27 or 27 per cent is not
inflexible. It is possible to use any percentage between 27 to 35 per cent as the
value. However the 27 per cent rule can be ignored if the class size is too
small. Instead of taking the 27 per cent sample, divide the number of answer
sheets by 2.
(d) Step 4
Arrange the pile of 45 answer sheets according to scores obtained (highest
score to the lowest score). Take out 12 answer sheets from the top of the pile
and 12 answer sheets from the bottom of the pile. Call these two piles as
„high mark‰ learners and „low mark‰ learners. Set aside the middle group
of papers (21 papers). Although these could be included in the analysis, using
only the high and low groups simplifies the procedure.
(e) Step 5
Refer to Item #1 or Question #1:
(i) Count the number of learners from the „high mark‰ group who
selected each of the options (A, B, C or D); and
(ii) Count the number of learners from the „low mark‰ group who selected
the options A, B, C or D (see Figure 9.1).

Figure 9.1: Item analysis for one item or question
From the analysis, 11 learners from the „high mark‰ group and two learners from
the „low mark‰ group selected „B‰ which is the correct answer. This means that
13 out of 24 learners selected the correct answer. Also, note that all the distractors
(A, C and D) were selected by at least one learner. However, the information
provided in Figure 9.1 is insufficient and further analysis has to be conducted.
9.3 THE DIFFICULTY INDEX

Using the information provided in Figure 9.1, you can compute the difficulty index
which is a quantitative indicator with regard to the difficulty level of an individual
item or question. It can be calculated using the following formula:
Number of learners with the correct answer (R)

Difficulty index  p  
Total number of learners who attempted the question  T 
R 13
   0.54
T 24
What does a difficulty index (p) of 0.54 mean? The difficulty index is a coefficient
that shows the percentage of learners who got the correct answer compared to the
total number of learners in the two groups who answered. In other words, 54 per
cent of learners selected the correct answer. Although our computation is based on
the high and low scoring groups only, it provides a close approximation to the
estimate that would be obtained with the total group. Thus, it is proper to say that
the index of difficulty for this item is 54 per cent (for this particular group). Note
that, since difficulty refers to the percentage of getting the item right, the smaller
the percentage figure the more difficult is the item. Lien (1980) provides these
guidelines on the interpretation of the difficulty index as follows (see Figure 9.2):
Figure 9.2: Interpretation of the difficulty index (p)
If a teacher believes that the achievement 0.54 on the item is too low, he can change
the way he teaches to better meet the objective represented by the item. Another
interpretation might be that the item was too difficult, confusing or invalid, in
which case the teacher can replace or modify the item, perhaps using information
from the item's discrimination index or distractor analysis.
Under CTT, the item difficulty measure is simply the proportion of the correct
answer from learners for an item. For an item with a maximum score of 2, there is
a slight modification to the computation of proportion of percentage correct.
This item has a possible partial credit scoring of 0, 1 and 2. If the total number of
learners attempting this item is 100 and 23 learners score 0, 60 learners score 1 and
17 learners score 2, then a simple calculation will show that 23 per cent of the
learners score 0, 60 per cent of the learners score 1 and 17 per cent of the learners
score 2 for this particular item. The average score for this item should be (0  0.23)
+ (1  0.6) + (2  0.17) = 0.94.
Thus the observed average score of this item is 0.94 out of a maximum of 2. So the
average proportion correct is 0.94/2 = 0.47 or 47 per cent.
SELF-CHECK 9.1
A teacher gave a 20-item science test to a group of 35 learners. The correct
answer for Question #25 is „C‰ and the results are as follows:
Options A B C D Blank
High mark group (n = 12) 0 2 8 2 0
Low mark group (n = 12) 2 4 3 2 1
(a) Compute the difficulty index (p) for Question #25.

(b) Is Question #25 an easy or difficult question?
(c) Do you think you need to improve Question #25?

9.4 THE DISCRIMINATION INDEX

The discrimination index is a basic measure which shows the extent to which a
question discriminates or differentiates between learners in the „high mark‰ and
„low mark‰ groups. This index can be interpreted as an indication of the extent to
which overall knowledge of the content area or mastery of the skills is related to
the response on an item. It is most crucial for a test item that learners got the
answer correct due to their level of knowledge or ability and not due to something
else such as chance or test bias.
Note in our example earlier, 11 learners in the „high mark‰ group and two learners
in the „low mark‰ group selected the correct answer. This indicates positive
discrimination since the item differentiates between learners in the same way that
the total test score does. That is, learners with high scores on the test (high mark
group) got the item right more frequently than learners with low scores on the test
(low mark group). Although analysis by inspection may be all that is necessary for
most purposes, an index of discrimination can be easily computed using the
following formula:
R h  RL
Discrimination index 
1 T
2
Where Rh = Number of learners in „high mark‰ group (RH) with the correct
answer
RL = Number of learners in „low mark‰ group (RL) with the correct
answer
T = Total number of learners
Example:
A test was given to a group of 43 learners and 10 out of the 13 „high mark‰ group
got the correct answer compared to 5 out of 13 „low mark‰ group who got the
correct answer. The discrimination index is computed as follows:
R h  R L 10  5 10  5
D    0.38
1 T 1  26  13
2 2

What does a discrimination index of 0.38 mean? The discrimination index is

a coefficient that shows the extent to which the question discriminates or
differentiates between „high mark‰ learners and „low mark‰ learners. Blood and
Budd (1972) provided the following guidelines on the meaning of the
discrimination index as follows (see Figure 9.3):
Figure 9.3: Interpretation of the Discrimination Index
A question that has a high discrimination index is able to differentiate between

learners who know and those who do not know the answer. When we say that a
question has a low discrimination index, it is not able to differentiate between
learners who know and learners who do not know. A low discrimination index
means that more „low mark‰ learners got the correct answer because the question
was too simple. It could also mean that learners from both the „high mark‰ group
and „low mark‰ group got the answer wrong because the question was too
difficult.
The formula for the discrimination index is such that if more learners in the „high
mark‰ group chose the correct answer than did learners in the low scoring group,
the number will be positive. At the very least, one would hope for a positive value
as that would indicate that it is knowledge of the question that resulted in the
correct answer.
(a) The greater the positive value (the closer it is to 1.0), the stronger the
relationship is between overall test performance and performance on that
item.
(b) If the discrimination index is negative, that means for some reason learners
who scored low on the test were more likely to get the answer correct. This
is a strange situation which suggests poor validity for an item.

SELF-CHECK 9.2
A teacher gave a 35-item economics test to 42 learners. For Question #16,

8 out of the 11 „high mark‰ groups got the correct answer compared to 4
out of 11 from the „low mark‰ group who got the correct answer.
(a) Compute the discrimination index for Question #16.
(b) Does Question #16 have a high or low discrimination index?
9.5 APPLICATION OF ITEM ANALYSIS ON

ESSAY-TYPE QUESTION
The previous subtopic explains the use of item analysis on multiple-choice
questions. Item analysis can also be applied to essay-type questions. This subtopic
will illustrate how this can be done. For ease of understanding, the illustration will
use a short-answer essay question as an example.
Let us assume that a group of 20 learners have responded to a short-answer essay

question with scores ranging from a minimum of 0 to the maximum of 4. Table 9.1
provides the scores obtained by the learners.
Table 9.1: Scores Obtained by Learners for a Short-answer Essay Question
Item Score No. of Learners Earning Each Score Total Scores Earned
4 5 20
3 6 18
2 5 10
1 3 3
0 1 0
Total 51
Average score 51/20 = 2.55
The difficulty index (p) of the item can be computed using the following formula
as suggested by Nitko (2004):
Average score
p
Possible range of score

Using the information from Table 9.1, the difficulty index of the short-answer essay
question can be easily computed. The average score obtained by the group of
learners is 2.55, while the possible range of score for the item is (4 ă 0) = 4. Thus,
2.55
p
4
 0.64
The difficulty index (p) of 0.64 means that on average learners received 64 per cent
out of the possible maximum score for the item. The difficulty index can be
interpreted in the same way as that of the multiple-choice question discussed in
subtopic 9.3. The item is of a moderate level of difficulty (refer to Figure 9.2).
Note that in computing the difficulty index in the above example, the scores from
the whole group are used to obtain the average score. However, for a large group
of learners, it is possible to estimate the difficulty index for an item based on only
a sample of learners comprising the „high mark‰ and „low mark‰ groups as in the
case of computing the difficulty index of a multiple-choice question.
To compute the discrimination index (D) of an essay-type question, the following

formula is suggested by Nitko (2004):
Difference between upper and lower groups average score

D
Possible range of score
Using the information from Table 9.1 and presenting it in the format as shown in
Table 9.2, we can compute the discrimination index of the short-answer essay
question.
Table 9.2: Distribution of Scores Obtained by Learners
Average
Score 0 1 2 3 4 Total
score
High Mark Group (n = 10) 0 0 1 4 5 34 3.4
Low Mark Group (n = 10) 1 3 4 2 0 17 1.7
„n‰ refers to the number of learners

The average score obtained by the upper group of learners is 3.4 while that of the
lower group is 1.7. Using the formula as suggested by Nitko (2004), we can
compute the discrimination index of the short-answer essay question as follows:
3.4  1.7
D
4
 0.43
The discrimination index (D) of 0.43 indicates that the short-answer question does
discriminate between the upper and lower groups of learners and at a high level
(refer to Figure 9.3.). As in the computation of the discrimination index of
the multiple-choice question for a large group of learners, a sample of learners
comprising the top 27 per cent and the bottom 27 per cent may be used to provide
a good estimate.
SELF-CHECK 9.3
The following information is the performance of the high mark and the
low mark groups in a short-answer essay question.
Score 0 1 2 3 4
High mark group (n = 10) 2 2 3 1 2
Low mark group (n = 10) 3 2 2 3 0
(a) Calculate the difficulty index.
(b) Calculate the discrimination index.
(c) Discuss the findings.

9.6 RELATIONSHIP BETWEEN DIFFICULTY

INDEX AND DISCRIMINATION INDEX
Theoretically, the more difficult a question (or item) or the easier the question (or
item), the lower will be the discrimination index. Stanley and Hopkins (1972)
provided a theoretical model to explain the relationship between the difficulty
index and discrimination index of a particular question or item (see Figure 9.4).
Figure 9.4: Theoretical relationship between difficulty index and discrimination index
Source: Stanley & Hopkins (1972)

According to the model, a difficulty index of 0.2 can result in a discrimination

index of about 0.3 for a particular item (which may be described as an item of
moderate discrimination). Note that as the difficulty index increases from 0.1 to
0.5, the discrimination index increases even more. When the difficulty index
reaches 0.5 (described as an item of moderate difficulty), the discrimination index
is positive 1.00 (very high discrimination). Interestingly, for difficulty index of
more than 0.5, the discrimination index decreases. Why is this so?
(a) For example, a difficulty index of 0.9 results in a discrimination index of

about 0.2 which is described as an item of low to moderate discrimination.
What does this mean? The easier the question, the harder it is for that
question or item to discriminate between those learners who know and do
not know the answer to the question.
(b) Similarly, when the difficulty index is about 0.1, the discrimination index
drops to about 0.2. What does this mean? The more difficult the question, the
harder it is for that question or item to discriminate between those learners
who know and do not know the answer to the question.
SELF-CHECK 9.4
1. What can you conclude about the relationship between the

difficulty index of an item and its discrimination index?
2. Do you take these factors into consideration when giving a

multiple-choice test to students in your school?
9.7 DISTRACTOR ANALYSIS

In addition to examining the performance of the entire test items, teachers are also
interested to examine the performance of individual distractors (incorrect answer
options) on multiple-choice items. By calculating the proportion of learners who
chose each answer option, teachers can identify which distractors are "working"
and appear attractive to learners who do not know the correct answer and which
distractors are simply taking up space and not being chosen by many learners. To
eliminate blind guessing which results in a correct answer purely by chance (which
hurts the validity of a test item), teachers want as many plausible distractors as is
feasible. Analyses of response options allow teachers to fine tune and improve
items they may wish to use again with future classes. Let us examine performance
on an item or question (see Figure 9.5).

Example:
Which European power invaded Melaka in 1511?
Figure 9.5: Effectiveness of distractors
Generally, a good distractor is able to attract more „low mark‰ learners to select
that particular response or distract „low mark‰ learners towards selecting that
particular response. What determines the effectiveness of distractors? In Figure
9.5, a total of 24 learners selected the options A, B, C and D for a particular
question. Option B is a less effective distractor because many „high mark‰ learners
(n = 5) selected option B. Option D is relatively a good distractor because two
learners from the „high mark‰ group and five learners from the „low mark‰ group
selected this option. The analysis of response options shows that those who missed
the item were equally likely to choose option B and option D. No learners chose
option C. Therefore, option C does not act as a distractor. This is because learners
are not choosing between four answer options on this item, they are really
choosing between only three options as they are not even considering option C.
This makes guessing correctly more likely, which hurts the validity of the item.
The discrimination index can be improved by modifying and improving options B
and C.
SELF-CHECK 9.5
Which British resident was killed by Maharajalela in Pasir Salak?
Hugh Low Birch Brooke Gurney

Options A B C D No Response
High mark (n = 15) 4 7 0 4 0
Low mark (n = 15) 6 3 2 4 0
The answer is B.
Analyse the effectiveness of the distractors.

9.8 PRACTICAL APPROACH TO ITEM

ANALYSIS
Some teachers may find the techniques discussed earlier as time consuming, which
cannot be denied (especially when you have a test consisting of 40 items).
However, there is a more practical approach which may take up less time. Imagine
that you have administered a 40-item test to a class of 30 learners. Surely it will
take a lot of time to analyse the effectiveness of each item and may discourage you
from analysing each item in a test. Diederich (1971) proposed a method of item
analysis which can be conducted by the teacher and the learners in his class. The
following are the steps:
(a) Step 1
Arrange the 30 answer sheets from the highest score obtained to the lowest
score obtained.
(b) Step 2
Select the answer sheet that obtained a middle score. Group all answer sheets
above this score as „high marks‰ (mark a „H‰ on these answer sheets). Group
all answer sheets below this score as „low marks‰ group (mark an „L‰ on
these answer sheets).
(c) Step 3
Divide the class into two groups (high and low) and distribute the „high‰
answer sheets to the high group and the „low‰ answer sheets to the low
group. Assign one learner in each group to be the counter.
(d) Step 4
The teacher then asks the class.
Teacher: The answer for Question #1 is „C‰ and those who got it correct,
raise your hand.
Counter from „H‰ group: 14 for group H.
Counter from „L‰ group: 8 from group L.

(e) Step 5
The teacher records the responses on the whiteboard as follows:
High Low Total Correct Answers

Question #1 14 8 22
Question #2 12 6 18
Question #3 16 7 23
|
|
Question #n n n n
(f) Step 6
Compute the difficulty index for Question #1 as follows:
R H  R L 14  8
Difficulty index    0.73
30 30
(g) Step 7
Compute the discrimination index for Question #1 as follows:
R H  R L 14  8 6
Difficulty index     0.40
1 30 15 15
2
Note that earlier, we took 27 per cent of answer sheets in the „high mark‰
group and 27 per cent of answer sheets in the „low mark‰ group from the
total answer sheets. However, in this approach we divided the total answer
sheets into two groups. There is no middle group. The important thing is to
use a large enough fraction of the group to provide useful information.
Selecting the top and bottom 27 per cent of the group is recommended for
more refined analysis. The method shown in the example may be less
accurate but it is a „quick and dirty‰ method.
SELF-CHECK 9.6
Compare the difficulty index and discrimination index obtained using
this rough method with the theoretical model by Stanley and Hopkins in
Figure 9.4. Are the indexes very far out?

ACTIVITY 9.1
Teachers should perform an item analysis every time after administering

a test. Discuss this with your coursemates.
9.9 USEFULNESS OF ITEM ANALYSIS TO

TEACHERS
After each test or assessment, it is advisable to carry out item analysis of the test
items because the information from the analysis would be useful to teachers.
Among the benefits from the analysis are as follows:
(a) From the discussions in the earlier subtopics, it is obvious that the results of
item analysis could provide answers to the following questions:
(i) Did the item function as intended?
(ii) Were the items of appropriate difficulty?
(iii) Were the items free from irrelevant clues and other defects?
(iv) Was each of the distracters effective (in multiple-choice questions)?
Answers to the above questions can be used to select or revise test items for
future use. This would help to improve the quality of test items and the test
paper for future use. It also saves time for teachers when preparing test items
for future use because good items can be stored in an item bank.
(b) Item analysis data can provide a basis for efficient class discussion of the test
results. Knowing how effectively each test item functions in measuring the
achievement of the intended learning outcome and how learners perform in
each item, teachers can have a more fruitful discussion with the learners as
feedback based on item analysis is more objective and informative. For
example, teachers can highlight the misinformation or misunderstanding
reflected in the choice of particular distractors on multiple-choice questions
or frequently repeated errors on essay-type questions, thereby enhancing the
instructional value of assessment. If, during the discussion, the item analysis
reveals that there are technical defects in the items or the marking scheme,
learnersÊ marks can also be rectified to ensure a fairer test.

(c) Item analysis data can be used for remedial work. The analysis will reveal
the specific areas that the learners are weak in. Teachers can use the
information to focus remedial work directly on the particular areas of
weakness. For example, based on the distractor analysis, it is found that a
specific distractor has a low discrimination with a high number of learners
from both the high mark and low mark groups choosing the option. This
could suggest that there is some misunderstanding on a particular concept.
Remedial lessons can thus be planned to arrest the problem.
(d) Item analysis data can reveal weaknesses in teaching and provide useful
information to improve teaching. For example, despite the fact that an item
is properly constructed, it has a low difficulty index, suggesting that most
learners fail to answer the item satisfactorily. This might indicate that the
learners have not mastered a particular syllabus content that is being
assessed. This could be due to the weakness in instruction and thus
necessitates the implementation of more effective teaching strategies by the
teachers. Furthermore, if the item is repeatedly difficult for the learners, there
might be a need to revise the curriculum.
(e) Item analysis procedures provide a basis for teachers to improve their skills
in test construction. As teachers analyse learnersÊ responses to items, they
become aware of the defects of the items and what causes them. When
revising the items, they gain experience in rewording the statements so that
they are clearer, rewriting the distractors so that they are more plausible and
modifying the items so that they are at a more appropriate level of difficulty.
As a result, teachers improve their test construction skills.
9.10 CAUTIONS IN INTERPRETING ITEM

ANALYSIS RESULTS
Despite the usefulness of item analysis, the results from such an analysis are
limited in many ways and must be interpreted cautiously. The following are some
of the major concerns to observe:
(a) Item discriminating power does not indicate item validity. A high
discrimination index merely indicates that learners from the high mark
group performed relatively better than the learners from the low mark
group. The division of the high mark and low mark groups is based on the
total test score obtained by each learner, which is an internal criterion. By

using the internal criterion of total test score, item analysis offers evidence
concerning the internal consistency of the test rather than its validity. The
validity of a test needs to be judged using an external criterion, that is, to
what extent the test assesses the learning outcomes intended.
(b) The discrimination index is not always an indicator of item quality. For
example, a low index of discriminating power does not necessarily indicate
a defective item. If an item does not discriminate but it has been found to be
free from ambiguity and other technical defects, the item should be retained
especially in a criterion-referenced test. In such a test, a non-discriminating
item may suggest that all learners have achieved the criterion set by the
teacher. As such, the item does not discriminate between good and weak
learners. Another possible reason why low discrimination occurs for an item
is that the item may be either very easy or very difficult. Sometimes, this item
is necessary or desirable to be retained in order to measure a representative
sample of learning outcomes and course content. Moreover, an achievement
test is usually designed to measure several different types of learning
outcomes (knowledge, comprehension, application and so on). In such a
case, there will be learning outcomes that are assessed by fewer test items
and these items will have low discrimination because they have less
representation in the total test score. Removing these items from the test is
not advisable as it will affect the validity of the test.
(c) This type of traditional item analysis data is tentative. They are not fixed but
influenced by the type and number of learners being tested and the
instructional procedures employed. The data would thus change with every
administration of the same test items. Therefore, if repeated use of items is
possible, item analysis should be carried out for each administration of each
item. The tentative nature of item analysis should therefore be taken
seriously and the results are interpreted cautiously.
9.11 ITEM BANK

An item bank is a large collection of easily accessible questions or items that have
been administered over a period of time. For achievement tests which assess
performance in a body of knowledge such as geography, history, chemistry or
mathematics, the questions that can be asked are rather limited. Hence, it is not
surprising that previous questions are recycled with some minor changes and
administered to different groups of learners. Making good test items is not a
simple task and can be time consuming for teachers. An item or question bank
would be of great assistance to teachers.

An item bank consists of questions that have been analysed and stored because
they are good items. Each stored item will have information on its difficulty index
and discrimination index. Each item is stored according to what it measures
especially in relation to the topics of the curriculum. These items will be stored in
the form of a Table of Specifications indicating the content being measured as well
as the cognitive levels measured. For example, from the item bank, you will be able
to draw items measuring the application of concepts for the topic on „Electricity‰.
You will also be able to draw items from the bank with different difficulty levels.
Perhaps, you want to arrange easier questions at the beginning of the test so as to
build confidence in learners and gradually introduce questions of increasing
difficulty.
With computerised databases, item banks are easy to access. Teachers will have
hundreds of items at their disposal from which they can draw upon when
developing classroom tests. This would certainly help teachers with the tedious
and time consuming task of having to construct items or questions from scratch.
Unfortunately, not many educational institutions are equipped with such an item
bank. The more common practice is for teachers to select items or questions from
commercially prepared workbooks, past examination papers and sample items
from textbooks. These sources do not have information about the difficulty index
and discrimination index of items or information about the cognitive levels of
questions or what they aim to measure. Teachers will have to figure out for
themselves the characteristics of the items based on their experience in teaching
the content.
However, there are some issues with regard to the use of item bank. One of the
major concerns of the item bank is how to place different test items collected over
time on a common scale. The scale should indicate difficulty of the items, one scale
per subject matter. Retrieval of items from the bank is made easy when all items
are placed on the same scale.
The person in charge must also take every effort to add only quality items to the
item pool. To develop and maintain a good item bank requires a great deal of
preparation, planning, expertise and organisation. Although Item Response
Theory (IRT) approach is not a cure all pill for item banking problems, many of
these issues can be solved.

9.12 PSYCHOMETRIC SOFTWARES

Software programs designed for general statistical analysis such as SPSS can often
be used for certain types of psychometric analysis. Various software programs are
available specifically to analyse the data from tests.
Classical Test Theory (CTT) is an approach to psychometric analysis that has

weaker assumptions than Item Response Theory (IRT) and is more applicable to
smaller sample sizes. Under CTT, the learnerÊs raw test score would be the sum of
the scores received on the item in the test. For example, Iteman is a commercial
program while TAP is a free program for classical analysis.
IRT is a psychometric approach which assumes that the probability of a certain

response is a direct function of an underlying trait or traits. Under IRT, the concern
is whether the learner obtained each item correctly or not, rather than the raw test
score. The basic concept of IRT is about the individual item of test rather than about
the test scores. LearnersÊ trait or ability and item characteristics are referenced to
the same scale. For example, ConQuest is a computer program for item response
and latent regression models and TAM is an R package for item response models.
 Item analysis is a process which examines the responses to individual test

items or questions in order to assess the quality of those items and the test as a
whole.
 Item analysis is conducted to obtain information about individual items or

questions in a test and how the test can be improved.
 The difficulty index is a quantitative indicator with regards to the difficulty

level of an individual item or question.
 The discrimination index is a basic measure which shows the extent to which
a question discriminates or differentiates between learners in the „high mark‰
group and „low mark‰ group.
 Theoretically, the more difficult a question (or item) or easier the question (or
item), the lower will be the discrimination index.

 By calculating the proportion of learners who chose each answer option,

teachers can identify which distractors are "working" and appear attractive to
learners who do not know the correct answer and which distractors are simply
taking up space and not being chosen by many learners.
 Generally, a good distractor is able to attract more „low mark‰ learners to select
that particular response or distract „low mark‰ learners towards selecting that
particular response.
 An item bank consists of questions that have been analysed and stored because
they are good items.
Classical Test Theory (CTT) High mark group

Difficult question Item analysis
Difficulty index Item bank
Discrimination index Item Response Theory (IRT)
Distractor analysis Low mark group
Easy question Psychometric
Good distractor

Topic  Analysis and
Interpretation
10 of Test Scores
LEARNING OUTCOMES
1. Compute various central tendency measures;
2. Explain the use of standard scores;
3. Compute the z score and T score;
4. Describe the characteristics of a normal curve; and
5. Explain the role of norms in standardised tests.
 INTRODUCTION
All the data you have collected on the performance of learners will have to be
analysed. In this topic we will focus on the analysis and interpretation of the data
you have collected about the knowledge, skills and attitudes of your learners.
You analyse and interpret the information you have collected about your learners
quantitatively and qualitatively. For quantitative analysis of data, various
statistical tools are used, which we will be focussing on in this topic. For example,
statistics are used to show the distribution of scores on a Geography test and the
average score obtained by learners.

TOPIC 10 ANALYSIS AND INTERPRETATION OF TEST SCORE  199
10.1 WHY USE STATISTICS?

When you administer a Geography test to your class of 40 learners at the end of
the semester, you will obtain a score for each learner which is the measurement of
a sample of the learnerÊs ability. The behaviour tested could be the ability to solve
problems such as reading maps, the globe and interpretation of graphs. For
example, learner A gets a score of 64 while learner B gets 32. Does this mean that
the ability of learner A is better than that of learner B? Does it mean that the ability
of learner A is twice the ability of learner B? Are the scores 64 and 32 in
percentages? These scores or marks are difficult to interpret because they are raw
scores. Raw scores can be confusing if there is no reference made to a „unit‰. It is
only logical that you convert the scores to a unit such as percentages. In this
example, you get 64 per cent and 32 per cent.
Even the use of percentages may not be meaningful. For example, getting 64 per
cent in the test may be considered „good‰ if the test was a difficult test. On the
other hand, if the test was an easy one, then 64 per cent may be considered to be
only „average‰. In other words, to get a more accurate picture of the scores
obtained by learners in the test, the teacher should:
(a) Find out which learner obtained the highest marks in the class and the
number of questions correctly answered;
(b) Find out which learner obtained the lowest marks in the class and the
number of questions correctly answered; and
(c) Find out the number of questions correctly answered by all learners in the
class.
This illustrates that the marks obtained by learners in a test should be carefully
examined. It is not sufficient to just report the marks obtained. More information
should be given about the marks obtained and to do this you have to rely on
statistics. Some teachers may be afraid of statistics while others may regard it as
too time consuming. In fact, many of us often use statistics without being aware of
it. For example, when we talk about average rainfall, per capita income, interest
rates and percentage increases in our daily lives, we are talking the language of
statistics. What is statistics?

200  TOPIC 10 ANALYSIS AND INTERPRETATION OF TEST SCORES
Statistics is a mathematical science pertaining to the analysis, interpretation and

presentation of data. It is applicable to a wide variety of academic disciplines
from physical and social sciences to humanities. Statistics have been widely used
by researchers in education and classroom teachers. In applying statistics in
education, one begins with a population to be studied. Example of this could
be all Form Two learners in Malaysia which is about 450,000 learners or all
secondary school teachers in the country. For practical reasons, rather than
compiling data about an entire population we usually select or draw a subset of
the population called a sample. In other words, the 40 Form Two learners that you
teach is a sample of the population of Form Two learners in the country. The data
you collect about the learners in your class can be subjected to statistical analysis
which serves two related purposes, namely descriptive and inference.
(a) Descriptive Statistics

You use descriptive statistical techniques to describe how your learners
performed. For example, you use descriptive statistics techniques to
summarise data in a useful way either numerically or graphically. The aim is
to present the data collected so that it can be understood by teachers, school
administrators, parents, the community and the Ministry of Education. The
common descriptive techniques used are the mean or average and standard
deviation. Data may also be presented graphically using various kinds of
charts and graphs.
(b) Inferential Statistics

You use inferential statistical techniques when you want to infer about the
population based on your sample. You use inferential statistics when you
want to find out the differences between groups of learners, the relationship
between variables or when you want to make predictions about learner
performance. For example, you want to find out whether the boys did better
than the girls or whether there is a relationship between performance in
coursework and the final examination. The inferential statistics often used
are the t-test, ANOVA and linear regression.
10.2 DESCRIBING TEST SCORES

Let us assume that you have just given a test on Bahasa Malaysia to a class of
35 Form One learners. After marking the scripts, you have a set of scores for each
of the learners in the class and you want to find out more about how your learners
performed. See Figure 10.1 which shows the distribution of the scores obtained by
learners in the test.

Figure 10.1: The distribution of Bahasa Malaysia marks
The „frequency‰ column indicates how many learners scored for each mark shown
and the percentage is shown in the „percentage‰ column. You can describe these
scores using two types of measures, namely central tendency and dispersion.
(a) Central Tendency

The term „central tendency‰ refers to the „middle‰ value and is measured
using the mean, median and mode. It is an indication of the location of the
scores. Each of the three measures is calculated differently. Which one to use
will depend upon the situation and what you want to show.

(i) Mean
The mean is the most commonly used measure of central tendency.
When we talk about an „average‰, we usually refer to the mean. The
mean is simply the sum of all the values (marks) divided by the total
number of items (learners) in the set. The result is referred to as the
arithmetic mean. Using the data from Figure 10.1 and applying the
following formula, you can calculate the mean.
Mean 
 x  35  41  42  75  2148  53.22
N 35 40
(ii) Median
The median is determined by sorting the score obtained from lowest to
highest values and taking the score that is in the middle of the sequence.
For the example in Figure 10.1, the median is 52. There are 17 learners
with scores less than 52 and 17 learners whose scores are greater than
52. If there is an even number of learners, there will not be a single point
at the middle. In this case, you calculate the median by taking the mean
of the two middle points, that is, divide the sum of the two scores by 2.
(iii) Mode
The mode is the most frequently occurring score in the data set. Which
object appears the most often in your data set? In Figure 10.1, the mode
is 57 because 7 learners obtained that score. However, you can also have
more than one mode. If you have two modes it is bimodal.
Distribution of the scores may be graphed to demonstrate visually the

relations among the scores in a group. In such graphs, the horizontal
axis or x axis is the continuum on which the individuals are measured;
the vertical axis or y axis is the frequency (or the number) of individuals
earning any given score shown on the x axis. Figure 10.2 is a histogram
representing the scores for Bahasa Malaysia obtained by a group of 35
learners as indicated earlier in Figure 10.1.

Figure 10.2 shows a graph with the distribution of Bahasa Malaysia scores.
Figure 10.2: Graph showing the distribution of Bahasa Malaysia scores
SELF-CHECK 10.1
What is the difference between mean, median and mode?
(b) Dispersion
Although a mean tells us about the groupÊs average performance, it does not
tell us how close to the average or mean learners scored. For example, did
every learner score 80 per cent in the test or were the scores spread out from
0 to 100 per cent? Dispersion is the distribution of the scores. Among the
measures used to describe spread are range and standard deviation.

(i) Range
The range of scores in a test refers to the lowest and highest scores
obtained in the test. The range is the distance between the extremes of
a distribution.
(ii) Standard Deviation

Standard deviation refers to how much the scores obtained by learners
deviate or defer from the mean. Figure 10.3 is a set of scores obtained
by 10 learners in a Science test.
Figure 10.3: Scores in a Science test obtained by 10 learners
Based on the raw scores, you can calculate the standard deviation using the
formula given in the following.
 x  x 2 153
Standard Deviation   N 1

9
 17
 4.12
(a) The first step in computing the standard deviation is to find the mean which
is 390 divided by 10 = 39.
(b) Next is to subtract the mean from each score in the column labelled x  x .

(c) This is followed by the calculation in the column on the right labelled
( x  x )2 . Note that all numbers in this column are positive. The squared
differences are then summed and the square root calculated.
(c) The standard deviation is 4.12, which is the positive square root of 153
divided by 9.
To better understand what the standard deviation means, refer to Figure 10.4
which shows the spread of scores with the same mean but different standard
deviations.
(a) For Class A, with a standard deviation of 4.12, approximately 68%

(1 standard deviation) of learners scored between 34.88 and 43.12.
(b) For Class B, with a standard deviation of 2, approximately 68% (1 standard

deviation) of learners scored between 37 and 41.
(c) For Class C, with a standard deviation of 1, approximately 68% of learners

scored between 38 and 40.
Figure 10.4: Distribution of scores with varying standard deviations
Note that the smaller the standard deviations, the greater the scores tend to
„bunch‰ around the mean and vice versa. Hence, it is not enough to just examine
the mean alone because the standard deviation tells us a lot about the spread of
the scores around the mean. Which class do you think performed better? The mean
does not tell us which class performed better. Class C performed the best because
approximately two thirds ( 2 3 ) of the learners scored between 38 and 40.

SELF-CHECK 10.2
1. What is the difference between range and standard deviation?
2. What is the difference between a standard deviation of 2 and a

standard deviation of 5?
Skew
Skew refers to the symmetry of a distribution. A distribution is skewed if one of
its tails is longer than the other. Refer to Figure 10.5 which shows the distribution
of the scores obtained by 38 learners on a History test. There is a negative skew
because it has a longer tail in the negative direction. What does it mean? It means
that more learners were getting high scores in the history test which may indicate
that either the test was too easy or the teaching methods and materials were
successful in bringing about the desired learning outcomes.
Figure 10.5: Negative skew

Figure 10.6 illustrates the distribution of the scores obtained by 38 learners in a

Biology test. There is a positive skew because it has a longer tail in the positive
direction. What does it mean? It means that more learners were getting low scores
in the Biology test which indicates that the test was too difficult. Alternatively, it
could also mean that the questions were not clear or that the teaching methods and
materials did not bring about the desired learning outcomes.
Figure 10.6: Positive skew
SELF-CHECK 10.3
A teacher administered an English test to 10 children in her class. The
children earned the following marks: 14, 28, 48, 52, 77, 63, 84, 87, 90 and
98. For the distribution of marks, find the following:
(a) Mean
(b) Median
(c) Range
(d) Standard deviation

10.3 STANDARD SCORES

Having given a test, most teachers report the raw scores obtained by learners. Say
for example, Zulinda, a Form Four learner earned the following scores at the end
of semester examination:
(a) 80 for Science;
(b) 72 for History; and
(c) 40 for English.
With just the raw scores, what can you say about ZulindaÊs performance on these
tests or her standing in the class? Well, actually not very much. Without knowing
how these raw scores compare to the total distribution of raw scores for each
subject, it is difficult to draw any meaningful conclusions regarding her relative
performance in each of these tests.
How do you make these raw scores more meaningful?
(a) Assume that the score of all three tests are approximately normally
distributed.
(b) The mean and standard deviation of the three tests are as follows:
(i) Science: Mean = 90 and Standard deviation = 10
(ii) History: Mean = 60 and Standard deviation = 12
(iii) English: Mean = 40 and Standard deviation = 15
Based on the additional information, what statements can you make regarding
ZulindaÊs relative performance in each of the three tests? The following are some
conclusions you can make:
(a) Zulinda scored the best in the History test and her raw score of 72 falls at a
point one standard deviation above the mean;
(b) Her next best score is English and her raw score of 40 falls exactly at the mean
of the distribution of the scores; and
(c) Finally, even though her raw score for Science was 80, it falls one standard
deviation below the mean.

Raw scores, like Zulinda's scores, can be converted to two types of standard scores
which are the Z score and T score.
(a) Z Score
Converting ZulindaÊs raw scores into „z scores‰, we can say that she
achieved a:
(i) Z score of +1 for History
(ii) Z score of 0 for English
(iii) Z score of ă1 for Science.
What is a z score? How do you compute the z score? A z score is a type of

standard score. The term standard score is the general name for converting a
raw score to another scale using a predetermined mean and a predetermined
standard deviation. Z scores tell how many standard deviations away from
the mean the score is located. Z scores can be positive or negative. A positive
z score indicates that the value is above the mean while a negative z score
indicates that the value is below the mean. A z score is a raw score that has
been transformed or converted to a scale with a predetermined mean of 0
and a predetermined standard deviation of 1. A z score of ă6 means that the
score is 6 standard deviations below the mean.
The formula used for transforming raw scores into z scores involves
subtracting the mean from the raw score and dividing it by the standard
deviation.
x x
z
SD
Let us use this formula to convert KumarÊs marks of 52 obtained in a

Geography test. The mean for the test is 70 and the standard deviation is 7.5.
x x 52  70 18
z    2.4
SD 7.5 7.5
The z score computed for the raw score of 52 is ă2.4 which means that
KumarÊs score for the Geography test is located 2.4 standard deviations
below the mean.

Example: Using the Z Score to Make Decisions

A teacher administered two Bahasa Malaysia tests to learners in Form Four
A, Form Four B and Form Four C. The two top learners in Form Four C are
Seng Huat and Mei Ling. The teacher was planning to give a prize for the
best learner in Bahasa Malaysia in Form Four C but was not sure who is the
better learner.
Test 1 Test 2
Seng Huat 30 50
Mei Ling 45 35
Mean 42 47
Standard Deviation 7 8
The teacher could use the mean to determine who is better. But both learners
have the same mean. How does the teacher decide? By using the z score, the
teacher can know how far from the mean are the scores of the two learners
and thus who performed better. Using the formula above, the teacher
computes the z score shown in the following:
Test 1 Test 2 Total

30  42 50  47
Seng Huat  1.71  0.375 1.34
7 8
45  42 35  72
Mei Ling  0.43  1.50 1.07
7 8
Upon examination of the information in the table, the teacher finds that both
Seng Huat and Mei Ling have negative z scores for the total of both tests.
However, Mei Ling has a higher total z score (ă1.07) compared to Seng HuatÊs
total z score (ă1.34). In other words, Mei LingÊs total score was closer to the
mean and therefore the teacher concludes that Mei Ling did better than Seng
Huat.
Z scores are relatively simple to use but many educators are reluctant to use
it especially when test scores are reported as negative numbers. How would
you like to have your Mathematics score reported as ă4? For these reasons,
alternative standard score methods are used such as the T score.

(b) T Score
The T score was developed by W. McCall in the 1920s and is one of the many
standard scores currently being used. T scores are widely used in psychology
and education especially when reporting performance in standardised tests.
The T score is a standardised score with a mean of 50 and a standard
deviation of 10. The formula for computing the T score is:
T = 10(z) + 50
Say for example, a learner has a z score of ă1.0 and to convert it to T score:
T  10(z)  50  10(1.0)  50  (10)  50

 40
When converting z scores to T scores, you should be careful not to drop the
negatives. Dropping the negatives will result in a completely different score.
SELF-CHECK 10.4
Convert the following z scores to T scores:
z score T score
+1.0
ă2.4
+1.8
Why would you use T scores rather than z scores when reporting the
performance of students in the classroom?

10.4 THE NORMAL CURVE

The normal curve (also called the „bell curve‰) is a hypothetical curve that is
supposed to represent all natural occurring phenomena. In a normal distribution,
the mean, median and mode have the same value. It is assumed that if we were to
sample a particular characteristic such as the height of Malaysian men, you will
find the average height to be 5 feet 4 inches or 163 cm. However, there will be some
men who will be relatively shorter than the average height and an equal number
who will be relatively taller. By plotting the heights of all Malaysian men according
to frequency of occurrence, you can expect to obtain something similar to a normal
distribution curve.
Figure 10.7 shows a normal distribution curve for IQ based on the Wechsler
Intelligence Scale for Children. In a normal distribution, about two-thirds (⅔) of
individuals will have an IQ of between 85 and 115 with a mean of 100. According
to the American Association of Mental Retardation (2006), individuals who have
an IQ of less than 70 may be classified as mentally retarded or mentally challenged
and those who have an IQ of more than 130 may be considered as gifted.
Figure 10.7: A normal distribution curve of IQ based on the Wechsler Intelligence Scale
for Children
Similarly, test scores that measure a particular characteristic such as language

proficiency, quantitative ability or scientific literacy of a specific population can be
expected to produce a normal curve. The normal curve is divided according to
standard deviations (such as ă4s, ă3s ⁄⁄ +3s and 4s) which are shown on the
horizontal axis. The area of the curve between standard deviations is indicated

as a percentage on the diagram. For example, the area between the mean and
standard deviation +1 is 34.13%. Similarly, the area between the mean and
standard deviation ă1 is also 34.13%. Hence, the area between standard deviation
ă1 and standard deviation +1 is 68.26%. It means that in a normal distribution,
68.26% of individuals will score between standard deviations ă1 and +1.
In using the normal curve, it is important to make a distinction between standard

deviation values and standard deviation scores. A standard deviation value is a
constant and is shown on the horizontal axis in Figure 10.7. On the other hand, the
standard deviation score is the obtained score when we use the standard deviation
formula (which we discussed earlier). For example, if we obtained a standard
deviation equal to 5, then the score for 1 standard deviation is 5 and the score for
2 standard deviations is 10, the score for 3 standard deviations is 15 and so forth.
Standard deviation values of ă1, ă2, and ă3 will have corresponding negative
scores of ă5, ă10 and ă15.
Note that in Figure 10.7, z scores are indicated from + 1 to + 4 and ă1 to ă4 with the
mean as 0. Each interval is equal to one standard deviation. Similarly, T scores are
reported from 10 to 90 (interval of 10) with the mean set at 50. Each interval of 10
is equal to one standard deviation.
Ć The term „central tendency‰ refers to the „middle‰ value and is measured
using the mean, median and mode. It is an indication of the location of the
scores.
Ć The mean is simply the sum of all the values (marks) divided by the total
number of items (learners) in the set.
 The median is determined by sorting the score obtained from the lowest to
highest values and taking the score that is in the middle of the sequence.
 The mode is the most frequently occurring score in the data set.
Ć The range of scores in a test is the distance between the lowest score and the
highest score obtained in the test.
Ć Standard deviation refers to how much the scores obtained by learners deviate
or defer from the mean.
Ć Skew refers to the symmetry of a distribution. A distribution is skewed if one

of its tails is longer than the other.

Ć A negative skew has a longer tail in the negative direction.
Ć A positive skew has a longer tail in the positive direction.
Ć The standard score refers to raw score that has been converted from one scale
to another scale using the mean and standard deviation.
Ć Z scores indicate how many standard deviations away from the mean the score
is located.
Ć The T score is a standardised score with a mean of 50 and a standard deviation

of 10.
Ć The normal curve (also called the „bell curve‰) is a hypothetical curve that is
supposed to represent all natural occurring phenomena.
Central tendency Positive skew

Dispersion Range
Mean Standard deviation
Median Standard score
Mode T scores
Negative skew z scores
Normal curve

 ANSWERS 215
References
Advantages and disadvantages of rubrics. Retrieved from
https://engage.intel.com/ thread/11468
ASCD (2017). How to create and use rubrics for formative assessment. Retrieved
from http://www.ascd.org/publications/books/112001/chapters/What-
Are-Rubrics-and-Why-Are-They-Important%C2%A2.aspx
Assessing project-based learning. Retrieved from

http://pblmm.k12.ca.us/PBLGuide/AssessPBL.html
Authentic assessment in the classroom. Retrieved from

https://www.teachervision.com/teaching-methods-and-management/
educational-testing/4911.html
Blank, W., & Harwell, S. (1997). Connecting high school with the real world.
ERIC Document Reproduction Service No. ED407586 blank.
British Columbia Institute of Technology (2010). Developing checklists and

rating scales. Retrieved from
http://www.northernc.on.ca/leid/docs/ ja_developchecklists.pdf
Brookhart, S. M. (1999). The art and science of classroom assessment: The missing
part of pedagogy. ASHE-ERIC Higher Education Report (Vol. 27, No.1).
Washington, DC: The George Washington University, Graduate. School of
Education and Human Development.
Cashin, W. E. (2014). Improving essay tests. USA: Kansas State University.

Retrieved from http://www.idea.ksu.edu/papers/Idea_Paper_17.pdf
Chard, S. C. (1992). The project approach: A practical guide for teachers.

Edmonton, Alberta: University of Alberta Printing Services.
Childs, R. A. (1989). Constructing achievement tests. ERIC Digest. (ED315426ERIC

Clearinghouse on Tests Measurement and Evaluation Washington DC.,
American Institutes for Research Washington DC. Retrieved from
http://www.ericdigests.org/pre-9213/classroom.htm
Chong, H. Y. Reliability and validity. Retrieved from

http://seamonkey.ed.asu.edu/~alex/teaching/assessment/reliability.html

216  REFERENCES
Connie Malamed. (2016). Writing multiple choice questions for higher order
thinking. Retrieved from
http://theelearningcoach.com/elearning_design/multiple-choice-
questions/
Concordia University. Advice on using authentic assessment in teaching.

Retrieved from
http://education.cu-portland.edu/blog/curriculum-instruction/tips-on-
how-to-use-authentic-assessment-as-a-teaching-strategy/
Frey, B. B. (2003). Matching questions. Retrieved from

http://www.specialconnections.ku.edu/cgi-
bin/cgiwrap/specconn/main.php?cat=assessment&section=main&subsect
ion=qualitytest/match
Frey, B. B. (2003). Multiple-choice questions. Retrieved from

ion=qualitytest/multiple
Frey, B. B. (2003). Table of specifications. Retrieved from

ion=qualitytest/table
Flinders University. (2017). What is a rubric? Retrieved from

https://www.flinders.edu.au/teaching/teaching-
strategies/assessment/grading/rubric.cfm
Gibson, I. (2005) Designing projects for learning. Retrieved from

http://www.aishe.org/readings/2005-2/chapter3.pdf
Gramm, B. F. (1981). Constructing the essay exam. Retrieved from

http://www.studygs.net/tsttak4a.htm
Gramm, B. F. (1981). Rules for constructing essay questions. NY: Holt, Rinehart &
Winstoo. Retrieved from
http://www.personal.psu.edu/faculty/s/r/sra113/602/essayexams.htm
Griffith, K., & Nguyen, A. (2005). Are educators prepared to affect the affective
domain? Retrieved from
www.nationalforum.com/Electronic Journal Volumes/Griffith, Kimberly G

REFERENCES  217
Huitt, W. (2004). Bloom et al.Ês taxonomy of the cognitive domain. Educational

psychology interactive. Valdosta, GA: Valdosta State University. Retrieved
from http://chiron.valdosta.edu/whuitt/col/cogsys/bloom.html
Item analysis for multiple-choice items. ERIC Digest. Retrieved from

http://www.eric.ed.gov/ERICDocs/data/ericdocs2/content_storage_01/0
000000b/80/2a/26/4c.pdf
Item analysis. Special Connections. Retrieved from

ion=qualitytest/item
Kemp, J., & Teperof, D. (n.d.) Guidelines for portfolio assessment. Retrieved from
http://www.anglit.net/main/portfolio/default.html
Kohn, A. (2006). The trouble with rubrics. English Journal March 2006. Vol 95, no 4.
Kubiszyn, T., & Borich, G. (2003). Educational testing and measurement:

Classroom Application and practice. UAS: John Wiley & Sons
L. Brady (2012). Assessment and Reporting: Celebrating Student Achievement

(4th ed.). Pearson Australia.
Landsberger, J. F. (1996). Constructing True-False Items. Retrieved from

http://www.studygs.net/tsttak2a.htm
Larapinta Primary School. (2008). Thinking toolbox: Affective domain

(Krathwohl's Taxonomy). Retrieved from http://www.schools.nt.edu.au/
larapsch/affective.htm
Lewis, T. (1904). General intelligence: Objectively determined and measured.

American Journal of Psychology. 15. 201-293. Retrieved from
http://psychclassics.yorku.ca/Spearman
Lewis, T. (1916). The Uses of Intelligence Tests. First published in The

measurement of intelligence (chapter 1). Boston: Houghton Mifflin.
Retrieved from http://psychclassics.yorku.ca/Terman/terman1.htm
Lin, R. L., & Gronlund, N. E. (2000). Measurement and assessment in teaching.

New Jersey: Prentice Hall.

218  REFERENCES
Linquist, E. F. (1951). Preliminary Considerations in Objective Test Construction.

In E. F. Linquist (Ed). Educational Measurement (pp 4ă22). Washington
DC. American Council on Education.P
Long, C. (2006). Realising the potential of school-based assessment. Retrieved

from
http://www.iaea2006.seab.gov.sg/conference/download/papers/Realisin
g%20the%20potential%20of%20school%20based%20assessment.pdf
Mathematics (2008). Assessment in Mathematics. Retrieved from

http://www.learnalberta.ca/content/mewa/html/assessment/checklists.
html
Mean, median and mode discussion. Retrieved from

http://www.shodor.org/interactivate/discussions/sd1.html
Mueller, J. (2016). Authentic assessment toolbox. Retrieved from

http://jfmueller.faculty.noctrl.edu/toolbox/howstep1.htm
Pennekamp, M. (n.d.) A guide to classroom behaviour and instruction. Retrieved

from http://www.humboldt.edu/~tha1/observ.html
Pennsylvania State University (2009). Affective domain taxonomy. Retrieved from

http://www.personal.psu.edu/staff/b/x/bxb11/Objectives/affective.htm
Performing item analysis. Retrieved from

http://www.alexandercsd.org/testing/performing_item_analysis.htm
Phillips, J. A., Ansary Ahmed, & Kuldip Kaur. (2005). Instructional design
principles in the development of an e-learning graduate course. Paper
presented at The International Conference in E-Learning. Bangkok,
Thailand
Practical work in science. Retrieved from

http://www.cie.org.uk/CIE/WebSite/UCLESData/Documents/IGCSE/
Sample%20Papers/Maths,%20Science%20&%20Technical/Science.pdf
Quick facts about the normal curve. Retrieved from

http://www.willamette.edu/~mjaneba/help/normalcurve.html
Randall, V. (2004). The psychomotor domain. Retrieved from

http://academic.udayton.edu/gender/01Unit/lesson01c.htm

REFERENCES  219
Reliability and validity of classroom tests. Retrieved from

http://www.ncrel.org/sdrs/areas/issues/methods/assment/as5relia.htm
Reliability and validity. Retrieved from

http://www.georgetown.edu/departments/psychology/researchmethods
/researchanddesign/validityandreliability.htm
Reynolds, C. R., Livingston, R. B., & Willson, V. (2006). Measurement and

assessment in education. USA: Pearson Education.
ScorePak. Item analysis. Retrieved from

http://www.washington.edu/oea/pdfs/resources/item_analysis.pdf
Skew. Retrieved from http://davidmlane.com/hyperstat/A11284.html
Stanley, G., & Hopkins, D. (1972). Introduction to educational measurement and

testing. Boston: Macmillan.
Svinicki, M. (2002). Test construction: some practical ideas. Retrieved from

http://www.utexas.edu/academic/cte/sourcebook/tests.pdf
T-score. Retrieved from

http://www.sadovsky.wustl.edu/Tscore/Introduction.html
University of Mississippi. (2009). Psychomotor domain. Retrieved from

http://www.olemiss.edu/depts/educ_school2/docs/stai_manual/manual
10.htmUniversity of North Texas Health Science Center UNTHSC Scholarly
Repository (2012). Designing Multiple Choice Questions to Measure Higher
Order Thinking. Retrieved from
http://digitalcommons.hsc.unt.edu/cgi/viewcontent.cgi?article=1009&con
text=test_items
University of Washington (2017). Constructing tests. Retrieved from

http://www.washington.edu/teaching/constructing-tests/
University of Wisconsin. Exam questions types and student competencies.

Retrieved from
http://wiscweb3.wisc.edu/teachingacademy/Assistance/course/mult.htm
Validity. Retrieved from

http://www.ihs.ox.ac.uk/sepho/publications/carrhill/vii/7-4-2.htm
Vander Ark, T. (2013). What is performance assessment? Retrieved from

http://gettingsmart.com/2013/12/performance-assessment/

220  REFERENCES
Whitehouse, M. (2013). The importance of assessment in curriculum planning.

Retrieved from
http://www.workingoutwhatworks.com/en-
GB/Magazine/2015/2/Assessment_and_curriculum_planning
Wiliam, D. (2011). Embedded formative assessment. Bloomington, IN: Solution

Tree Press.
Wiggins, G., & McTighe, J. (2005). Understanding by design. New Jersey: Pearson
Education.
Wiggins, Grant. (1990). The case for authentic assessment. Practical Assessment,
Research & Evaluation, 2(2). Retrieved July 14, 2016 from
http://PAREonline.net/getvn.asp?v=2&n=2 .
Z score explained. Retrieved from

http://www.sysurvey.com/tips/statistics/
zscore.htm

MODULE FEEDBACK
MAKLUM BALAS MODUL
If you have any comment or feedback, you are welcome to:
1. E-mail your comment or feedback to modulefeedback@oum.edu.my
OR
2. Fill in the Print Module online evaluation form available on myINSPIRE.
Thank you.
Centre for Instructional Design and Technology

(Pusat Reka Bentuk Pengajaran dan Teknologi )
Tel No.: 03-27732578
Fax No.: 03-26978702


HPGD2303 Educational Assessment - Caug17 (Bookmark) PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

HPGD2303 Educational Assessment - Caug17 (Bookmark) PDF

Uploaded by

Copyright:

Available Formats

HPGD2303

Copyright © Open University Malaysia (OUM)

Copyright © Open University Malaysia (OUM)

Module Writer: Prof Dr John Arul Phillips

Enhancer: Mr Yap Yee Khiong

Moderator: Prof Dr Kuldip Kaur

Adapted by: Assoc Prof Dr Chung Han Tek

Developed by: Centre for Instructional Design and Technology

First Edition, April 2011

Copyright © Open University Malaysia (OUM)

Course Assignment Guide xxiăxxv

Topic 1 The Role of Assessment in Teaching and Learning 1

Topic 2 Foundation for Assessment: What to Assess 17

Topic 3 Planning the Classroom Test 41

Copyright © Open University Malaysia (OUM)

Topic 4 Constructing Objective Test Items 55

Topic 5 Constructing Essay Questions 81

Topic 6 Authentic Assessments 107

Copyright © Open University Malaysia (OUM)

6.4 Assessment Tools 118

Topic 7 Project and Portfolio Assessments 125

Topic 8 Test Reliability and Validity 157

Copyright © Open University Malaysia (OUM)

Topic 9 Appraising Classroom Tests and Item Analysis 177

Topic 10 Analysis and Interpretation of Test Scores 198

Copyright © Open University Malaysia (OUM)

Copyright © Open University Malaysia (OUM)

What will You Gain from this Course? x

How can You Get the Most Out of this Course? xi

How will You Be Assessed? xiv

What Support will You Get in Studying this Course? xv

How should You Study for this Course? xvi

Copyright © Open University Malaysia (OUM)

WELCOME TO HPGD2303 EDUCATIONAL

WHAT WILL YOU GAIN FROM THIS COURSE?

Aim of the Course

Copyright © Open University Malaysia (OUM)

1. Identify the different principles and theories of educational testing and

2. Compare the different procedures of educational testing and assessment;

3. Apply the different principles and theories in the development of

4. Critically evaluate the quality of different assessment procedures.

HOW CAN YOU GET THE MOST OUT OF THIS

1. The Course Guide you are currently reading;

2. The Course Content (consisting of 10 topics);

3. The Course Assessment Guide (which describes the assignments to be

4. The Course online platform i.e. MyINSPIRE Online Discussion.

Copyright © Open University Malaysia (OUM)

Topic Title Week

Topic 1 discusses the differences between testing, measurement, evaluation and

Topic 2 discusses the behaviours to be tested focussing on cognitive, affective

Copyright © Open University Malaysia (OUM)

Topic 6 introduces a form of assessment in which learners are assigned to

Topic 7 discusses in detail two examples of authentic assessments, namely

Text Arrangement Guide

Copyright © Open University Malaysia (OUM)

Self-Check: This component of the module is inserted at strategic locations

Activity: Like Self-Check, the Activity component is also placed at various

References: The References section is where a list of relevant and useful

HOW WILL YOU BE ASSESSED?

Copyright © Open University Malaysia (OUM)

WHAT SUPPORT WILL YOU GET IN STUDYING

MyINSPIRE Online Discussion

Ć You have difficulty with the self-tests and activities; or