Educational Assessment
Course Outcomes
By the end of this course, you should be able to:
Please ensure that you have all of these materials and the correct URL.
Course Synopsis
To enable you to achieve the four course outcomes, the course content has been
divided into ten topics. Specific learning outcomes are stated at the start of each
topic indicating what you should be able to achieve after completing each topic.
Topic 3 provides some useful guidelines to help teachers plan valid, reliable and
useful assessments. The discussion includes determining what is to be measured
and minimising measurement irrelevancies. The topic will also guide teachers to
devise strategies to measure the domain well. An example of the Table of
Specifications, a two-way table, is presented.
Topic 4 discusses the design and development of objective tests in the assessment
of various kinds of behaviours with emphasis on the limitations and advantages
of using this type of assessment tool.
Topic 5 examines the role of essay tests in assessing various kinds of learning
outcomes, their strengths and limitations, and the procedures involved in the
design of good essay questions.
Topic 8 focuses on basic concepts of test reliability and validity. The topic also
includes methods to estimate the reliability of a test and factors that increase the
reliability and validity of a test.
Topic 9 examines the concept of item analysis and the different procedures for
establishing the effectiveness of objective and essay-type tests focussing on item
difficulty and item discrimination. The topic concludes with a brief explanation
of item banks.
Topic 10 focuses on the analysis and interpretation of the data collected by tests.
For quantitative analysis of data, various statistical procedures are used. Some of
the statistical procedures used in the interpretation and analysis of assessment
results are measures of central tendency and correlation coefficients.
Learning Outcomes: This section refers to what you should achieve after you
have completely covered a topic. As you go through each topic, you should
frequently refer to these learning outcomes. By doing this, you can continuously
gauge your understanding of the topic.
Summary: You will find this component at the end of each topic. This component
helps you to recap the whole topic. By going through the summary, you should
be able to gauge your knowledge retention level. Should you find points in the
summary that you do not fully understand, it would be a good idea for you to
revisit the details in the module.
Key Terms: This component can be found at the end of each topic. You should go
through this component to remind yourself of important terms or jargon used
throughout the module. Should you find terms here that you are not able to
explain, you should look for the terms in the module.
Facilitator
Your facilitator will mark your assignments and assist you during the course. Do
not hesitate to discuss during the tutorial sessions or online if:
• You do not understand any part of the course content or the assigned
readings;
Library Resources
The Digital Library has a large collection of books and journals which you can
access using your learner ID.
(a) The most important step is to read the contents of this Course Guide
thoroughly.
(b) Organise a study schedule. Note the time you are expected to spend
on each topic and the date for submission of assignments as well as
seminar and examination dates. These are stated in your Course
Assessment Guide. Note down all this information in one place such
as your diary or a wall calendar. Jot down your own dates for
working on each topic. You have some flexibility as there are 10 topics
spread over a period of 14 weeks.
(c) Once you have created your own study schedule, make every effort to
"stick to it". The main reason learners are unable to cope is that
they lag behind in their coursework.
• Read the introduction (to see how it connects with the previous
topic).
• Work out the activities stated (to see if you can apply the concepts
learned to real-world situations).
(f) When you have completed the topic, review the learning outcomes to
confirm that you have achieved them and are able to do what is
required.
(g) If you are confident, you can proceed to the next topic. Proceed topic
by topic through the course and try to pace your study so that you
keep to your planned schedule.
(h) After completing all topics, review the course and prepare yourself for
the final examination. Check that you have achieved all the topics'
learning outcomes and the course objectives (listed in this Course
Guide).
FINAL REMARKS
Once again, welcome to the course. To maximise your gain from this course you
should try at all times to relate what you are studying to the real world. Look at
the environment in your organisation and ask yourself whether the ideas
discussed apply. Most of the ideas, concepts and principles you learn in this
course have practical applications. It is important to realise that much of what we
do in education and training has to be based on sound theoretical foundations.
The contents of this course merely address the basic principles and concepts of
assessment in education. You are advised to go beyond the course and continue
with lots of self-study to further enhance your knowledge on educational
assessment.
We wish you success with the course and hope that you will find it interesting,
useful and relevant in your development as a professional. We hope you will
enjoy your experience with OUM and we would like to end with a saying by
Confucius – "Education without thinking is labour lost".
INTRODUCTION
This guide explains the basis on which you will be assessed in this course during
the semester. It contains details of the facilitator-marked assignments, final
examination and participation required for the course.
One element in the assessment strategy of the course is that all learners and
facilitators should be provided with the same information about how the learners
will be assessed. Therefore, this guide also contains the marking criteria that
facilitators will use in assessing your work.
Please read through the whole guide at the start of the course.
ACADEMIC WRITING
(a) Plagiarism
(i) What Is Plagiarism?
Any written assignment (essays, project, take-home tests and others)
submitted by a learner must not be deceptive with regard to the
abilities, knowledge or amount of work contributed by the learner.
There are many ways that this rule can be violated. Among them are:
(c) Referencing
All sources that you cite in your paper should be listed in the Reference
section at the end of your paper. Here is how you should list your
references:
You stated your arguments clearly with supporting evidence and proper
referencing of sources; and
INTRODUCTION
The topic discusses the difference between tests, measurement, evaluation and
assessment, the roles of assessment in teaching and learning, and some general
principles of assessment. Also explored is the difference between formative and
summative assessments as well as the difference between criterion-referenced
and norm-referenced tests. The topic concludes with a brief discussion on the
current trends
in assessment.
Figure 1.1: Can you differentiate between tests, measurement, evaluation and
assessment?
(a) Tests
Most people are familiar with tests because all of us, at some point in our
lives, have taken some form of tests. In school, tests are given to measure
our academic aptitude and indirectly to evaluate whether we have gained
from the teaching by the teacher. At the workplace, tests are conducted
to select suitable persons for specific jobs, tests are used as the basis for
job promotions and tests are used to encourage re-learning.
While most people know what a test is, many have difficulty differentiating
between measurement, evaluation and assessment. Some have even argued
that they are similar!
(b) Measurement
Measurement is the act of assigning numbers to a phenomenon. In
education, it is the process by which the attributes of a person are measured
and assigned numbers. Remember it is a process, indicating there are
certain steps involved!
Measurements are useful depending on the accuracy of the instruments we use and our skill at
using them. For example, we measure temperature using a thermometer,
and the thermometer is the instrument used.
For example, some authors use the term "formative evaluation" while
others use the term "formative assessment". We will use the two terms
interchangeably because there is too much overlap in the interpretations
of the two concepts. Generally, assessment is viewed as the process
of collecting information with the purpose of making decisions about
students.
Answer: Whether they can remember what I taught them and are able to solve
problems.
Answer: Well, I provide students with the right answers and point out the
mistakes made when answering the questions.
The above could be the reasons educators give when asked about the purpose of
assessment. In the context of education, assessment is performed to gain an
understanding of an individual's strengths and weaknesses in order to make
appropriate educational decisions. The best educational decisions are based on
information and better decisions are usually based on more information (Salvia &
Ysseldyke, 1995). Based on the reasons for assessment provided by Harlen (1978)
and Deale (1975), two main reasons are identified (see Figure 1.2).
(i) Diagnosis
Diagnostic evaluation or assessment is performed at the beginning of
a lesson or unit for a particular subject area to assess students'
readiness and background for what is about to be taught. This pre-
instructional assessment is done when you need information on a
particular student, group of students or a whole class before you can
proceed with the most effective instructional method. For example,
you could administer a Reading Test to Year One students to assess
their reading level. Based on the information, you may want to assign
weak readers for special intervention or remedial action. On the other
hand, the test might reveal that some students are reading at an
exceptionally high level and you might want to recommend that they
be assigned to an enrichment programme (refer to Table 1.1).
Types of Decisions               Questions to Be Answered

To Help Learning:
Diagnosis for remedial action    Should the student be sent for remedial classes so that
                                 difficulty in learning can be overcome?
Diagnosis for enrichment         Should the student be provided with enrichment
                                 activities?
Exceptionality                   Does the student have special learning needs that
                                 require special education assistance?
Selection                        Should the student be streamed into X or Y class?
Progress                         To what extent is the student making progress toward
                                 specific instructional goals?
Communication to parents         How is the child doing in school and how can parents
                                 help?
Certification                    What are the strengths and weaknesses in the overall
                                 performance of a student in specific areas assessed?
Administration and counselling   How is the school performing in comparison to other
                                 schools? Why should students be referred to
                                 counselling?
(ii) Exceptionality
Assessment is also conducted to make decisions on exceptionality.
Based on information obtained from the assessment, teachers may
make decisions as to whether a particular student needs to be
assigned to a class with exceptional students. Exceptional students are
students who are physically, mentally, emotionally or behaviourally
different from the normal population. For example, based on the
assessment information, a child who is found to be dyslexic may be
assigned for special treatment or a student who has been diagnosed to
be learning disabled may be assigned for special education.
(iii) Certification
Certification is perhaps the most important reason for assessment. For
example, the Sijil Pelajaran Malaysia (SPM) is an examination aimed
at providing students with a certificate. The scores obtained are
converted into letter grades signifying performance level in various
subject areas and used as a basis for comparison between students.
The certificate obtained is further used for selecting students for
further studies, scholarships or jobs.
(iv) Placement
Besides certification, assessment is also conducted for the purpose of
placement. Students are endowed with varying abilities and one of
the tasks of the school is to place them according to their aptitude and
interests. For example, performance in the Pentaksiran Tingkatan 3
(PT3) (previously Penilaian Menengah Rendah) is used as the basis for
placing students in the arts or science stream. Assessment is also used
to stream students according to academic performance. It has been the
tradition that the "A" and "B" classes consist of high achievers based
on their results in the end of semester or end of year examinations.
Placement tests have even been used in preschools to stream children
according to their literacy levels! The practice of placing students
according to academic achievement has been debated for decades
with some educationists arguing against it while others supporting its
merits.
ACTIVITY 1.1
Assessment data may also provide insights into why some teachers are
more successful in teaching a particular group of students while others are
less successful.
ACTIVITY 1.2
To what extent have you used assessment data to review your teaching-
learning strategies? Discuss this with your course mates.
For example, if a teacher observes that some students still have not grasped
a concept, he may design a review activity or use a different instructional
strategy. Likewise, students can monitor their progress with periodic
quizzes and performance tasks. The results of formative assessments are
used to modify and validate instruction. In short, formative assessments are
ongoing and include reviews and observations of what is happening in the
classroom.
Formative assessments are generally low stakes, which means that they
have minimal or no point value.
"When the cook tastes the soup, that's formative evaluation; when the
guests taste the soup, that's summative evaluation"
– Robert Stake
Summative assessments are often high stakes, which means that they have
a high point value.
However, when the information is used formatively, the test results can provide
an important source of detailed, individualised feedback identifying where each
student needs to deepen their understanding and improve their recall of the
knowledge they have learned.
The more teachers know about individual students as they engage in the learning
process, the better teachers can adjust instruction to ensure that all students
continue to move forward in their learning.
(a) Summative data reveals how the students performed at the end of a
learning programme, namely advanced, proficient, basic or below basic.
For example, if a student has scored below basic in the semester exam and
exhibits signs of a struggling student, the teacher may want to place the
student at the front of the class so that the teacher can easily access the
student when the student needs extra support;
The data that is collected using a summative assessment can help teachers and
schools make decisions based on the instruction that has already been completed.
This contrasts with formative assessment, whereby formative assessment can
help teachers and students during the instruction process. It is important to
understand the difference between the two, as both assessments can play an
important role in education.
SELF-CHECK 1.1
(d) Subject areas and courses now state the expectations in assessment more
explicitly, specifically the kinds of performance required from students
when they are assessed. This is unlike earlier practices where
assessment was so secretive that students had to figure out for themselves
what was required of them;
(e) An understanding of the process is now seen as, at least, equally important
to the knowledge of facts. This is in line with the general shift from
product-based assessment towards process-based assessment; and
Easing up on Exams

Putrajaya: Reducing the number of examination subjects and having a
semester system are among the major changes being planned to make the
education system more holistic and less focussed on academic achievement.

Education Minister, Datuk Seri Hishammuddin Tun Hussein said that these
measures were in line with the Government's aim to reform the country's
education system. "We do not intend to abolish public or school-level
examinations totally, but we recognise that the present assessment system
needs to be looked at", he said.

Among the measures proposed are:

• Reducing the number of subjects in public examinations;
• Emphasising skills and abilities rather than focusing on content and
  achievement;
• Encouraging personal development through subjects like Art and Physical
  Education; and
• Improving teaching-learning methods by encouraging more project-based
  assignments.

He said that emphasis should be on individual accomplishments rather than
the school's performance in public examinations and also on highlighting the
individual's co-curricular achievements.
ACTIVITY 1.3
The major reason for norm-referenced tests is to classify students. These tests
are designed to highlight achievement differences between and among
students to produce a dependable rank order of students.
Assessment
Criterion-referenced test
Evaluation
Formative assessment
Measurement
Norm-referenced test
Summative assessment
Test
INTRODUCTION
If you were to ask a teacher what should be assessed in the classroom, the
immediate response would be, of course, the facts and concepts taught. They are
the facts and concepts found in science, history, geography, language, arts,
religious education and other similar subjects. However, the Malaysian
Philosophy of Education states that education should aim towards the holistic
development of the individual. Hence, it is only logical that the assessment system
should also seek to assess more than the acquisition of the facts and concepts of a
subject area. What about assessment of physical and motor abilities? What about
socio-emotional behaviours such as attitudes, interests, personality and so forth?
Do they not contribute to the holistic person?
In this topic, you will learn the types of learning outcomes that need to be assessed
in a curriculum. The topic will conclude with a brief explanation on how to plan a
Table of Specifications for a classroom test.
(e) Whether students are equipped with the abilities and attitudes that will
enable them "to contribute to the harmony and betterment of the family,
society and the nation at large".
To assess the three domains, one has to identify and isolate the behaviour that
represents these domains. When we assess we evaluate some aspects of the
learner's behaviour, for example, his ability to compare, explain, analyse, solve,
draw, pronounce, feel, reflect and so forth. The term "behaviour" is used broadly
to include the learner's ability to think (cognitive), feel (affective) and perform a
skill (psychomotor). For example, you have just taught about "The Rainforest of
Malaysia" and you would like to assess your students in their:
(a) Thinking – You might ask them to list the characteristics of the Malaysian
rainforest and compare it with the coniferous forest of Canada;
(b) Feelings (emotions, attitudes) – You could ask them to design an exhibition
on how students could contribute towards conserving the rainforest; and
(c) Skill – You could ask them to prepare satellite maps about the changing
Malaysian rainforest by accessing websites from the Internet.
ACTIVITY 2.1
When we assess, we do not assess the learner's store of the facts, concepts or
principles of a subject but rather what the learner is able to do with the facts,
concepts or principles of a subject area. For example, we evaluate the learner's
ability to compare facts, explain the concept, analyse a generalisation (or
statement) or solve a problem based on a given principle. In other words, we assess
the understanding or mastery of a body of knowledge based upon what the learner
is able to do with the contents of the subject. Let us look at two mechanisms used
to measure or assess cognitive learning, namely Bloom's Taxonomy and The
Helpful Hundred.
rewrite Newton's three laws of motion, explain in one's own words the
steps for performing a complex task and translate an equation into a
computer spreadsheet.
In 2001, Krathwohl and Anderson modified the original Bloom's Taxonomy (1956).
They identified and isolated the following list of behaviours that an assessment
system should address (see Table 2.2).
Note that the sequencing of some of the levels has been rearranged and renamed.
The first two original levels of "knowledge" and "comprehension" were replaced
with "remembering" and "understanding" respectively. The "synthesis" level was
renamed with the term "creating". Note that in the original taxonomy the sequence
was "synthesis" followed by "evaluation". In the modified taxonomy, the sequence
was rearranged to "evaluating" followed by "creating".
As you can see, the primary differences between the original and the revised
taxonomy are not in the listings or rewordings from nouns to verbs, or in the
renaming of some of the components, or even in the re-positioning of the last two
categories. The major differences lie in the more useful and comprehensive
additions of how the taxonomy intersects and acts upon different types and levels
of knowledge – factual, conceptual, procedural and metacognitive.
SELF-CHECK 2.1
(a) The belief that the development of appropriate feelings is the task of the
family and religion;
(b) The belief that appropriate feelings develop automatically from knowledge
and experience with content and do not require any special pedagogical
attention; and
However, affective goals are no more intangible than cognitive ones. Some have
claimed that affective behaviours develop automatically when specific
knowledge is taught while others argue that affective behaviours have to be
explicitly developed in schools (see Figure 2.4). Affective goals do not necessarily
take longer to achieve in the classroom than cognitive goals. All that is required is
to state a goal more concretely and in behavioural terms so that it can be
assessed and monitored.
There is also the belief that affective characteristics are private and should not be
made public. While people value their privacy, the public also has the right to
information. If the information gathered is needed to make a decision, then the
gathering of such information is not generally considered an invasion of privacy.
For example, if an assessment is used to determine whether a learner needs further
attention such as special education, then gathering such information is not an
invasion of privacy. On the other hand, if the information being sought after is not
relevant to the stated purpose, then gathering of such information is likely to be
an invasion of privacy.
Similarly, information about affective characteristics can be used for good or bad.
For example, if a mathematics teacher discovers a learner has a negative attitude
towards mathematics and ridicules that learner in front of the class, then the
information has been misused. But if the teacher uses the information to change
his instructional methods so as to help the learner develop a more positive attitude
towards mathematics, then the information has been used wisely. Krathwohl,
Bloom and Masia and their colleagues developed the affective domain in 1973,
which deals with emotional aspects such as feelings, values, appreciation,
enthusiasm, motivation and attitudes. The five major categories, listed from the
simplest behaviour to the most complex, are receiving, responding,
valuing, organisation and characterisation (see Figure 2.5).
Figure 2.5: Krathwohl, Bloom & Masia's taxonomy of affective learning outcomes
(a) Receiving
The behaviours at the receiving level require the learner to be aware of,
willing to hear and focus his or her attention. Examples of verbs describing
behaviours at the receiving level are ask, choose, describe, follow, give, hold,
locate, name, point to, reply and so forth.
(b) Responding
The behaviours at the responding level require the learner to be an active
participant, attend to and react to a particular phenomenon, be willing to
respond and gain satisfaction in responding (motivation). Examples of verbs
describing behaviours at the responding level are answer, assist, aid, comply
with, conform, discuss, greet, help, label, perform, practise, present, read,
recite, report, select, tell, and write.
(c) Valuing
This level relates to the worth or value a person attaches to a particular object,
phenomenon or behaviour. It ranges from simple acceptance to the more
complex state of commitment. Valuing is based on the internalisation of a set
of specified values while clues to these values are expressed in the learner as
overt behaviours and are often identifiable. Examples of verbs describing
behaviours at the valuing level are demonstrate, differentiate, explain,
follow, form, initiate, invite, join, justify, propose, read, report, select, share,
study and work.
(d) Organisation
At this level, a person organises values into priorities by contrasting different
values, resolving conflicts between them and creating a unique value system.
The emphasis is on comparing, relating and synthesising values. Examples
of verbs describing behaviours at the level of organisation are adhere to,
alter, arrange, combine, compare, complete, defend, explain, formulate,
generalise, identify, integrate, modify, order, organise, prepare, relate and
synthesise.
(i) Recognises the need for balance between freedom and responsible
behaviour;
(v) Creates a life plan in harmony with abilities, interests and beliefs; and
(vi) Prioritises time effectively to meet the needs of the organisation, family
and self.
(e) Characterisation
At this level, a person's value system controls his behaviour. The behaviour
is pervasive, consistent, predictable and most importantly, characterises the
learner. Examples of verbs describing behaviours at this level are act,
discriminate, display, influence, listen, modify, perform, practise, propose,
qualify, question, revise, serve, solve and verify.
(v) Revises judgment and changes behaviour in light of new evidence; and
(vi) Values people for what they are, not how they look.
Table 2.3 shows how the affective taxonomy may be applied to a value such as
honesty. It traces the development of an affective attribute such as honesty from
the "receiving" level up to the "characterisation" level where the value becomes a
part of the individual's character.
SELF-CHECK 2.2
ACTIVITY 2.2
1. The Role of Affect in Education
"Some say schools should only be concerned with content."
"It is impossible to teach content without teaching affect as well."
"To what extent, if at all, should we be concerned with the
assessment of affective outcomes?"
Discuss the three statements in the context of the Malaysian
education system. Share your answer with your course mates.
2. Select any TWO values from the list of 16 universal values and
design an affective taxonomy for each value as shown in Table 2.3.
Share your answer with your course mates.
(a) Perception
Perception is the ability to use sensory cues to guide motor activity. This
ranges from sensory stimulation through cue selection to translation.
Examples of verbs describing these types of behaviours are choose, describe,
detect, differentiate, distinguish, identify, isolate, relate and select.
(ii) Estimates where a ball will land after it is thrown and then moves to
the correct location to catch the ball;
(iii) Adjusts heat of the stove to the correct temperature through smell and
taste of food; and
(iv) Adjusts the height of the ladder in relation to the point on the wall.
(b) Set
It includes mental, physical and emotional sets. These three sets are
dispositions that predetermine a person's response to different situations
(sometimes called mindset). Examples of verbs describing "set" are begin,
display, explain, move, proceed, react, show, state and volunteer.
(d) Mechanism
This is the intermediate stage in learning a complex skill. Learned responses
have become habitual and the movements can be performed with some
confidence and proficiency. Examples of verbs describing "mechanism"
include assemble, calibrate, construct, dismantle, display, fasten, fix, grind,
heat, manipulate, measure, mend, mix and organise.
Note that many of the verbs are the same as for "mechanism", but they
will have adverbs or adjectives that indicate that the performance is
quicker, better and more accurate.
(f) Adaptation
Skills are well-developed and the individual can modify movement patterns
to fit special requirements. Examples of verbs describing "adaptation" are
adapt, alter, change, rearrange, reorganise, revise and vary.
(iii) Performs a task with a machine that it was originally not designed to
do (assuming that the machine is not damaged and there is no danger
in performing the new task).
(g) Origination
Origination is about creating new movements or patterns to fit a particular
situation or specific problem. Learning outcomes emphasise creativity based
upon highly developed skills. Examples of verbs describing "origination" are
arrange, build, combine, compose, construct, create, design, initiate, make
and originate.
SELF-CHECK 2.3
1. Explain the differences between adaptation and guided response
according to the Psychomotor Taxonomy of Learning Outcomes.
As a guide, Table 2.5 shows the allotment of time for each type of question.
Once your questions are developed, make sure that you include clear instructions
for the learners. For the objective items, specify that they should select one answer
for each item and indicate the point value of each question, especially if you are
allocating different weightage to different sections of the test. For essay items,
indicate the point value and suggested time to be spent on the item. We will
discuss different types of questions in more detail in Topics 3 and 4. If you are
teaching a large class with close seating arrangements and are giving an objective
test, you may want to consider administering several versions of your test to
minimise the opportunities for cheating. This is done by creating versions of your
test with the items arranged in a different order, as sketched below.
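The following is a minimal sketch of how such versions could be generated,
assuming the items are kept in a simple list; the item texts, version labels and
seed values here are made up for illustration:

    import random

    # Hypothetical item bank; in practice these would be the teacher's actual questions.
    items = [
        "Q: 7 + 8 = ?",
        "Q: 15 - 6 = ?",
        "Q: 4 x 3 = ?",
        "Q: 20 / 5 = ?",
    ]

    def make_version(label, seed):
        """Return one version of the test with the items in a seeded random order."""
        rng = random.Random(seed)   # a private generator, so each version is reproducible
        order = items[:]            # copy the list so the master ordering stays untouched
        rng.shuffle(order)
        lines = [f"Test version {label}"]
        lines += [f"{pos + 1}. {text}" for pos, text in enumerate(order)]
        return "\n".join(lines)

    # One version per seating block, e.g. alternating rows receive versions A and B.
    for label, seed in [("A", 1), ("B", 2)]:
        print(make_version(label, seed))
        print()

Because each version is built from a fixed seed, the same shuffled papers can be
regenerated later when preparing the answer keys.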
There are six levels in Bloom's taxonomy of cognitive learning outcomes with
the lowest level termed knowledge followed by five increasingly difficult
levels of mental abilities, which are comprehension, application, analysis,
synthesis and evaluation.
The six levels in the revised version of Bloom's taxonomy are remembering,
understanding, applying, analysing, evaluating and creating.
The five major categories of the affective domain from the simplest behaviour
to the most complex behaviour are receiving, responding, valuing,
organisation and characterisation.
The seven major categories of the psychomotor domain from the simplest
behaviour to the most complex are perception, set, guided response,
mechanism, complex overt response, adaptation and origination.
INTRODUCTION
In this topic we will focus on methods of planning classroom tests. Testing is part
of the teaching and learning process. The importance of planning and writing a
reliable, valid and fair test cannot be overstated. Designing tests is an
important part of assessing learners' understanding of course content and their
level of competency in applying what they have learned. Whether you use low-
stakes quizzes or high-stakes mid and final semester examinations, careful design of
the tests will help provide more calibrated results. Assessments should reveal how
well learners have learned what the teachers want them to learn, while
instruction facilitates that learning. Thus, solely conducting a summative
assessment at the end of a teaching programme is not sufficient. It is helpful to
think about assessing learners at every stage of the planning process. Identifying
ways in which to assess their learners helps teachers determine the most suitable
learning activities.
In this topic we will discuss the general guidelines applicable to most assessment
tools. Topics 4 and 5 will discuss in detail the objective and essay tests. Authentic
assessment tools such as projects and portfolios will be discussed in the respective
topics.
(a) Traditional paper and pencil or computer-based tests in the form of multiple-
choice, short answer or essay tests; and
Tests provide teachers with objective feedback as to how much learners have
learned and how well they understand the subject taught. Commercially published achievement
tests can provide, to some extent, evaluation of the knowledge levels of individual
learners but only limited instructional guidance in assessing a wide range of skills
taught in any given classroom.
Teachers know their learners. Tests developed by the individual teachers for use
in their own class are the most instructionally relevant. Teachers can tailor tests to
emphasise the information they consider important and to match the ability levels
of their learners. If carefully constructed, classroom tests can provide teachers with
accurate and useful information about the knowledge retained by their learners.
The key to this process is the test questions that are used to elicit evidence of
learning. Test questions and tasks are not just a planning tool; they also form an
essential part of the teaching sequence. Incorporating the tasks into teaching and
using the evidence of the learners' learning to determine what happens next in the
lesson is truly an embedded formative assessment.
"Sharing high quality questions may be the most significant thing we can do to
improve the quality of student learning" (Wiliam, 2011).
Tests can also serve a diagnostic purpose. In such cases, the test is used to provide
learners with insights into gaps in their current knowledge and skill sets.
Alternatively, tests can also be used to motivate learners to exhibit effective
studying behaviour.
The learning objectives that the teachers would like to emphasise will
determine not only what materials to include in the test but also the specific form
of the test.
division problems rapidly, consider giving a speed test. If it is important for
learners to understand how historical events affect one another, short answer or
essay questions might be appropriate. If it is important that learners remember
dates, multiple-choice or fill-in-the-blank questions might be appropriate.
A sample of the Table of Specifications shown in Table 3.1 has content down one
column and cognitive levels across the top. However, teachers could also arrange
the content across the top and the levels down the column. In this sample, the
teacher who prepared the table grouped the "Remembering" and "Understanding"
levels together. It is very likely that he believed straight recall was too simple
to be considered as real learning.
Content      Remembering and        Applying (%)    Analysing, Evaluating     Total (%)
             Understanding (%)                      and Creating (%)
Topic 1      15                     15              30                        60
Topic 2      10                     20              10                        40
Total        25                     35              40                        100
In the example shown in Table 3.2, the vertical columns on the left of the 2-way
table show a list of the topics covered in class and the amount of time spent on
those topics. The topics can also be further subdivided into subtopics such as
"Subtract two numbers without regrouping 2-digit numbers from a 2-digit
number" under the topic "Subtraction within the range of 1000".
The amount of time spent on the topics as shown in the column "Hours of
Interaction" can be used as a basis to compute the weightage or percentage and
the number of questions or items for each topic. For example, the teacher has spent
20 hours teaching the three topics, of which 6 hours are allotted to "Addition with
the highest total of 1000". Thus, 6 hours out of a total of 20 hours amounts to 30%,
or 9 items out of the 30 items planned by the teacher.
The teacher might have a reason for allocating 25%, 35% and 40% to the levels
"Remembering", "Understanding" and "Applying" respectively. Perhaps he is
trying to train his Year 2 learners to pay more attention to the "thinking" questions.
The 25% at the "Remembering" level is actually 7.5 questions, 35% at the
"Understanding" level is 10.5 questions and 40% at the "Applying" level is 12
questions. The total of 30 is not affected, as the numbers 7.5 and 10.5 are
conveniently rounded to 8 and 10 respectively.
The cells in the # columns can be arbitrarily filled or computed using a simple
formula. In the first # column, the topic "Addition ..." under the level
"Remembering" should be 25% × 9 = 2.25, the topic "Subtraction ..." under
"Remembering" is 25% × 9 = 2.25 and the topic "Multiplication ..." under
"Remembering" is 25% × 12 = 3. The teacher can then round the numbers 2.25,
2.25 and 3 to either 3, 2 and 3 or 2, 3 and 3 so that the column total is preserved.
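For teachers who prefer to automate this arithmetic, here is a minimal sketch in
Python using the hours and percentage weights from the example above (the topic
labels are shortened for readability). The largest-remainder rounding at the end
is one reasonable policy for keeping the totals intact, not a method prescribed by
this module:

    # Table of Specifications arithmetic, as described in the text.
    topic_hours = {"Addition": 6, "Subtraction": 6, "Multiplication": 8}  # 20 hours in all
    level_weights = {"Remembering": 0.25, "Understanding": 0.35, "Applying": 0.40}
    total_items = 30

    total_hours = sum(topic_hours.values())

    for topic, hours in topic_hours.items():
        # Items for this topic, in proportion to teaching time: e.g. 6/20 x 30 = 9 items.
        topic_items = round(total_items * hours / total_hours)
        # Raw cell values, e.g. 25% x 9 = 2.25 items at the "Remembering" level.
        raw = {level: weight * topic_items for level, weight in level_weights.items()}
        counts = {level: int(value) for level, value in raw.items()}   # round down first
        leftover = topic_items - sum(counts.values())
        # Give the leftover items to the cells with the largest fractional remainders.
        for level in sorted(raw, key=lambda l: raw[l] - counts[l], reverse=True)[:leftover]:
            counts[level] += 1
        print(topic, counts, "total:", sum(counts.values()))

Running the sketch reproduces the worked figures: 9 items each for the two 6-hour
topics, 12 items for the 8-hour topic and 30 items overall.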
The teacher, especially one who is newly trained, is advised to have this Table of
Specifications together with the subject syllabus reviewed by the subject matter
expert or the subject Head of Department to confirm whether the test plan would
actually measure what it set out to measure. When the test items have been drafted
and assembled, it is advisable to once again submit the draft test paper and the
Table of Specifications to the Head of Department or the recognised subject matter
expert to evaluate whether the test items do, in actual fact, assess the defined
content. Content validity is different from face validity. Face validity assesses
whether the test "looks valid" to the examinees who sit for the test whereas content
validity requires recognised subject matter experts to evaluate whether the test
items assess the defined content.
The Table of Specifications helps to ensure that there is a match between what is
taught and what is tested. From the example, we can see that classroom assessment
is driven by classroom teaching which in turn is driven by learning objectives.
SELF-CHECK 3.1
The determination of what it is that the teachers would like to measure with the
test should precede the determination of how they are going to measure it.
The teacher should make it a habit to write a model answer which can be easily
understood by others. This model answer can be used by other teachers who act
as external examiners, if need be. Besides, a rubric can also be an effective tool to
help the teacher or the external examiner.
Coordination should be done once the test scripts are collected. The teacher should
try to read some of the answers from the scripts and review the correct answers in
the marking scheme. The teacher may sometimes find that learners have
interpreted the test question in a way that is different from what is intended.
Learners may come up with excellent answers that may be slightly outside of what
was asked. Consider giving these learners marks accordingly.
The teacher should make a note in the marking scheme of any error made early in
an answer but carried through the rest of the answer. Marks should not be deducted
again if the rest of the response is sound apart from the carried-forward error.
A marking scheme may increase the efficiency of grading the test but it often
provides only limited information to promote learning. Besides the marking scheme, a
rubric can be prepared to act as an additional tool to help the teacher or the external
examiner. Many of the limitations of a marking scheme can be overcome by having
rubrics which contain carefully considered and clearly stated descriptions of levels
of performance.
General rubrics can help learners build up a concept of what it means to perform
a skill well. General rubrics do not "give away answers" to questions. Instead, they
contain descriptions such as "Explanation of reasoning is clear and supported by
appropriate details." Descriptions like this help learners focus on what their
learning target is supposed to be. They provide clarification to learners on how to
approach the project. Rubrics will be discussed in greater detail in Topic 6 of this
module.
(a) Use simple and brief instructions for each type of question;
(c) Write items that require specific understanding or ability developed in that
course, not just general intelligence or test-wiseness;
(d) Do not provide clues or suggest the answer to one question in the body of
another question;
(e) Avoid writing questions in the negative. If you must use negatives, highlight
them as they may mislead learners into answering incorrectly;
(g) Try, as far as possible, to construct your own questions. Check to make sure
the questions fit the learning objectives and requirements in the Table of
Specifications if you need to use questions from other sources; and
(d) Is the Material I Tested for Really What I Wanted Learners to Learn?
For example, if you had wanted learners to use analytical skills such as the
ability to recognise patterns or draw inferences but only used true-false
questions requiring non-inferential recall, you might try constructing more
complex true-false, or multiple-choice questions.
Learners should know what is expected of them. They should be able to identify
the characteristics of a satisfactory answer and understand the relative importance
of those characteristics. This can be achieved in many ways. For example, you can
provide feedback on tests, describe your expectations in class or post model
solutions on a class blog. Teachers are encouraged to make notes on the test scripts.
When test scripts are returned to the learners, the notes will help them understand
their mistakes and correct them.
SELF-CHECK 3.2
The first step in test planning is to decide on the purpose of the test. Tests can
be used for many different purposes.
The next step is to consider the learning objectives and the relative importance
of the learning objectives. Teachers will have to select the appropriate
knowledge and skills to be assessed and include more questions for more
important learning objectives.
The Table of Specifications describes the content, the behaviour of the learners
and the number of questions in the test, which corresponds to the number of hours
devoted to the learning objectives in class.
The Table of Specifications helps to ensure that there is a match between what
is taught and what is tested. Classroom assessment is driven by classroom
teaching which in turn is driven by learning objectives.
The test format used is one of the main driving factors in the learnersÊ learning
behaviour.
Preparing a rubric or marking scheme well in advance of the testing date will give
teachers ample time to review their questions and make changes to answers
when necessary.
INTRODUCTION
In this topic we will focus on using objective tests to assess various kinds of
behaviour in the classroom. Firstly, the discussion will be limited to the simple
forms of objective test items, namely short-answer item, true-false item and
matching item. Three types of objective tests are examined and guidelines for the
construction of each type of the tests are discussed. The advantages and limitations
of each of these types of objective tests are explained. Secondly, we will discuss the
multiple-choice item, a more complex form of objective test items. The discussion
will focus on the characteristics and uses of multiple-choice items, their advantages
and limitations and some suggestions for the construction of such items.
You can refer to Linn and Gronlund (1995) for more examples.
Now, let us look at the advantages and the limitations of this type of question.
Many short-answer questions can be set for a specific period of time. A test paper
consisting of short-answer questions is thus able to cover a fairly wide content of
the course to be assessed. A wide content coverage enhances the content validity
of the test.
Scoring of answers to the short-answer question can also pose a problem. Unless
the question is carefully phrased, learners can provide answers of varying degrees
of correctness. For example, the answer to a question such as "When was Malaysia
formed?" could either be "In 1963" or "On 16 September 1963". The teacher has to
decide whether learners who gave the partial answer have the same level of
knowledge as those who provided the complete answer. Besides, learners'
answers can also be contaminated by spelling errors. If spelling is taken into
consideration, the test scores of learners will reflect their level of knowledge of the
content assessed as well as their spelling ability. If spelling is not considered in the
scoring, the teacher has to decide whether the misspelled word actually represents
the correct answer.
(a) Word the question so that the intended answer is brief and specific.
As far as possible, the question should be phrased in such a way that only
one answer is correct.
For Example:
Better item: An animal that eats the flesh of other animals is classified as
___________.
For Example:
Possible answers for 1st item: a boat, the 15th century, a search for India
(c) If the problem requires a numerical answer, indicate the units in which the
answer is to be expressed.
(d) Do not include too many blanks for the completion item.
Blanks for answers should be equal in length. For the completion item, place
the blank near the end of the sentence.
SELF-CHECK 4.1
For Example:
True False
True-false questions can be quickly written and can cover a lot of content. True-
false questions are well suited for testing learner recall or comprehension. Learners
can generally respond to many questions, covering a lot of content in a fairly short
amount of time. From the teacherÊs perspective, these questions can be written
quickly and are easy to score. Because they can be objectively scored, the scores
are more reliable than for items that are at least partially dependent on the
teacherÊs judgment. Generally, they are easier to construct compared to multiple-
choice questions because there is no need to develop distractors. Hence, they are
less time consuming compared to constructing multiple-choice questions.
(a) Guessing – A learner has a one in two chance of guessing the correct answer
of a question. Scores on true-false items tend to be high because of the ease
of guessing the correct answers when the answer is not known. With only
two choices (true or false) the learner could expect to guess correctly on half
of the items for which correct answers are not known. Thus, if a learner
knows the correct answers to 10 questions out of 20 and guesses on the other
10, the learner could expect a score of 15 (see the sketch after this list). The teacher can anticipate scores
ranging from approximately 50 per cent for a learner who did nothing but
guess on all items to 100 per cent for a learner who knows the material.
(b) Because these items are in the form of statements, there is sometimes a
tendency to take quotations from the text, expecting the learner to recognise
a correct quotation or note a change (sometimes minor) in wording. There
may also be a tendency to include trivial or inconsequential material from
the text. Both of these practices are discouraged.
(d) True-false items provide little diagnostic information. Teachers can often get
useful information about learner errors and misconceptions by examining
learnersÊ incorrect answers but true-false items do not provide such
diagnostic information.
(e) True-false items may produce a negative suggestion effect. Some testing
experts feel that exposing false statements might promote learning false
information.
(f) False statements do not provide evidence that learners know the correct
answer.
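Returning to the guessing arithmetic in point (a), the expected score can be
written as: expected = known + ½ × (total − known). Here is a minimal sketch
using the numbers from that example (the function name is hypothetical):

    # Expected true-false score when a learner guesses blindly on every unknown item:
    # each guess has a 1-in-2 chance, so on average half the guesses are correct.
    def expected_tf_score(total_items, items_known):
        return items_known + 0.5 * (total_items - items_known)

    print(expected_tf_score(20, 10))  # 15.0 -- the example given earlier
    print(expected_tf_score(20, 0))   # 10.0 -- about 50% from guessing alone
    print(expected_tf_score(20, 20))  # 20.0 -- full marks when everything is known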
For Example:
(b) Use negative statements sparingly but avoid double negatives. Double
negatives tend to contribute to the ambiguity of the statement. Statements
with words like none, no and not should be avoided as far as possible.
For Example:
(c) Avoid broad, general statements. Most broad generalisations are false unless
qualified.
For Example:
(d) Avoid long complex sentences. Such sentences also test reading
comprehension besides the achievements to be measured.
For Example:
(e) Try using in combination with other materials such as graphs, maps and
written material. This combination allows for the testing of more advanced
learning.
(f) Avoid lifting statements directly from assigned reading, notes or other
course materials so that recall alone will not permit a correct answer.
(g) In general, avoid the use of words which would signal the correct response
to the test-wise learner. Absolutes such as "none", "never", "always", "all"
and "impossible" tend to be false while qualifiers such as "usually",
"generally", "sometimes" and "often" are likely to be true.
(h) A similar situation occurs with the use of "can" in a true-false statement. If
the learner knows of a single case in which something "can" be done, it
would be true.
(i) Ambiguous or vague statements and terms such as "large", "long time",
"regularly", "some" and "usually" are best avoided in the interest of clarity.
(j) Some terms have more than one meaning and may be interpreted differently
by individuals.
(k) True statements should be about the same length as false statements (there is
a tendency to add details in true statements to make them more precise).
(l) Word the statement so precisely that it can be judged unequivocally true or
false.
(n) Avoid verbal clues (specific determiners) that indicate the answer.
SELF-CHECK 4.2
For Example:
Directions: Column A contains statements describing selected Asian cities.
For each description find the appropriate city in Column B. Each city in Column
B can be used only once.
Column A                               Column B
1. The ancient capital of Thailand     A. Ayuthia
2. The largest city in Sumatera        B. Ho Chi Minh City
3. The capital of Myanmar              C. Karachi
4. Formerly known as Saigon            D. Medan
The learner reads a premise (Column A) and finds the correct response from
among those in Column B. The learner then prints the letter of the correct response
in the blank beside the premise in Column A. An alternative is to have the learner
draw a line from the correct response to the premise but this is more time
consuming to score. One of the ways to reduce the possibility of guessing the
correct answers is to list a larger number of responses (Column B) than premises
(Column A) as is done in the example. Another way to decrease the possibility of
guessing is to allow responses to be used more than once. Instructions to the
learners should be very clear about the use of responses.
(b) They can also assess a learner's ability to apply knowledge by requiring a
test-taker to match the following:
(d) Matching questions are generally easy to write and score when the content
tested and the objectives are suitable for a matching question.
(a) Matching questions are limited to the materials that can be listed into two
columns and there may not be much material that lends itself to such a
format;
(b) If there are four items in a matching question and the learners know the
answers for three of them, the fourth item is a give-away through
elimination;
(a) Provide clear instructions. They should explain how many times responses
can be used;
(c) Include more responses than premises or allow the responses to be used
more than once;
(e) Correct answers should not be obvious to those who do not know the content
being taught;
(f) There should not be keywords appearing in both a premise and response,
providing a clue to the correct answer; and
(g) All of the responses and premises for a matching item should appear on the
same page.
SELF-CHECK 4.3
ACTIVITY 4.1
1. Select five true-false questions in your subject area and analyse each
item using the guidelines discussed.
2. Select five matching questions in your subject area and analyse each
item using the guidelines discussed.
3. Suggest how you would improve the weak items for each type of
questions that you have identified.
Multiple-choice questions (MCQs) are the most difficult to prepare. These
questions have two parts, namely a stem which contains the question and four or
five options, one of which is the correct answer. The correct answer is called the
keyed response and the incorrect options are called distractors. The stem may be
presented as a question, direction or a statement while the options could be words,
phrases, numbers, symbols and so forth. Cruel as it may seem, the role of the
distractor is to attract the attention of respondents who are not sure of the
correct answer.
Now, let us look at what an MCQ consists of. It has the stem and also the options.
(a) Stem
The stem should:
(iv) Generally ask for one answer only (the correct or the best answer); and
(i) Have either four or five alternatives, all of which should be mutually
exclusive and not too long;
(iv) Contain the intended answer or keyed response, which should appear
clearly correct to the informed learner, while the distractors should
be definitely incorrect yet plausible to the uninformed.
SELF-CHECK 4.4
1. Why is the MCQ a popular form of objective test?
As stated earlier, MCQs are the most difficult to prepare. We need to focus on
writing the stem as well as providing the options or alternatives. All the options in
multiple-choice items need to be plausible but they also need to separate learners
of different ability levels. Table 4.1 shows some considerations that need to be
taken into account when constructing MCQs, particularly the stems.
(c) Use clear, straightforward language in the stem of the item. Questions that
are constructed using complex wordings may become a test of reading
comprehension rather than an assessment of whether the learner knows the
subject matter.

Poor item:
As the level of fertility approaches its nadir, what is the most likely
ramification for the citizenry of a developing nation?
A. A decrease in the labour force participation rate of women.
B. A downward trend in the youth dependency ratio.
C. A broader base in the population pyramid.
D. An increased infant mortality rate.

Better item:
A major decline in fertility in a developing nation is likely to produce
A. a decrease in the labour force participation rate of women.
B. a downward trend in the youth dependency ratio.
C. a broader base in the population pyramid.
D. an increased infant mortality rate.

Note: In the improved item, the word "nadir" is replaced with "decline" and
"ramification" is replaced with "produce", which are more straightforward words.

(d) Use negatives sparingly. If negatives must be used, capitalise, underscore
or bold them.

Poor item:
Which of the following is not a symptom of osteoporosis?
A. Decreased bone density.
B. Frequent bone fractures.
C. Raised body temperature.
D. Lower back pain.

Better item:
Which of the following is a symptom of osteoporosis?
A. Hair loss.
B. Painful joints.
C. Raised body temperature.
D. Decreased bone density.

Note: The better item is stated in the positive so as to avoid the use of the
negative "not".

(e) Put as much of the question in the stem as possible, rather than duplicating
material in each of the options.

Poor item:
Theorists of pluralism have asserted which of the following?
A. The maintenance of democracy requires a large middle class.
B. The maintenance of democracy requires autonomous centres of
   countervailing power.
C. The maintenance of democracy requires the existence of a multiplicity
   of religious groups.
D. The maintenance of democracy requires the separation of governmental
   powers.

Better item:
Theorists of pluralism have asserted that the maintenance of democracy requires
A. a large middle class.
B. the separation of governmental powers.
C. autonomous centres of countervailing power.
D. the existence of a multiplicity of religious groups.
(h) Avoid using ALWAYS and NEVER in the stem as test-wise learners are likely to
rule such universal statements out of consideration.
ACTIVITY 4.2
1. Select ten MCQs in your subject area and analyse the stem of each
item using the guidelines discussed.
Now, let us look at Table 4.2 which shows some considerations when constructing
the distractors for MCQs.
(d) Distractors based on common learner errors or misconceptions are very effective.
One technique for compiling distractors is to ask learners to respond to open-ended
short-answer questions, perhaps as formative assessments. Identify which incorrect
responses appear most frequently and use them as distractors for a multiple-choice
version of the question.
(e) Do not create distractors that are so close to the correct answer that they may confuse
learners who really know the answer to the question. "Distractors should differ from
the key in a substantial way, not just in some minor nuances of phrasing or
emphasis." (Isaacs, 1994)
(f) Provide a sufficient number of distractors.
You will probably choose to use three, four or five alternatives in an MCQ. Until
recently, it was thought that three or four distractors were necessary for the item
to be suitably difficult. However, a study by Owen and Freeman suggested that
three choices are sufficient (Brown, 1987). Clearly, the higher the number of
distractors, the less likely it is for the correct answer to be chosen through guessing
(provided all alternatives are of equal difficulty).
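The arithmetic behind this claim is straightforward: with k equally attractive options, blind guessing succeeds on 1/k of the items on average. The following minimal Python sketch makes the effect visible; the 40-item test length is an illustrative assumption.

For Example (Python):

    def chance_score(num_items: int, num_options: int) -> float:
        """Expected number of items answered correctly by blind guessing,
        assuming every option is equally attractive and guesses on
        different items are independent."""
        return num_items / num_options

    for options in (3, 4, 5):
        print(f"{options} options: expected chance score on 40 items = "
              f"{chance_score(40, options):.1f}")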
SELF-CHECK 4.5
1. Do you agree that teachers should not use negatives in the stems of
MCQs? Justify your answer.
2. Do you agree that teachers should avoid using distractors such as
"All of the above" and "None of the above"? Justify your answer.
3. Select ten MCQs in your subject area and analyse the distractors of
each item using the guidelines discussed. Suggest how you would
improve the weak items.
(f) Scores are more reliable than subjectively scored items (such as essays);
(h) Item analysis can reveal how difficult each item was and how well it
discriminates between strong and weaker learners in the class;
(i) Performance can be compared from class to class and year to year;
(j) Can cover a lot of material very efficiently (about one item per minute of
testing time); and
(k) Items can be written so that learners must discriminate among options that
vary in degree of correctness.
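To make the item analysis mentioned in (h) concrete, the following is a minimal Python sketch: item difficulty is the proportion of learners answering the item correctly, while discrimination compares the top and bottom scorers. The upper/lower 27% grouping is a common convention and, like the function names, an assumption for illustration; Topic 9 treats these indices in detail.

For Example (Python):

    def item_difficulty(responses: list[int]) -> float:
        """Proportion of correct answers (1 = correct, 0 = incorrect)."""
        return sum(responses) / len(responses)

    def discrimination_index(responses: list[int], totals: list[int],
                             fraction: float = 0.27) -> float:
        """Difficulty among top scorers minus difficulty among bottom
        scorers, using the given fraction of learners for each group."""
        ranked = [r for _, r in sorted(zip(totals, responses))]
        n = max(1, round(len(ranked) * fraction))
        return item_difficulty(ranked[-n:]) - item_difficulty(ranked[:n])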
While there are many advantages of using MCQs, there are also many limitations
in using such items, namely:
(c) MCQs are not effective for measuring problem-solving skills as well as the
ability to organise and express ideas;
(f) Learners can sometimes read more into the question than was intended;
(g) It often focuses on testing factual information and fails to test higher levels
of cognitive thinking;
(i) It places a high degree of dependence on learners' reading ability and the
constructor's writing ability;
Now, let us look at Table 4.3 which summarises the procedural rules for the
construction of MCQs.
• Test for important or significant information.
• Avoid trick items.
• Keep the vocabulary consistent with the learners' level of understanding.
• Avoid overly specific knowledge when constructing items.
• Avoid items based on opinions.
• Be sensitive to cultural, religious and gender issues.
• Keep options or alternatives independent and not overlapping.
• Avoid distractors that can provide clues to test-wiseness.
• Avoid giving clues through the use of faulty grammatical construction.
• Avoid the use of humour when developing options.
• Present practical or real-world situations to learners.
• Use pictorial materials that require learners to apply principles and concepts.
• Avoid textbook, verbatim phrasing when developing items.
• Use charts, tables or figures that require interpretation.
For Example:
Read the following comment that a teacher made about testing and then
answer the question.
Which of the following types of test is this teacher primarily talking about?
A. Aptitude test
B. Diagnostic test
C. Formative test
D. Summative test*
Adapted from:
https://www.learningsolutionsmag.com/articles/804/writing-multiple-choice-questions-for-higher-level-thinking
For Example:
Two 60-year-old male patients (P#1 and P#2) have Type 2 diabetes. Each has
a BMI of 27. The primary treatment for each patient is a diet to reduce blood
glucose levels.
What is the most likely reason why P#2 did not show a decline in glucose
level after three months?
Adapted from:
http://digitalcommons.hsc.unt.edu/cgi/viewcontent.cgi?article=1009&context=test_items
For Example:
Both Su and Ramya want to lose weight. Su goes on a low carbohydrate diet
while Ramya goes on a vegan diet. After six months, Su lost 30 pounds and
Ramya lost 15 pounds.
Adapted from:
http://digitalcommons.hsc.unt.edu/cgi/viewcontent.cgi?article=1009&context=test_items
SELF-CHECK 4.6
Objective tests vary depending on how the questions are presented. The four
common types of questions used in most objective tests are short-answer
questions, matching questions, true-false questions and multiple-choice
questions (MCQs).
The two forms of short-answer questions are direct questions and completion
questions.
True-false questions are those in which a statement is presented and the learner
indicates in some manner whether the statement is true or false.
True-false questions can be written quickly and are easy to score. Because they
can be objectively scored, the scores are more reliable than for items that are at
least partially dependent on the teacher's judgment.
Avoid lifting statements directly from assigned reading, notes or other course
materials so that recall alone will not permit a correct answer.
MCQs have two parts: a stem that contains the question and four or five
options, one of which is the correct answer (called the keyed response) while
the incorrect options are called distractors.
MCQs are widely used because they can be used to measure learning
outcomes, from simple to complex. They are highly structured with clear tasks
provided and able to test a broad sample of achievement.
MCQs are difficult to construct, tend to measure low level learning outcomes,
lend themselves to guessing and do not measure writing ability.
INTRODUCTION
In Topic 4, we discussed in detail the use of objective tests in assessing learners. In
this topic, we will examine the essay test. The essay test is a popular technique for
assessing learning and is used extensively at all levels of education. It is also widely
used in assessing learning outcomes in business and professional examinations.
Essay questions are used because they challenge learners to create their own
responses rather than merely selecting a response. Essay questions have the
potential to reveal learners' ability to reason, create, analyse and synthesise. These
skills may not be effectively assessed using objective tests.
Elaborating on this definition, Reiner, Bothell, Sudweeks and Wood (2002) argued
that to qualify as an essay question it should meet the following four criteria:
(a) The learner has to compose rather than select his response or answer. In
essay questions, learners have to construct their own answer and decide on
what material to include in their response. Objective test questions (MCQ,
true-false, matching) require learners to select the answer from a list of
possibilities.
(b) The response or answer provided by the learner consists of one or more
sentences. Learners do not respond with a simple "yes" or "no" but instead
have to respond in the form of sentences. In theory, there is no limit to the
length of the answer. However, in most cases its length is predetermined by
the demands of the question and the time limit allotted for the question.
(c) There is no one single correct response or answer. In other words, the
question should be composed so that it does not ask for one single correct
response. For example, the question, "Who killed J. W. W. Birch?" assesses
verbatim recall or memory and not the ability to think. Hence, it cannot
qualify as an essay question. You could modify the question to "Identify who
killed J. W. W. Birch and explain the factors that led to the killing." Now, it is
an essay question that assesses learners' ability to think and give reasons for
the killing supported with relevant evidence.
(d) The accuracy and quality of learners' responses or answers to essays must be
judged subjectively by a specialist in the subject matter. The nature of essay
questions is such that only specialists in the subject matter can judge to what
degree responses (or answers) to an essay are complete, accurate and relevant.
Good essay questions encourage learners to think deeply about their answers,
which can only be judged by someone with the appropriate experience and
expertise in the content area. Thus, content expertise is essential for both
writing and grading essay questions. For example, the question "List three
reasons for the opening of Penang by the British in 1789" requires learners to
recall a set list of items. The person marking or grading the essay does not
have to be a subject matter expert to know if the learner has listed the three
reasons correctly as long as the list of three reasons is available as an answer
key. For the question "To what extent is commerce the main reason for the
opening of Penang by the British in 1789?" a subject-matter expert is needed
to grade or mark an answer to this essay question.
ACTIVITY 5.1
Select a few essay questions that have been used in tests or examinations.
To what extent do these questions meet the criteria of an essay question
as defined by Stalnaker (1951) and elaborated by Reiner, Bothell,
Sudweeks and Wood (2002)?
(b) To assess thinking skills that require more than simple verbatim recall of
information by challenging learners to reason with their knowledge.
To determine what type of test (essay or objective) to use, it is helpful that you
examine the verb(s) that best describes the desired ability to be assessed. We have
discussed these verbs in Topic 2. These verbs indicate what learners are expected
to do and how they should respond. These verbs serve to channel and focus
learners' responses towards the performance of specific tasks. Some verbs clearly
indicate that learners need to construct rather than select their answer (for
example, to explain). Other verbs indicate that the intended learning outcome is
focused on learners' ability to recall information (for example, to list). Perhaps
recall is best assessed through objectively scored items. Verbs that test for
SELF-CHECK 5.1
Compare, explain, arrange, apply, state, classify, design, illustrate,
describe, complete, choose, defend, name.
(a) One purpose of testing is to assess a learner's mastery of the subject matter.
In most cases it is not possible to assess the learner's mastery of the complete
subject-matter domain with just a few questions. Because of the time it takes
for learners to respond to essay questions and for markers to mark learners'
responses, the number of essay questions that can be included in a test is
limited. Therefore, using essay questions will limit the degree to which the
test is representative of the subject-matter domain, thereby reducing content
validity. For instance, a test of 80 multiple-choice questions will most likely
cover more of the content domain than a test of 3 or 4 essay questions.
(b) Essay questions have reliability limitations. While essay questions allow
learners some flexibility in formulating their responses, the reliability of
marking or grading is questionable. Different markers or graders may vary
in their marking or grading of the same or similar responses (inter-scorer
reliability) and one marker can vary significantly in his marking or grading
consistency across questions depending on many factors (intra-scorer
reliability). Therefore, essay answers of similar quality may receive notably
different scores. Characteristics of the learner, length and legibility of
responses and personal preferences of the marker or grader with regard to
the content and structure of the responses are some of the factors that may
lead to unreliable marking or grading.
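One simple way to examine inter-scorer reliability is to correlate two markers' scores on the same set of scripts. A minimal Python sketch follows; the six essay scores are invented for illustration, and a real reliability study would involve many more scripts and often more markers.

For Example (Python):

    def pearson(x: list[float], y: list[float]) -> float:
        """Pearson correlation between two lists of scores."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
        sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
        return cov / (sd_x * sd_y)

    marker_1 = [14, 11, 17, 9, 15, 12]   # scores from the first marker
    marker_2 = [12, 10, 18, 11, 14, 10]  # scores from the second marker
    print(f"Inter-scorer reliability (r) = {pearson(marker_1, marker_2):.2f}")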
(c) Essay questions require more time for marking learner responses. Teachers
need to invest large amounts of time to read and mark learners' responses to
essay questions. On the other hand, relatively little or no time is required of
teachers for scoring objective test items such as the multiple-choice items and
matching exercises.
(d) As mentioned earlier, one of the strengths of essay questions is that they
provide learners with authentic experiences because learners are challenged
to construct rather than to select their responses. To what extent does the
short time given affect learner response? Learners have relatively little time
to construct their responses and it does not allow them to pay appropriate
attention to the complex process of organising, writing, and reviewing their
responses. In fact, in responding to essay questions, learners use a writing
process that is quite different from the typical process that produces excellent
writing (such as draft, review, revise and evaluate). In addition, learners
usually have no resources to aid their writing when answering essay
questions (such as a dictionary and thesaurus). These limitations may offset
whatever advantage accrued from the fact that responses to essay questions
are more authentic than responses to multiple-choice items.
(a) By their Very Nature Essay Questions Assess Higher-order Thinking (HOT)
Whether or not an essay item assesses HOT depends on the design of the
question and how learners' responses are scored. An essay question does not
automatically assess higher-order thinking skills (HOTS). It is possible to
write essay questions that simply assess recall. Also, if a teacher designs an
essay question meant to assess HOT but then scores learners' responses in a
way that only rewards recall ability, this means the teacher is not assessing
HOT. Teachers must be well-trained to design and write HOTS items.
(d) Essay Questions Benefit All Learners by Placing Emphasis on the Importance
of Written Communication Skills
Written communication is a life competency that is required for effective and
successful performance in many vocations. Essay questions challenge
learners to organise and express the subject matter and problem solutions in
their own words, thereby giving them a chance to practise written
communication skills that will be helpful to them in future vocational
responsibilities. At the same time, the focus on written communication skills
is also a serious disadvantage for learners who have marginal writing skills
but know the subject matter being assessed. For learners who are
knowledgeable in the subject matter but obtain low scores because of their
inability to write well, the validity of the test scores is compromised.
SELF-CHECK 5.2
ACTIVITY 5.2
Compare the following two essay questions and decide which one
assesses higher-order thinking skills. Discuss your answer with your
course mates.
(a) What are the major advantages and limitations of solar energy?
The following are specific guidelines that can help you improve existing essay
questions or create new ones:
(b) Avoid Using Essay Questions for Intended Learning Outcomes that Can be
Better Assessed with Other Types of Assessment
Some types of learning outcomes can be more efficiently and more reliably
assessed with objective tests than with essay questions. Since essay questions
sample a limited range of subject-matter content, are more time consuming
to score and involve greater subjectivity in scoring, the use of essay questions
should be reserved for learning outcomes that cannot be better assessed by
some other means.
For Example:

        Birds                           Amphibians
A.  Lay a few eggs at a time        Lay many eggs at a time
B.  Lay eggs                        Give birth
C.  Do not incubate eggs            Incubate eggs
D.  Lay eggs in nest                Lay eggs on land
(i) The problem of learner responses containing ideas that were not meant
to be assessed; and
Although more structure helps to avoid these problems, how much and what
kind of structure and focus to provide is dependent on the intended learning
outcome that is to be assessed by the essay question. The process of writing
effective essay questions involves defining the task and delimiting the scope
of the content in an effort to create an effective question that is aligned with
the intended learning outcome to be assessed (illustrated in Figure 5.1).
Figure 5.1: Alignment between content, learning activities and assessment tasks
Source: Phillips, Ansary Ahmed and Kuldip Kaur (2005)
The alignment shown in Figure 5.1 is absolutely necessary in order to obtain a
learner response that can be accepted as evidence that the learner has achieved
the intended learning outcome. Hence, the essay question must be carefully and
thoughtfully written in such a way that it elicits learner responses which provide
the teacher with valid and reliable evidence about the learners' achievement of the
intended learning outcome. Failure to establish adequate and effective limits for
learners' answers to the essay question may result in learners setting their own
boundaries for their responses. This means that learners might provide answers
that are outside the intended task or address only a part of the intended task. When
this happens, the teacher is left with unreliable and invalid information about the
learners' achievement of the intended learning outcome. Moreover, there is no
basis for marking or grading the learners' answers. Therefore, it is the
responsibility of the teacher to write essay questions in such a way that they
provide learners with clear boundaries for their answers or responses.
For Example:
The verb is "evaluate", which is the task the learner is supposed to do. The scope
of the question is the impact of the Industrial Revolution on England. Very little
guidance is given to learners about the task of evaluating and the scope of the
task. A learner reading the question may ask:
(b) Evaluate based on what criteria? The significance of the revolution? The
quality of life in England? Progress in technological advancement? [The
task is not clear].
The improved question delimits the task for learners by specifying a particular
unit of society in England affected by the Industrial Revolution (family). The
task is also delimited by giving learners a criterion for evaluating the impact of
the Industrial Revolution (whether or not families were able to provide for their
children's education). Learners are clearer as to what must be done to
"evaluate". They need to explain how the family has changed and judge
whether the changes are an improvement for their children.
SELF-CHECK 5.3
(e) Specify the Approximate Time Limit and Marks Allotted for each Question
Specifying the approximate time limit helps learners allocate their time in
answering several essay questions. Without such guidelines learners may be
at a loss as to how much time to spend on a question. When deciding the
guidelines for how much time should be spent on a question, keep the slower
learners and learners with certain disabilities in mind. In addition, make sure
that learners can be realistically expected to provide an adequate answer in
the given or suggested time. Similarly, state the marks allotted for each
question so that learners can decide how much they should write for the
question.
(f) Use Several Relatively Short Essay Questions Rather than One Long Essay
Question
Only a very limited number of essay questions can be included in a test
because of the time it takes for learners to respond to them and the time it
takes for teachers to grade the learnersÊ responses. This creates a challenge
with regard to designing valid essay questions. Several shorter essay questions
are better suited to assess the breadth of learners' learning within a subject,
whereas one longer essay question is better suited to assess the depth of
learners' learning. As such, there is a trade-off when choosing between several
short essay questions and one long essay question. Focusing on the depth of
learners' learning limits the assessment of the breadth of learners' learning
within the same subject, and vice versa. When choosing
between using several short essay questions or one long one, also keep in
mind that short essays are generally easier to mark than long essay questions.
(ii) Some questions are likely to be harder, which could make the
comparative assessment of learners' abilities unfair.
One way to write HOT essay questions is to present learners with a situation
they have not previously encountered so that they must reason with their
knowledge and provide an authentic assessment of complex thinking. For
example, present something in the form of introductory text, visuals,
scenarios, resource material or problems of some sort for learners to think
about.
SELF-CHECK 5.4
1. Why should you specify the time to be allotted for each question?
SELF-CHECK 5.5
1. Select some essay questions in your subject area and examine
whether the verbs used are similar to the list in Table 5.1. Do you
think the tasks required by the verbs used are appropriate?
2. Do you think learners are able to differentiate between the tasks
required in the verbs listed?
3. Are teachers able to describe to learners the tasks required by these
verbs?
The two most common forms of scoring guides used are the analytic and holistic
methods.
factors (three marks for each factor) and one mark is given for providing
a relevant example. Marks are also allotted for the "introduction" and
"conclusion", which are important elements in an essay.
For Example:
Explain the five factors that determine the location of a manufacturing plant.

1. Introduction
   Sets the organisation of the essay: 2 points
2. Site or land
   Description (how site influences location): 3 points
   Example: 1 point
3. Labour
   Description (role of workers): 3 points
   Example: 1 point
5. Transport system
   Description (access to port, population): 3 points
   Example: 1 point
6. Entrepreneurial skills
   Description (kind of skill): 3 points
   Example: 1 point
The marker reads and compares the learner's answer with the model answer.
If all the necessary elements are present, the learner receives the maximum
number of points. Partial credit is given based on the elements included in
the answer. In order to arrive at the overall exam score, the instructor adds
the points earned on the separate questions.
Identify in advance what will be worth a point and how many points are
available for each question. Make sure that learners are aware of this so that
they do not give more (or less) than necessary and they know precisely what
you are looking for. If learners produce an unexpected but correct example,
give them the point immediately and add that point to your answer key so
that the next learner will also get the point.
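In data terms, analytic scoring amounts to a per-question key of scorable elements with fixed point values, summed over questions. The following minimal Python sketch shows the idea; the question names, elements and point values are illustrative assumptions only.

For Example (Python):

    # Each question maps scorable elements to the points they are worth.
    answer_key = {
        "Q1": {"site": 3, "labour": 3, "transport": 3, "example": 1},
        "Q2": {"definition": 2, "two_causes": 4},
    }

    def score_exam(credited: dict[str, list[str]]) -> int:
        """Sum the points for every keyed element credited to the learner."""
        return sum(answer_key[q][element]
                   for q, elements in credited.items()
                   for element in elements)

    print(score_exam({"Q1": ["site", "example"], "Q2": ["definition"]}))  # 6

    # An unexpected but correct element can be added to the key so that
    # later scripts earn the same credit.
    answer_key["Q1"]["unexpected_example"] = 1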
You can develop a description of the type of response that would illustrate
each category before you start and try out this draft version using several
actual papers. After reading and categorising all of the papers, it is a good
idea to re-examine the papers within a category to see if they are similar
enough in quality to receive the same points or grade. It may be faster to read
essays holistically and provide only an overall score or grade but learners do
not receive much feedback about their strengths and weaknesses. Some
instructors who use holistic scoring also write brief comments on each paper
to point out one or two strengths and/or weaknesses so that learners will
have a better idea as to why their responses received such scores.
Level of Achievement: Exemplary (10 pts)
General Presentation: Addresses the question. States a relevant argument.
Reasoning, Argumentation: Demonstrates accurate and complete understanding
of the question.
SELF-CHECK 5.6
1. Compare and contrast the analytical method and holistic method of
marking essays.
2. Which method is widely practised in your institution? Why?
Lastly before we end the topic, let us look at some suggestions for marking or
scoring essays:
(a) Grade papers anonymously. This will help control the influence of our
expectations regarding the learner being evaluated.
(b) Read and score the answers to one question before going on to the next
question. In other words, score all the learners' responses to Question 1
before looking at Question 2. This helps to keep one frame of reference and
one set of criteria in mind throughout all the papers, which will also result in
more consistent grading. It also prevents an impression that we form in
reading one question from being carried over to our reading of the learner's
next answer. If a learner has not done a good job on the first question, for
example, we might allow this impression to influence our evaluation of the
learner's second answer. However, if other learners' papers come in
between, we are less likely to be influenced by the original impression.
(c) If possible, try to grade all the answers to one particular question without
interruption. Our standards might vary from morning to night or from one
day to the next.
(d) Shuffle the papers after each item is scored throughout all the papers.
Changing the order reduces the context effect and the possibility that a
learner's score is the result of the location of the paper in relation to other
papers. For example, if Rakesh's "B" work is always followed by Jamal's "A"
work, then Rakesh's work might look more like "C" work and his grade
would be lower than if his paper were somewhere else in the stack.
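Re-randomising the marking order between questions is easy to do by hand, or programmatically as in this minimal Python sketch; the learner names echo the example above and are purely illustrative.

For Example (Python):

    import random

    papers = ["Rakesh", "Jamal", "Mei Ling", "Siti"]  # illustrative learners
    for question in ("Q1", "Q2", "Q3"):
        random.shuffle(papers)   # a fresh order for every question
        print(question, papers)  # mark the scripts in this order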
(e) Decide in advance how you are going to handle extraneous factors and be
consistent in applying the rule. Learners should be informed about how you
treat matters such as misspelled words, neatness, handwriting, grammar and
so on.
(f) Be on the alert for bluffing. Some learners who do not know the answer may
write a well-organised coherent essay which may contain material that is
irrelevant to the question. Decide how to treat irrelevant or inaccurate
information contained in learnersÊ answers. We should not give credit for
irrelevant material. It is not fair to other learners who may also have
preferred to write on another topic but instead wrote on the required
question.
(g) Write comments on learners' test scripts. The teacher's comments make essay
tests a good learning experience for learners. They also serve to refresh your
memory of your evaluation should the learner question the grade he
receives.
(h) Be aware of the order in which papers are marked as it can have an impact
on the grades that are awarded. A marker may grow more critical (or more
lenient) after having read several papers, thus the early papers receive lower
(or higher) marks than papers of similar quality that are scored later.
(i) When learners are directed to take a stand on a controversial issue, the
marker must be careful to ensure that the evidence and the way it is presented
are evaluated, NOT the position taken by the learner. If the learner takes a
position contrary to that of the marker, the marker must be aware of his or
her own possible bias in marking the essay because the learner's position
differs from that of the marker.
There are two types of essays based on their function, namely, coursework
essay and examination essay.
It is not possible with the use of just a few essay questions to assess the learner's
mastery of the complete subject-matter domain.
Essay questions have two variable elements, which are the degree to which the
task is structured and the degree to which the scope of the content is focused.
Specifying the approximate time limit helps learners allocate their time in
answering several essay questions during a test.
Avoid using essay questions for intended learning outcomes that can be better
assessed with other types of assessment.
INTRODUCTION
Many teachers use traditional assessment tools such as multiple-choice and essay
tests to assess their learners. How well do these multiple-choice or essay tests
really evaluate learner understanding and achievement? These traditional
assessment tools do serve a role in the assessment of learner outcomes. However,
assessment does not always have to involve paper and pencil tests. It can also be a
project, an observation or a task as long as it is able to demonstrate a learner has
learned the material. Are these alternative assessment tools more effective than
traditional ones?
Some classroom teachers are using testing strategies that do not focus entirely on
recalling facts. Instead, they ask learners to demonstrate skills and concepts they
have learned. Teachers may want to ask the learners to apply their skills to
authentic tasks and projects or to have learners demonstrate their knowledge in
ways that are much more applicable to life outside of the classroom. Learners must
then be trained to perform meaningful tasks that replicate real world challenges.
In other words, these teachers are trying to assess students' abilities in
"real-world" contexts. In order to do this, learners are asked to perform a task,
such as explaining historical events or solving mathematics problems, rather
than selecting an answer from a ready-made list.
Learner and school performance gains are achieved through regular reviews of
results followed by targeted adjustments to curriculum and instruction.
Teachers can teach learners how to do mathematics, do history and do science, not
just know them. Subsequently, to assess what the learners have learned, teachers
can ask learners to perform tasks that replicate real challenges, such as using
mathematical principles or conducting historical or scientific investigations.
(b) Authentic activities require learners to define the tasks and sub-tasks needed
to complete the activity: Problems inherent in the activities are open to
multiple interpretations rather than easily solved by the application of
existing algorithms.
(d) Authentic activities provide the opportunity for learners to examine the task
from different perspectives, using a variety of resources: The use of a variety
of resources rather than a limited number of preselected references requires
learners to distinguish relevant from irrelevant information.
(g) Authentic activities can be integrated and applied across different subject
areas and lead beyond domain-specific outcomes: Activities encourage
interdisciplinary perspectives and enable learners to play diverse roles, thus
building robust expertise rather than knowledge limited to a single well-
defined field or domain.
(i) Authentic activities create polished products valuable in their own right
rather than as preparation for something else.
(j) Authentic activities allow for competing solutions and diversity of outcomes:
Activities allow for a range and diversity of outcomes open to multiple
solutions of an original nature rather than a single correct response obtained
by the application of rules and procedures.
SELF-CHECK 6.1
(a) Identify which standards you want the learners to achieve through this
assessment. Standards, like goals, are statements of what learners should
know and be able to do. Standards must be observable and measurable.
Teachers can observe a performance but not an understanding. Thus, a
statement such as "Learners will understand how to add two-digit numbers"
is not observable whereas a statement such as "Learners will correctly add
two-digit numbers" is observable and measurable.
(b) Choose a relevant task for the standard or set of standards so that learners
can demonstrate how they have or have not met the standards. In this step,
teachers may want to find a way in which learners can demonstrate how they
are fully capable of meeting the standard. For the standard such as
"Understand how to add two-digit numbers", the task may be to ask learners
to describe a real-life situation, story or problem. Teachers may elicit
strategies from the learners, asking them to demonstrate and explain their
reasoning to their classmates. That might take the form of a multimedia
presentation which learners develop (individually or collaboratively),
utilising Ten-Frames with some counters.
(c) Define the characteristics of good performance for the task. This will provide
useful information regarding how well learners have met the standards.
For this step, teachers identify the criteria for good performance of this task.
They may write down a few characteristics for successful completion of the
task.
(d) Create a rubric or set of guidelines for learners to follow so that they are able
to assess their work as they perform the assigned task.
SELF-CHECK 6.2
Authentic assessments have many benefits, the main one being that they support
learner success. Authentic assessments focus on progress rather than on
identifying weaknesses.
(a) It has the advantage of providing parents and community members with
directly observable products and understandable evidence concerning their
learners' performance. The quality of learners' work is more discernible to
laypersons compared to the reliance on abstract statistical figures;
(b) Uses tasks that reflect normal classroom activities or real-life learning. The
tasks are a means for improving instruction, allowing teachers to plan a
comprehensive, developmentally oriented curriculum based on their
knowledge of each child;
There is nothing new about this authentic assessment methodology. It is not some
kind of radical invention recently fabricated by the opponents of traditional tests
to challenge the testing industry. Rather it is a proven method of evaluating human
characteristics and has been in use for decades (Lindquist, 1951).
SELF-CHECK 6.3
Table 6.1 summarises the major differences between the authentic and traditional
assessments.
The quality of information acquired through the use of checklists, rating scales
and rubrics is highly dependent on the quality of the descriptors chosen for the
assessment. Their benefit is also dependent on learnersÊ direct involvement in the
assessment and understanding of the feedback provided.
(c) Provide samples of criteria for learners prior to collecting and evaluating
data on their work recording the development of specific skills, strategies,
attitudes and behaviours necessary for demonstrating learning; and
Scoring rubrics have become a common method for assessing learners. Scoring
rubrics are descriptive scoring schemes that are developed by teachers or other
evaluators to guide the analysis of the products or processes of learners' efforts
(Brookhart, 1999). As scoring tools, rubrics are a way of describing evaluation
criteria based on the expected outcomes and performances of learners. Each rubric
consists of a set of scoring criteria and point values associated with these criteria.
In most rubrics the criteria are grouped into categories so that the teacher and the
learner can discriminate among the categories by level of performance.
Rubrics have been introduced into today's classrooms in order to give learners a
better understanding of what is being assessed, what criteria the grades are based
upon as well as what clear and compelling product standards are being addressed.
The focus of rubrics and scoring guides is to monitor and adjust progress rather
than to only assess the end result.
As a guide for planning, rubrics and scoring guides give learners clear targets of
proficiency. With these assessments in hand, they know what quality looks like
before they start working. When learners use such assessments regularly to judge
their own work, they begin to accept more responsibility for the end product.
Rubrics and scoring guides offer several advantages for assessment:
(a) Learners become better judges of the quality of their own work;
(b) Learners have more informative feedback about their strengths and areas in
need of improvement;
(c) Learners become aware of the criteria to use in providing peer feedback;
The rubric shown in Table 6.2 covers the research portion of a project.
A rubric comprises two components, namely the criteria and the levels of
performance. Each rubric has at least two criteria and at least two levels of
performance. The criteria, which are the characteristics of good performance on a
task in this example, are listed on the left-hand column in the rubric (number of
sources, historical accuracy, sources of information and bibliography).
The rubric also contains a mechanism for assigning a score to each project. In the
second last column a weight (Wt.) is assigned for each criterion. Learners can
receive 1, 2 or 3 points for the number of sources criterion. However, the historical
accuracy criterion, which is considered more important in the teacher's opinion, is
weighted three times more heavily. This means learners can receive 3, 6 or 9
points (that is, 1 × 3, 2 × 3 or 3 × 3) for the level of accuracy in their projects.
In the example, "lots of historical inaccuracies", "can tell with difficulty where the
information came from" and "all relevant information is included" are descriptors.
The descriptors help the teacher to be more precise and able to consistently
distinguish between learners' works. However, it is not easy to write good
descriptors for each level and each criterion.
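The weighted scoring just described reduces to a small computation: the level awarded on each criterion, multiplied by that criterion's weight, summed over all criteria. The following is a minimal Python sketch; the criteria and weights mirror the example above, while the awarded levels are illustrative assumptions.

For Example (Python):

    # Weight per criterion; historical accuracy counts three times as much.
    weights = {"number_of_sources": 1, "historical_accuracy": 3,
               "sources_of_information": 1, "bibliography": 1}

    def rubric_score(levels: dict[str, int]) -> int:
        """Levels run from 1 to 3 per criterion; weights turn them into points."""
        return sum(levels[criterion] * weight
                   for criterion, weight in weights.items())

    project = {"number_of_sources": 2, "historical_accuracy": 3,
               "sources_of_information": 2, "bibliography": 1}
    print(rubric_score(project))  # 2 + 9 + 2 + 1 = 14 points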
Teachers can use rating scales to record observations and learners can use them as
self-assessment tools. Teaching learners to use descriptive words such as always,
usually, sometimes and never, helps them pinpoint specific strengths and needs.
Rating scales also provide learners with information for setting goals and
improving performance.
Rating scales list performance statements in one column and the range of
accomplishments in descriptive words, with or without numbers, in other
columns.
The descriptive word is more important than the corresponding number. The more
precise and descriptive the words for each scale point, the more reliable the tool.
Effective rating scales use descriptors with clearly understood measures such as
frequency. Scales that rely on subjective descriptors for quality such as fair, good
or excellent, are less effective because the single adjective does not contain enough
information on what criteria are indicated at each of these points on the scale.
The range of numbers should always increase or always decrease. For example, if
the last number is the highest achievement in one section, the last number should
also be the highest achievement in all the other sections as well.
Figure 6.1 is an example of the rating scale used for interpersonal skills assessment.
6.4.3 Checklists
Checklists usually offer a "yes" or "no" format in relation to learners
demonstrating specific criteria. An assessment checklist takes each achievement
objective and turns it into one or more "learner can do" statements. Checklists may
be used to record observations of an individual, a group or a whole class.
Name:
Date:
Class:

Achievement Objective: Communicate about numbers

Items (to be marked Yes or No, with Comments):
• Can understand the numbers 1–100 through listening
• Can say the numbers 1–100
• Can count 1–100
• Can write the numbers 1–100
• Can understand numbers 1–100 when written in words
• Can write numbers 1–100 in words
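Because a checklist is essentially a set of "learner can do" statements each marked yes or no, it maps naturally onto a small record structure. The following minimal Python sketch is one way to represent it; the class and field names are assumptions, not a prescribed format.

For Example (Python):

    from dataclasses import dataclass, field

    @dataclass
    class ChecklistItem:
        statement: str          # e.g. "Can count 1-100"
        achieved: bool = False  # the yes/no judgement
        comment: str = ""

    @dataclass
    class Checklist:
        learner: str
        objective: str
        items: list[ChecklistItem] = field(default_factory=list)

    record = Checklist(
        learner="Aina",  # illustrative name
        objective="Communicate about numbers",
        items=[ChecklistItem("Can count 1-100", achieved=True)],
    )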
SELF-CHECK 6.4
1. Define authentic assessment.
INTRODUCTION
Besides objective and essay tests, there are other types of assessments which can
be collectively categorised as performance-based assessment. According to the
definition provided by the Standards (AERA et al., cited in Reynold, Livingston &
Willson, 2006), performance assessments require test takers to complete a task in a
context or setting that closely resembles a real-life situation. For example, to assess
oral communication skills, the assessment might require test takers to participate
in a group dialogue session. Likewise, to assess the teaching ability of teacher
trainees, the trainees might be required to conduct a lesson with a group of
learners. The emphasis of performance-based assessment is thus on doing, not
merely knowing, on process as well as product. In this context, an essay test that
is used to assess writing skills in language learning can be considered as a type of
performance assessment. Since essay questions have been discussed earlier, we
will focus on other types of such assessments in this topic, namely projects and
portfolios. Project assessments which require learners to perform practical tasks to
organise or create something are fast becoming a common practice in school-based
assessments. Portfolios, which are considered a specific form of performance
assessment that is useful in assessing learner learning over time, involve the
systematic collection of a learner's work produced over a specified period of time
according to a specific set of guidelines (Reynold, Livingston & Willson, 2006).
Technically, there are differences between projects and project-based learning (PBL).
While PBL also features projects, the focus is more on the process of learning and
learner-peer-content interaction than on the end product itself. PBL closely resembles
work done in the real world. The scenario or simulation is real whereas projects
are usually based on events that have already been resolved.
ACTIVITY 7.1
A project is an activity in which time constraints have been largely removed and
can be undertaken individually or by a group and usually involves a significant
element of work being done at home or out of school (Firth & Mackintosh, 1987).
Project work has its roots in the constructivist approach which evolved from the
work of psychologists and educators such as Lev Vygotsky, Jerome Bruner, Jean
Piaget and John Dewey. Constructivism views learning as the result of mental
construction whereby learners learn by constructing new ideas or concepts based
on their current and previous knowledge.
Most projects have certain common defining features (Katz & Chard, 1989) such
as:
(a) Learner-centred;
(h) A tangible product that can be shared with the intended audience;
(k) Multiple types and authentic assessments (portfolios, journals, rubrics and
others).
In project work, the whole work process is as important as the final result or
product. Work process refers to learners choosing a knowledge area, delimiting it
and formulating a problem or putting forward questions. It also involves learners
investigating and describing what is required to solve a given problem or answer
a specific question through further work, collection of materials and knowledge.
Project work is planned so that it can be carried out within the available time.
Preferably, the task should be drawn from knowledge areas in the current
curriculum. Project work is an integrated learning experience that encourages
learners to break away from the compartmentalisation of knowledge and instead
involves drawing upon different aspects of knowledge. For example, making an
object not only requires handicraft skills but also knowledge of materials, working
methods and uses of the object. Technological support will also enhance learners'
learning. Thinking skills are integral to project work.
Similarly, writing the project report requires writing skills learned in the language
classroom, applying them when analysing and drawing conclusions for a science
project. Generally, there are two types of projects, namely research-based and
product-based projects.
There are many types of effective projects. The following are just some project
ideas:
(f) Compile oral histories of the local area by interviewing community elders;
The possibilities for projects are endless. The key ingredient for any project idea is
that it is learner-driven, challenging and meaningful. It is important to realise that
project-based instruction complements the structured curriculum. Project-based
instruction builds on and enhances what learners learn through systematic
instruction. Teachers do not let learners become the sole decision makers about
what project to work on, nor do teachers sit back and wait for the learners to figure
out how to go about the process which can be very challenging (Bryson, 1994). This
is where the teacher's ability to facilitate and act as a coach plays an important part
in the success of a project. The teacher would brainstorm ideas with the
learner to generate project possibilities, discuss possibilities and options, help the
learner form a guiding question and be ready to help the learner throughout the
implementation process, such as setting guidelines, due dates, resource selection
and so forth (Bryson, 1994; Rankin, 1993).
SELF-CHECK 7.1
1. What are the main differences between a project and project-based
learning?
You can see this with a class of young learners. When the teacher tells a story, little
kindergarten children raise their hands eager to share their experiences about
something related to the story. They want to be able to apply their natural
tendencies to the learning process. This is how life is much of the time! By giving
project work, we open up areas in schooling where learners can speak about what
they already know.
SELF-CHECK 7.2
(d) Rules
Guidelines for carrying out the project include a timeline and short-term
goals such as having interviews completed by a certain date and specifying
the completion date of the project.
(f) Assessment
How the learner's performance will be evaluated. In project work, the
learning process is evaluated as well as the final product.
Before designing the project, identify the learning goals and objectives. What
specific skills or concepts will learners learn? Herman, Aschbacher and Winters
(1992) have identified five questions to consider when determining learning goals:
(b) What social and affective skills do I want my learners to develop? (For
example, to develop teamwork skills);
(e) What concepts and principles do I want my learners to be able to apply? (For
example, to apply basic principles of biology and geography in their lives, to
understand cause-and-effect in relationships)
Steinberg (1998) provides a checklist for the design of effective projects (see
Table 7.2). The checklist can be used throughout the process to help both the
teacher and learner to plan and develop a project as well as to assess whether the
project was successful in meeting instructional goals.
Applied learning:
• Does the learner solve a problem that is grounded in real life and/or work
(for example, design a project or organise an event)?
• Does the learner need to acquire and use skills expected in high-performance
work environments (for example, teamwork, problem-solving, communication
or technology)?
• Does the project require the learner to develop organisational and
self-management skills?

Active exploration:
• Does the learner spend significant amounts of time doing field work, outside
school?
• Does the project require the learner to engage in real investigative work,
using a variety of methods, media and sources?
• Is the learner expected to explain what he or she has learned through a
presentation or performance?

Adult relationships:
• Does the learner meet and observe adults with relevant experience and
expertise?
• Is the learner able to work closely with at least one adult?
• Do adults and the learner collaborate on the design and assessment of the
project?

Assessment practices:
• Does the learner reflect regularly on his or her learning, using clear project
criteria that he or she has helped to set?
• Do adults from outside the community help the learner develop a sense of
the real-world standards for this type of work?
• Is the learner's work regularly assessed through a variety of methods
including portfolios and exhibitions?
Source: Adaptation of Steinberg (1998) Real learning, real work: School-to-work as high
school reform
(a) Do the learners have easy access to the resources they need? This is especially
important if a learner is using specific technology or subject-matter expertise
from the community;
(b) Do the learners know how to use the resources? Learners who have minimal
experience with computers, for example, may need extra assistance in
utilising them;
(c) Do the learners have mentors or coaches to support them in their work? This
can be in-school or out-of-school mentors; and
(d) Are learners clear on the roles and responsibilities of each person in the
group?
SELF-CHECK 7.3
1. What are some of the factors you should consider when designing
project work for learners in your subject area?
(a) Aligning project goals with curriculum goals can be difficult. To make
matters worse, parents are not always supportive of projects when they
cannot see how the projects relate to the overall assessment of learning;
(b) Projects can often take longer than expected and teachers need a lot of time
to prepare good authentic projects;
(c) Learners are not clear as to what is required. There is need for adequate
structure, guidelines and guidance on how to carry out projects;
(d) Intensive staff development is required. This is because teachers are not
traditionally prepared to integrate content into real-world activities;
(e) The resources needed for project work may not be readily available and there
might be a lack of administrative support; and
(f) Some teachers may not be familiar with how they should assess the projects.
What are some benefits of group work in projects? Let us read the following:
(a) Peer Learning Can Improve the Overall Quality of Learner Learning
Group work enhances learner understanding. Learners learn from each other
and benefit from activities that require them to articulate and test their
knowledge. Group work provides an opportunity for learners to clarify and
refine their understanding of concepts through discussions and rehearsals
with peers. Many, but not all, learners recognise the value of group work to
their personal development and of being assessed as a member of the group.
Working with a group and for the benefit of the group also motivates some
learners. Group assessment helps some learners develop a sense of
responsibility: "I felt that because one is working in a group, it is not possible
to slack off or to put things off. I have to keep working otherwise I would be
letting other people down".
(b) Group Work Can Help Develop Specific Generic Skills Sought by Employers
As a direct response to the objective of preparing graduates with the capacity
to function successfully as team members in the workplace, there has been a
trend in recent years to incorporate generic skills alongside traditional
subject-specific knowledge in the expected learning outcomes in higher
education. Group work can facilitate the development of skills which
include:
(c) Group Work May Reduce the Workload Involved in Assessing, Grading and
Providing Feedback to Learners
Group work and group assessment in particular, is sometimes implemented
in the hope of streamlining assessment and grading tasks. In simple terms, if
learners submit group assignments the number of pieces of work to be
assessed can be vastly reduced. This prospect might be particularly attractive
for staff teaching large classes.
SELF-CHECK 7.4
1. What are some problems in the implementation of project work
and how would you solve them?
Table 7.3 could give you some ideas on how to assess and award marks for
project work.
Marks: Criteria

90–100%:
Exceptional and distinguished work of a professional standard.
Outstanding technical and expressive skills.
Work demonstrating exceptional creativity and imagination.
Work displaying great flair and originality.

80–89%:
Excellent and highly developed work of a professional standard.
Extremely good technical and expressive skills.
Work demonstrating a high level of creativity and imagination.
Work displaying flair and originality.

70–79%:
Very good work which approaches professional standard.
Very good technical and expressive skills.
Work demonstrating good creativity and imagination.
Work displaying originality.

60–69%:
A good standard of work.
Good technical and expressive skills.
Work displaying creativity and imagination.
Work displaying some originality.

50–59%:
A reasonable standard of work.
Adequate technical and expressive skills.
Work displaying competence in the criteria assessed but which may be lacking
some creativity or originality.

40–49%:
Limited but adequate standard of work.
Limited technical and expressive skills.
Work displaying some weaknesses in the criteria assessed and lacking
creativity or originality.

30–39%:
Limited work which fails to meet the required standard.
Weak technical and expressive skills.
Work displaying significant weaknesses in the criteria assessed.

20–29%:
Poor work. Unsatisfactory technical or expressive skills.
Work displaying significant or fundamental weaknesses in the criteria
assessed.

10–19%:
Very poor work or work where very little attempt has been made.
A lack of technical or expressive skills.
Work displaying fundamental weaknesses in the criteria assessed.

1–9%:
Extremely poor work or work where no serious attempt has been made.
When assessing project work, you need to be clear about what to assess. Is it the
product, the process or both? According to Bonthron and Gordon (1999), from the
outset you should be clear about:
(a) Whether you are going to assess the product of the group work or both
product and process.
(b) If you intend to assess the process, what proportion of marks are you going
to allocate to it, what criteria will you use and how are you going to assess
the process?
(c) What criteria are you planning to use to assess the project work and how will
the marks be distributed?
Some educators believe there is a need to assess the processes within groups as
well as the products or outcomes. What exactly does "process" mean? Both
teachers and learners must have a clear understanding of what the process means.
For example, if you want to assess "the level of interaction" among learners in the
group, they should know what "high" or "low" interaction means. Should the
teacher involve himself in the workings of each group or rely on self or peer
assessment? Obviously, being involved in many groups would be physically
impossible for the teacher. As a result, some educators may say, "I don't care what
they do in their groups. All I'm interested in is the final product and how they
arrive at their results is their business". However, to provide a more balanced
assessment, there is growing interest in both the process and product of group
work and the issue that arises is "What proportion of assessment should focus on
product and what proportion should focus on process?"
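However the proportions are settled, the final mark can then be computed as a weighted blend of the product and process marks. The following is a minimal Python sketch; the 70/30 split is purely an illustrative assumption.

For Example (Python):

    def project_mark(product_mark: float, process_mark: float,
                     product_weight: float = 0.7) -> float:
        """Blend product and process marks using an agreed proportion."""
        return (product_weight * product_mark
                + (1 - product_weight) * process_mark)

    print(project_mark(80, 60))  # 0.7 * 80 + 0.3 * 60 = 74.0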
The criteria for the evaluation of group work can be determined by the teacher
alone or by the teacher and learners together through consultation. Group
members can thus be consulted on what should be assessed in a project.
Obviously, you have to be clear about the
intended learning outcomes of the project in your subject area. It is a useful starting
point for determining criteria for assessment of the project. Once these broader
learning outcomes are understood, you can establish the criteria for marking the
project. Generally, it is easier to establish criteria for measuring the "product" of
project work and much more difficult to measure the "processes" involved in
producing it.
Another important point to note is that we need to be clear about who gets the
marks: individuals or the group as a whole? Most projects involve more than one learner
and the benefits of group work have been discussed earlier. A major problem of
evaluating projects involving group work is how to allocate marks fairly among
group members. The following questions are those mentioned by learners: "I
would like my teacher to tell me what amount of work and effort will enable
me to obtain a certain mark", "Do all learners obtain the same mark even though
not all learners put in the same effort?" and "Are marks given based on individual
contributions of team members?"
These are questions that trouble teachers especially when it is common to find
freeloaders or sleeping partners in group projects. The following are some
suggestions on how group work may be assessed:
SELF-CHECK 7.5
Which of the five methods of assessing group work would you use in
assessing project work in your subject area? Give reasons for your choice.
Having a logbook can potentially provide plenty of information to form the basis
of assessment while keeping minutes helps members to focus on the process which
is a learning experience in itself. These techniques may be perceived as a fair way
to deal with freeloaders and outstanding contributors. However, reviewing logs
can be time consuming for teachers or instructors and learners may need a lot of
training and experience in order to keep the records. In addition, an emphasis on
second-hand evidence may not be reliable.
According to Edwards (2000), the following are some questions that a learner can
ask himself or herself while conducting self-assessment:
(c) How well did I meet my learning goals? What was most difficult about
meeting the goals?
(e) What was my group's best team effort? Worst team effort?
(f) How do I think other people involved with the project felt about the progress
and end product of the project?
(g) What were the skills which I used during this project? How can I apply
these skills in the future?
SELF-CHECK 7.6
Learner portfolios may take many forms and are not easy to describe. A portfolio
is not simply the pile of learner work that accumulates over a semester or year.
Rather, a portfolio is a purposeful collection of the works produced by learners
which reflects their efforts, progress and achievements in different areas of the
curriculum. According to Paulson, Paulson and Meyer (1991), "portfolios offer a
way of assessing learner learning that is different from traditional methods.
Portfolio assessment provides the teachers an opportunity to observe learners in a
broader context which involves taking risks, developing creative solutions and
learning to make judgements about their performances".
(a) Allows the teacher to view the learner as an individual, each with his own
unique characteristics, needs and strengths;
(f) Invites learners to reflect upon their growth and performance as learners.
However, Epstein (2006) also mentioned some of the problems with portfolio
assessments. Portfolio assessments may be less reliable because they tend to be
more qualitative than quantitative in nature. Society is still strongly oriented
towards grades and test scores. In addition, most universities and colleges still use
test scores and grades as the main admission criteria. Moreover, portfolio
assessment may be time-consuming for teachers, and data from portfolio
assessments can be difficult to analyse.
(a) Collection
This step simply requires learners to collect and store all of their work.
Learners have to get used to the idea of documenting and saving their work,
something which they may not have done before.
(iii) How to get learners to form the habit of documenting the evidence?
(b) Selection
This will depend on whether it is a process or product portfolio and the
criteria set by the teacher. Learners will go through the work collected and
select certain works for their portfolio. This might include examination
papers and quizzes, audio and video recordings, project reports, journals,
computer work, essays, poems, artwork and so forth. In short, any work that
provides evidence of learning may be selected.
(c) Reflection
This is the most important step in the portfolio process. It is the reflection
involved that differentiates a portfolio from a mere collection of a learner's
works. Reflection is often done in writing but it can also be done orally.
Learners are asked why they have chosen a particular product or work (for
example, an essay), how it compares with other works, what particular skills
and knowledge were used to produce it and how it can be further improved.
In addition:
(i) Learners should reflect on how or why they chose certain works.
(d) Connection
As a result of "reflection", learners will begin to ask themselves, "Why are
we doing this?" They are encouraged to make connections between their
school work and the value of what they are learning. They are also
encouraged to make connections between the works included in their
portfolio and the world outside the classroom. They learn to relate what they
have done in school to the happenings and situations in the community.
(vii) Giving learners the opportunity to have extensive input into the
learning process; and
Issues to consider include:
(i) Requiring extra time to plan an assessment system and conduct the
assessment, especially for large groups of learners;
(ii) Gathering all of the necessary data and work samples can make
portfolios bulky and difficult to manage;
The portfolio is more than just a collection of a learner's work. The teacher may
assess and assign grades to the process of assembling and reflecting upon the
portfolio of a learner's work. The learner might have also included reflections on
growth, strengths and weaknesses, on goals that were or are to be set, on why
certain samples tell certain stories about them or on why the contents reflect
sufficient progress to indicate completion of designated standards. Some of the
process skills may also be part of the teacher's, school's or district's standards. As
such, the portfolio provides some evidence of attainment of those standards. Any
or all of these elements can be evaluated and/or graded.
Portfolio assignments can also be assessed or graded with a rubric. A rubric is
useful in reducing the personal judgement that goes into assessing a complex
product such as a portfolio, and it can provide some clarity and consistency in
assessing and judging the quality of the content and the elements that make up
that content. Moreover, applying a rubric increases the likelihood of consistency
among teachers who are assessing the portfolios.
The following portfolio rubric (see Figure 7.3) may be used for self-assessment and
peer feedback.
SELF-CHECK 7.7
1. To what extent is portfolio assessment used in Malaysian
classrooms?
A project is an activity in which time constraints have been largely removed; it
can be undertaken individually or by a group, and usually involves a
significant element of work being done at home or out of school.
The Six A's of a project are authenticity, academic rigour, applied learning,
active exploration, adult relationships and assessment practices.
Working in groups has become an accepted part of learning due to the widely
recognised benefits of collaborative group work for learner learning.
Various ways of allocating marks to project work include shared group
marks, shared-out marks, individual mark, individual mark (examination) and
a combination of group average and individual mark.
INTRODUCTION
In this topic we will address two important issues, namely the reliability and
validity of an assessment. How do we ensure that the techniques we use for
assessing the knowledge, skills and values of learners are reliable and valid? We
are making important decisions about the abilities and capabilities of our future
generation, so obviously we want to ensure that we are making the right decisions.
The true score is a hypothetical concept referring to the actual ability, competency
and capacity of an individual. A test attempts to measure the true score of a person.
When measuring human abilities, it is practically impossible to develop an
error-free test. However, just because there is error, it does not mean that the test
is not good. The more important factor is the size of the error.
Formally, an observed test score, X, is conceived as the sum of a true score, T, and
an error term, E. The true score is defined as the average of test scores if a test is
repeatedly administered to a learner (and the learner can be made to forget the
content of the test in-between repeated administrations). Given that the true score
is defined as the average of the observed scores, in each administration of a test,
the observed score departs from the true score and the difference is called
measurement error. This departure is not caused by blatant mistakes made by test
writers but it is caused by some chance elements in learners' performance during
the test.
Measurement error mostly comes from the fact that we have only sampled a small
portion of a learner's capabilities. Ambiguous questions and incorrect marking
can contribute to measurement error, but they are only a small part of it. Imagine
there is a pool of 10,000 items and a learner can answer 60 per cent of all
10,000 items correctly if all were administered (which is not practically feasible).
Then 60 per cent is the true score. Now, when you sample only 40 items in a test,
the expected score for the learner is 24 items, but the learner may get 20, 26, 30
and so forth depending on which items are in the test. In this example, this is the
main source of measurement error. That is to say, measurement error is due to the
sampling of items rather than poorly written items.
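This sampling effect is easy to see in a short simulation. The sketch below uses hypothetical numbers matching the example (a 10,000-item pool on which the learner knows 60 per cent of the items, with 40-item tests drawn at random); it is an illustration, not part of the module.

```python
# Measurement error arising purely from item sampling: draw repeated 40-item
# tests from a large pool on which the learner's true proportion correct is
# 60 per cent, and watch the observed score scatter around the expected 24.
import random

random.seed(1)
POOL_SIZE = 10_000
TRUE_PROPORTION = 0.60
TEST_LENGTH = 40

# 1 = the learner would answer the item correctly, 0 = incorrectly.
item_pool = [1] * int(POOL_SIZE * TRUE_PROPORTION) + \
            [0] * int(POOL_SIZE * (1 - TRUE_PROPORTION))

for trial in range(5):
    test_items = random.sample(item_pool, TEST_LENGTH)
    print(f"Test {trial + 1}: {sum(test_items)} out of {TEST_LENGTH} correct")
# Typical output: scores such as 21, 24, 27 and so on, scattered around 24.
```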
Error may come from various sources such as within the test takers (the learners),
within the test (questions are not clear), in the administration of the test or even
during scoring (or marking). For example, fatigue, illness, copying or even the
unintentional noticing of another learner's answer all contribute to error from
within the test taker.
Generally, the smaller the error, the greater the likelihood you are closer to
measuring the true score of the learners. If you are confident that your geometry
test (observed score) has a small error, then you can confidently infer that Swee
Leong's score of 66 per cent is close to his true score or his actual ability in solving
geometry problems, in other words, what he actually knows. To reduce the error
in a test, you must ensure that your test is reliable and valid. The higher the
reliability and validity of your test, the greater the likelihood you will be
measuring the true score of your learners.
We will first examine the reliability of a test. What is reliability? Reliability is the
consistency of the measurement. Would your learners get the same scores if they
took your test on two different occasions? Would they get approximately the same
score if they took two different forms of your test? These questions have to do with
the consistency of your classroom tests in measuring learners' abilities, skills and
attitudes or values. The generic name for consistency is reliability. Reliability is an
essential characteristic of a good test, because if a test does not measure consistently
(reliably), then you cannot count on the scores resulting from the administration
of the test (Jacobs, 1991).
If there is relatively little error, the ratio of the true score variance to the observed
score variance approaches 1.00, which is perfect reliability. If there is a relatively
large amount of error, the ratio of the true score variance to the observed score
variance approaches 0.00, which is total unreliability.
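In symbols (a standard classical test theory identity, added here for clarity; it does not appear in the module's figures), with $\sigma_T^2$ the true score variance, $\sigma_E^2$ the error variance and $\sigma_X^2 = \sigma_T^2 + \sigma_E^2$ the observed score variance:

$$r_{XX} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}$$

As the error variance shrinks towards zero, this ratio approaches 1.00; as error dominates, it approaches 0.00.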
High reliability means that the questions of a test tended to "pull together":
learners who answered a given question correctly were more likely to answer
other questions correctly, and if an equivalent or parallel test were developed
using similar items, the relative scores of learners would show little change. Low
reliability means that the questions tended to be unrelated to each other in terms
of who answered them correctly. The resulting test scores reflect that something is
wrong with the items or the testing situation rather than learners' knowledge of
the subject matter. The following guidelines may be used to interpret reliability
coefficients for classroom tests, as shown in Table 8.1.
Table 8.1: Interpretation of Reliability Coefficients for Classroom Tests

Reliability       Interpretation
0.90 and above    Excellent reliability (comparable to the best standardised tests).
0.80–0.90         Very good for a classroom test.
0.70–0.80         Good for a classroom test; there are probably a few items which could be improved.
0.60–0.70         Somewhat low. There are probably some items which could be removed or improved.
0.50–0.60         The test needs to be revised.
0.50 and below    Questionable reliability; the test should be replaced or given a major revision.
If you know the reliability coefficient of a test, can you estimate the true score of a
learner on a test? In testing, we use the Standard Error of Measurement to estimate
the true score.
Using the normal curve, you can estimate a learner's true score with some degree
of certainty based on the observed score and Standard Error of Measurement.
For Example:
You gave a history test to a group of 40 learners. Khairul obtained a score of 75 in
the test, which is his observed score. The standard deviation of your test is 2.0.
Earlier, you had established that your history test had a reliability coefficient of 0.7.
You are interested to find out Khairul's true score. The Standard Error of
Measurement is:

$$SEM = SD\sqrt{1 - r} = 2.0\sqrt{1 - 0.7} \approx 1.1$$
Therefore, based on the normal distribution curve (refer to Figure 8.1), Khairul's
true score should be:
(a) Between 75 − 1.1 and 75 + 1.1, or between 73.9 and 76.1, for 68% of the time.
(b) Between 75 − 2.2 and 75 + 2.2, or between 72.8 and 77.2, for 95% of the time.
(c) Between 75 − 3.3 and 75 + 3.3, or between 71.7 and 78.3, for 99% of the time.
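The same calculation can be sketched in a few lines of Python (the figures mirror the worked example above):

```python
# Standard Error of Measurement and the bands around an observed score:
# SEM = SD * sqrt(1 - reliability).
import math

observed_score = 75
standard_deviation = 2.0
reliability = 0.7

sem = standard_deviation * math.sqrt(1 - reliability)  # approximately 1.1
for n_sem, confidence in [(1, "68%"), (2, "95%"), (3, "99%")]:
    low = observed_score - n_sem * sem
    high = observed_score + n_sem * sem
    print(f"{confidence}: between {low:.1f} and {high:.1f}")
```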
SELF-CHECK 8.1
1. Shalin obtains a score of 70 in a biology test. The reliability of the
test is 0.65 and the standard deviation of the test is 1.5. Compute the
true score of Shalin for the biology test.
(Hint: Use 1 standard error of measurement)
2. Define the reliability of a test.
3. What does the reliability coefficient indicate?
4. Explain the concept of true score.
(a) Test-retest
Using the test-retest technique, the same test is administered again to the
same group of learners. The scores obtained in the first administration of the
test are correlated with the scores obtained in the second administration of
the test. If the correlation between the two sets of scores is high, then the test
can be considered to have high reliability. However, a test-retest situation is
somewhat difficult to conduct, as it is unlikely that learners will be prepared
to take the same test twice.
There is also the effect of practice and memory that may influence the
correlation. The shorter the time gap, the higher the correlation; the longer
the time gap, the lower the correlation. This is because the two observations
are related over time. Since this correlation is the test-retest estimate of
reliability, you can obtain considerably different estimates depending on the
interval.
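Computationally, the test-retest estimate is nothing more than the correlation between the two sets of scores. Below is a minimal sketch with hypothetical scores (statistics.correlation requires Python 3.10 or later):

```python
# Test-retest reliability: correlate the scores from the first and second
# administrations of the same test to the same learners (hypothetical data).
import statistics

first_administration = [55, 62, 70, 48, 81, 66, 59, 74]
second_administration = [58, 60, 73, 50, 79, 68, 57, 76]

reliability_estimate = statistics.correlation(first_administration,
                                              second_administration)
print(round(reliability_estimate, 2))  # a value near 1.0 suggests high reliability
```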
The following are two common internal consistency measures that can be
used.
(i) Split-half
To solve the problem of having to administer the same test twice,
the split-half technique is used. In the split-half technique, a test is
administered once to a group of learners. The test is divided into two
equal halves after the learners have completed the test. This technique
is most appropriate for tests which include multiple-choice items, true-
false items and perhaps short-answer essays. The items are split
using the odd–even method, whereby one half of the test consists of odd-
numbered items while the other half consists of even-numbered items.
Then, the scores obtained for the two halves are correlated, and the
reliability of the whole test is estimated using the Spearman-Brown
formula:
$$r_{sb} = \frac{2r_{xy}}{1 + r_{xy}}$$

where $r_{xy}$ is the correlation between the scores on the two halves.

(ii) Cronbach's Alpha
For a test of $k$ dichotomously scored items, Cronbach's alpha can be
computed from the difficulty index $p_i$ of each item and the variance
of the total scores $\sigma_x^2$:

$$\text{Cronbach's alpha} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i(1-p_i)}{\sigma_x^2}\right)$$
For Example:
Suppose that in a multiple-choice test consisting of five items or
questions, the following difficulty index for each item was observed:
$p_1 = 0.4$, $p_2 = 0.5$, $p_3 = 0.6$, $p_4 = 0.75$ and $p_5 = 0.85$. The sample
variance of the total scores is $\sigma_x^2 = 1.84$. Cronbach's alpha would be
calculated as follows:

$$\text{Cronbach's alpha} = \frac{5}{5-1}\left(1 - \frac{1.045}{1.840}\right) = 0.54$$
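The same computation as a brief Python sketch, using the values from the example:

```python
# Cronbach's alpha for dichotomous items: each item's variance is p * (1 - p),
# summed across the k items and compared with the variance of total scores.
difficulty = [0.40, 0.50, 0.60, 0.75, 0.85]  # p_i for the five items
test_variance = 1.84                         # sample variance of total scores
k = len(difficulty)

item_variance_sum = sum(p * (1 - p) for p in difficulty)
alpha = (k / (k - 1)) * (1 - item_variance_sum / test_variance)
print(round(item_variance_sum, 3))  # 1.045
print(round(alpha, 2))              # 0.54
```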
A Word of Caution!
When you obtain a low alpha, you should be careful not to immediately conclude
that the test is a bad test. You should check whether the test measures several
attributes or dimensions rather than one. If it does, Cronbach's alpha is likely to
be deflated. For example, an aptitude test may measure three attributes or
dimensions, such as quantitative ability, language ability and analytical ability.
Hence, it is not surprising that the Cronbach alpha for the whole test may be low,
as the questions may not correlate with each other. Why? This is because the items
are measuring three different types of human abilities. The solution is to compute
three different Cronbach alphas: one for quantitative ability, one for language
ability and one for analytical ability. This tells you more about the internal
consistency of the items in each part of the test.
SELF-CHECK 8.2
1. What is the main advantage of the split-half technique over the test-
retest technique in determining the reliability of a test?
2. Explain the parallel or equivalent forms technique in determining
the reliability of a test.
3. Explain the concept of internal consistency reliability of a test.
In order to find the answers to these questions, let us read further on inter-rater
reliability and intra-rater reliability.
(iii) Ensures that the time allotted is appropriate for the work required;
SELF-CHECK 8.3
Messick (1989) was most concerned about the inferences a teacher draws from the
test score, the interpretation the teacher makes about his learners and the
consequences from such inferences and interpretation. You can imagine the power
an educator holds in his hand when designing a test. Your test could determine
the future of thousands of learners. Inferences based on test of low validity could
give a completely different picture of the actual abilities and competencies of
learners. Three types of validity have been identified: construct validity, content
validity and criterion-related validity which is made up of predictive and
concurrent validity (refer to Figure 8.4).
Thus, to ensure high construct validity, you must be clear about the
definition of the construct you intend to measure. For example, a construct
such as reading comprehension would include vocabulary development,
reading for literal meaning and reading for inferential meaning. Some
experts in educational measurement have argued that construct validity is
the most critical type of validity. You could establish the construct validity
of an instrument by correlating it with another test that measures the same
construct. For example, you could compare the scores obtained on your
reading comprehension test with the scores obtained on another well-known
reading comprehension test administered to the same sample of learners. If
the scores for the two tests are highly correlated, then you may conclude that
your reading comprehension test has high construct validity.
For example, the Science unit on „Energy and Forces‰ may include facts,
concepts, principles and skills on light, sound, heat, magnetism and
electricity. However, it is difficult, if not impossible, to administer a two to
three-hour paper to test all aspects of the syllabus on „Energy and Forces‰
(refer to Figure 8.5). Therefore, only selected facts, concepts, principles and
skills from the syllabus (or domain) are sampled. The content selected will
be determined by content experts who will judge the relevance of the content
in the test to the content in the syllabus or particular domain.
Figure 8.5: Sample of content tested for the unit on "Energy and Forces"
Content validity will be low if the test includes questions testing content
not included in the domain or syllabus. To ensure content validity and
coverage, most teachers use the Table of Specifications (as discussed in
Topic 3). Table 8.2 is an example of a Table of Specifications which specifies
the knowledge and skills to be measured and the topics covered for the unit
on "Energy and Forces". You cannot measure all the content of a topic;
therefore, you will have to focus on the key areas and give due weightage to
those areas that are important. For example, the teacher has decided that 64
per cent of the questions will emphasise the understanding of concepts while
36 per cent will focus on the application of concepts for the five topics. A
Table of Specifications provides teachers with evidence that a test has high
content validity and that it covers what should be covered.
Table 8.2: Table of Specifications for the Unit on "Energy and Forces"

Topics         Understanding of Concepts   Application of Concepts   Total
Light                      7                           4             11 (22%)
Sound                      7                           4             11 (22%)
Heat                       7                           4             11 (22%)
Magnetism                  3                           3              6 (12%)
Electricity                8                           3             11 (22%)
TOTAL                  32 (64%)                    18 (36%)             50
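The weighting in such a table is easy to check mechanically. The sketch below is a hypothetical representation of Table 8.2, confirming the row totals and percentage weights:

```python
# Table of Specifications as a dictionary: topic -> (understanding items,
# application items). Row totals and weights are derived from the counts.
table = {
    "Light":       (7, 4),
    "Sound":       (7, 4),
    "Heat":        (7, 4),
    "Magnetism":   (3, 3),
    "Electricity": (8, 3),
}

total_items = sum(u + a for u, a in table.values())  # 50
for topic, (understanding, application) in table.items():
    row_total = understanding + application
    print(f"{topic}: {row_total} items ({row_total / total_items:.0%})")

understanding_total = sum(u for u, _ in table.values())
print(f"Understanding of concepts: {understanding_total / total_items:.0%}")  # 64%
```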
Content validity is different from face validity, which refers to what the test
superficially appears to measure. Face validity assesses whether the test
"looks valid" to the examinees, the administrative personnel who decide on
its use and other technically untrained observers. Face validity is a weak
measure of validity, but that does not mean that it is incorrect, only that
caution is necessary. Its importance should not be underestimated.
(i) Predictive validity relates to whether the test accurately predicts some
future performance or ability. Is the STPM examination a good
predictor of performance in university? One difficulty in calculating the
predictive validity of the STPM is that only those who pass the
examination proceed to university (generally speaking), and we do not
know how well learners who did not pass the examination might have
done (Wood, 1991). Moreover, only a small proportion of the
population sit for the STPM examination. As such, the observed
correlation between STPM grades and performance at the degree level
may be deflated by this restriction of range.
The test must also reflect the intended learning outcomes. For example, suppose
that in your teaching, learners were not given the opportunity to think critically
and solve problems, yet your test consists of items requiring learners to think
critically and solve problems. In such a situation, the reliability and validity of the
test will be affected.
Adequate time must be allowed for the majority of learners to complete the
test. This would reduce wild guessing and instead encourage learners to
think carefully about the answers. Instructions need to be clear to reduce the
effects of confusion on reliability and validity. The physical conditions under
which the test is taken must be favourable for learners. There must be
adequate space, lighting and appropriate temperature. Learners must be able
to work independently and the possibility of distractions in the form of
movement and noise must be guarded against.
Figure 8.6: Graphical representation of the relationship between reliability and validity
The centre or the bullseye is the concept that we are trying to measure. Say for
example, in trying to measure the concept of "inductive reasoning", you are likely
to hit the centre (or the bullseye) if your inductive reasoning test is both reliable
and valid, which is what all test developers aim to achieve (see Figure 8.7(d)).
On the other hand, your inductive reasoning test could be reliable but not valid.
How is that possible? Your test may not measure inductive reasoning but the score
you obtain each time you administer the test is approximately the same (see Figure
8.7(b)). In other words, the test is consistently and systematically measuring the
wrong construct (that is not inductive reasoning). Imagine the consequences of
making judgements about the inductive reasoning of learners using such a test!
The worst case scenario is when the test is neither reliable nor valid (see
Figure 8.7(a)). In this scenario the scores obtained by learners tend to concentrate
at the top half of the target and they are consistently missing the centre. Your
measure in this case is neither reliable nor valid and the test should be rejected or
improved.
The true score is a hypothetical concept referring to the actual ability, competency
and capacity of an individual.
The higher the reliability and validity of your test, the greater the likelihood
you will be measuring the true score of your learners.
Using the test-retest technique, the same test is administered again to the same
group of learners.
For the parallel or equivalent forms technique, two equivalent tests (or forms)
are administered to the same group of learners.
When two or more persons mark essay questions, the extent to which there is
agreement in the marks allotted is called inter-rater reliability.
Some people may think of reliability and validity as two separate concepts. In
reality, reliability and validity are related.
INTRODUCTION
When you develop a test it is important to identify the strengths and weaknesses
of each item. In other words, to determine how well items in a test perform, some
statistical procedures need to be used.
In this topic, we will discuss item analysis which involves the use of three
procedures, namely item difficulty, item discrimination and distractor analysis to
help the test developer decide whether the items in a test can be accepted, modified
or rejected. These procedures are quite straightforward and easy to use, but
educators need to understand the logic underlying these analyses in order to
use them properly and effectively.
What is item analysis? Item analysis is a process which examines the responses to
individual test items or questions in order to assess the quality of those items and
the test as a whole. Item analysis is especially valuable in improving items or
questions that will be used again in later tests. Moreover, it can also be used to
eliminate ambiguous or misleading items in a single test administration.
Specifically in Classical Test Theory (CTT) the statistics produced from analysing
the test results based on test scores include measures of difficulty index and
discrimination index. Analysing the effectiveness of distractors also becomes part
of the process. We will discuss each of these components of item analysis in detail
later.
The quality of a test is determined by the quality of each item or question in the
test. The teacher who constructs a test can only roughly estimate the quality of a
test. This estimate is based on the fact that the teacher has followed all the rules
and conditions of test construction. However, it is possible that this estimation may
not be accurate and certain important aspects have been ignored. Hence, it is
suggested that to obtain a more comprehensive understanding of the test, item
analysis should be conducted on the responses of learners. Item analysis is
conducted to obtain information about individual items or questions in a test and
how the test can be improved. It also facilitates the development of an item or
question bank which can be used in the construction of a test.
(a) Step 1
Obviously, upon receiving the answer sheets, the first step would be to mark
each of the answer sheets.
(b) Step 2
Arrange the 45 answer sheets from the highest score obtained to the lowest
score obtained. The paper with the highest score is on top and the paper with
the lowest score is at the bottom.
(c) Step 3
Multiply 45 (the number of answer sheets) by 0.27 (or 27 per cent), which gives
12.15; round this to 12. The use of the value 0.27 or 27 per cent is not
inflexible; it is possible to use any percentage between 27 and 35 per cent.
However, the 27 per cent rule can be ignored if the class size is too
small. Instead of taking the 27 per cent sample, divide the number of answer
sheets by 2.
(d) Step 4
Arrange the pile of 45 answer sheets according to scores obtained (highest
score to the lowest score). Take out 12 answer sheets from the top of the pile
and 12 answer sheets from the bottom of the pile. Call these two piles the
"high mark" learners and the "low mark" learners. Set aside the middle group
of papers (21 papers). Although these could be included in the analysis, using
only the high and low groups simplifies the procedure.
(e) Step 5
Refer to Item #1 or Question #1:
(i) Count the number of learners from the "high mark" group who
selected each of the options (A, B, C or D); and
(ii) Count the number of learners from the "low mark" group who selected
the options A, B, C or D (see Figure 9.1).
From the analysis, 11 learners from the "high mark" group and two learners from
the "low mark" group selected "B", which is the correct answer. This means that
13 out of 24 learners selected the correct answer. Also, note that all the distractors
(A, C and D) were selected by at least one learner. However, the information
provided in Figure 9.1 is insufficient and further analysis has to be conducted.
What does a difficulty index (p) of 0.54 mean? The difficulty index is a coefficient
that shows the proportion of learners who got the correct answer out of the total
number of learners in the two groups. In other words, 54 per
cent of learners selected the correct answer. Although our computation is based on
the high and low scoring groups only, it provides a close approximation to the
estimate that would be obtained with the total group. Thus, it is proper to say that
the index of difficulty for this item is 54 per cent (for this particular group). Note
that, since difficulty refers to the percentage of learners getting the item right, the
smaller the percentage figure, the more difficult the item. Lien (1980) provides
these guidelines on the interpretation of the difficulty index (see Figure 9.2):
If a teacher believes that the achievement of 0.54 on the item is too low, he can change
the way he teaches to better meet the objective represented by the item. Another
interpretation might be that the item was too difficult, confusing or invalid, in
which case the teacher can replace or modify the item, perhaps using information
from the item's discrimination index or distractor analysis.
Under CTT, the item difficulty measure is simply the proportion of correct
answers from learners for an item. For an item with a maximum score of 2, there
is a slight modification to the computation of the proportion correct.
This item has a possible partial credit scoring of 0, 1 and 2. If the total number of
learners attempting this item is 100, and 23 learners score 0, 60 learners score 1
and 17 learners score 2, then a simple calculation will show that 23 per cent of the
learners score 0, 60 per cent score 1 and 17 per cent score 2 for this particular item.
The average score for this item is (0 × 0.23) + (1 × 0.60) + (2 × 0.17) = 0.94.
Thus, the observed average score of this item is 0.94 out of a maximum of 2, so the
average proportion correct is 0.94/2 = 0.47 or 47 per cent.
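A compact sketch of this partial-credit calculation:

```python
# Difficulty of a partial-credit item: the average score earned divided by the
# maximum possible score (here 2), using the distribution from the example.
score_counts = {0: 23, 1: 60, 2: 17}  # score -> number of learners

n_learners = sum(score_counts.values())  # 100
average = sum(score * n for score, n in score_counts.items()) / n_learners
print(average)      # 0.94
print(average / 2)  # 0.47, the average proportion correct
```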
SELF-CHECK 9.1
A teacher gave a 20-item science test to a group of 35 learners. The correct
answer for Question #25 is „C‰ and the results are as follows:
Options A B C D Blank
High mark group (n = 12) 0 2 8 2 0
Low mark group (n = 12) 2 4 3 2 1
Note that in our example earlier, 11 learners in the "high mark" group and two
learners in the "low mark" group selected the correct answer. This indicates
positive discrimination, since the item differentiates between learners in the same
way that the total test score does. That is, learners with high scores on the test
(high mark group) got the item right more frequently than learners with low
scores on the test (low mark group). Although analysis by inspection may be all
that is necessary for most purposes, an index of discrimination can be easily
computed using the following formula:
$$D = \frac{R_H - R_L}{\frac{1}{2}T}$$

where $R_H$ = number of learners in the "high mark" group with the correct
answer,
$R_L$ = number of learners in the "low mark" group with the correct
answer, and
$T$ = total number of learners in the two groups.
Example:
A test was given to a group of 43 learners; 10 out of the 13 learners in the "high
mark" group got the correct answer, compared to 5 out of 13 in the "low mark"
group. The discrimination index is computed as follows:

$$D = \frac{R_H - R_L}{\frac{1}{2}T} = \frac{10 - 5}{\frac{1}{2}(26)} = \frac{5}{13} \approx 0.38$$
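Both indices can be wrapped in small helper functions; the sketch below reproduces the example (13 learners in each group, with 10 and 5 correct answers):

```python
# Difficulty and discrimination indices computed from the high and low groups.
def difficulty_index(correct_high, correct_low, group_total):
    """Proportion of learners in the two groups with the correct answer."""
    return (correct_high + correct_low) / group_total

def discrimination_index(correct_high, correct_low, group_total):
    """Difference between the groups, scaled by half the combined size."""
    return (correct_high - correct_low) / (group_total / 2)

total = 13 + 13  # high group + low group
print(round(difficulty_index(10, 5, total), 2))      # 0.58
print(round(discrimination_index(10, 5, total), 2))  # 0.38
```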
The formula for the discrimination index is such that if more learners in the „high
mark‰ group chose the correct answer than did learners in the low scoring group,
the number will be positive. At the very least, one would hope for a positive value
as that would indicate that it is knowledge of the question that resulted in the
correct answer.
(a) The greater the positive value (the closer it is to 1.0), the stronger the
relationship is between overall test performance and performance on that
item.
(b) If the discrimination index is negative, that means for some reason learners
who scored low on the test were more likely to get the answer correct. This
is a strange situation which suggests poor validity for an item.
SELF-CHECK 9.2
Table 9.1: Distribution of Scores on a Four-Point Short-Answer Essay Question

Item Score   No. of Learners Earning Each Score   Total Scores Earned
4                            5                            20
3                            6                            18
2                            5                            10
1                            3                             3
0                            1                             0
                                               Total       51

Average score = 51/20 = 2.55
The difficulty index (p) of the item can be computed using the following formula
as suggested by Nitko (2004):
$$p = \frac{\text{Average score}}{\text{Possible range of scores}}$$
Using the information from Table 9.1, the difficulty index of the short-answer essay
question can be easily computed. The average score obtained by the group of
learners is 2.55, while the possible range of score for the item is (4 ă 0) = 4. Thus,
$$p = \frac{2.55}{4} \approx 0.64$$
The difficulty index (p) of 0.64 means that on average learners received 64 per cent
out of the possible maximum score for the item. The difficulty index can be
interpreted in the same way as that of the multiple-choice question discussed in
subtopic 9.3. The item is of a moderate level of difficulty (refer to Figure 9.2).
Note that in computing the difficulty index in the above example, the scores from
the whole group are used to obtain the average score. However, for a large group
of learners, it is possible to estimate the difficulty index for an item based on only
a sample of learners comprising the "high mark" and "low mark" groups, as in the
case of computing the difficulty index of a multiple-choice question.
Using the information from Table 9.1 and presenting it in the format as shown in
Table 9.2, we can compute the discrimination index of the short-answer essay
question.
Table 9.2: Scores of the High Mark and Low Mark Groups

Score                      0   1   2   3   4   Total score   Average
High Mark Group (n = 10)   0   0   1   4   5       34          3.4
Low Mark Group (n = 10)    1   3   4   2   0       17          1.7

"n" refers to the number of learners.
The average score obtained by the upper group of learners is 3.4 while that of the
lower group is 1.7. Using the formula as suggested by Nitko (2004), we can
compute the discrimination index of the short-answer essay question as follows:
$$D = \frac{3.4 - 1.7}{4} = \frac{1.7}{4} \approx 0.43$$
The discrimination index (D) of 0.43 indicates that the short-answer question does
discriminate between the upper and lower groups of learners and at a high level
(refer to Figure 9.3). As in the computation of the discrimination index of
the multiple-choice question for a large group of learners, a sample of learners
comprising the top 27 per cent and the bottom 27 per cent may be used to provide
a good estimate.
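Nitko's (2004) two formulas translate directly into code. The sketch below rebuilds the indices from the Table 9.2 distributions; note that, with equal group sizes, the average of the two group means equals the whole-group average of 2.55:

```python
# Difficulty and discrimination of a polytomous (0-4) essay item, following
# Nitko (2004): both indices are scaled by the possible range of scores.
high_counts = {0: 0, 1: 0, 2: 1, 3: 4, 4: 5}  # score -> number of learners
low_counts = {0: 1, 1: 3, 2: 4, 3: 2, 4: 0}
score_range = 4 - 0

def group_mean(counts):
    n = sum(counts.values())
    return sum(score * freq for score, freq in counts.items()) / n

mean_high = group_mean(high_counts)  # 3.4
mean_low = group_mean(low_counts)    # 1.7

difficulty = ((mean_high + mean_low) / 2) / score_range
discrimination = (mean_high - mean_low) / score_range
print(round(difficulty, 4))      # 0.6375, reported as 0.64
print(round(discrimination, 4))  # 0.425, reported as 0.43
```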
SELF-CHECK 9.3
The following information is the performance of the high mark and the
low mark groups in a short-answer essay question.
Score 0 1 2 3 4
High mark group (n = 10) 2 2 3 1 2
Low mark group (n = 10) 3 2 2 3 0
Figure 9.4: Theoretical relationship between difficulty index and discrimination index
Source: Stanley & Hopkins (1972)
(b) Similarly, when the difficulty index is about 0.1, the discrimination index
drops to about 0.2. What does this mean? The more difficult the question, the
harder it is for that question or item to discriminate between learners who
know the answer and those who do not.
SELF-CHECK 9.4
Example:
Which European power invaded Melaka in 1511?
Generally, a good distractor is able to attract more "low mark" learners to select
that particular response, or distract "low mark" learners towards selecting that
particular response. What determines the effectiveness of distractors? In Figure
9.5, a total of 24 learners selected the options A, B, C and D for a particular
question. Option B is a less effective distractor because many "high mark" learners
(n = 5) selected option B. Option D is a relatively good distractor because two
learners from the "high mark" group and five learners from the "low mark" group
selected this option. The analysis of response options shows that those who missed
the item were equally likely to choose option B and option D. No learners chose
option C; therefore, option C does not act as a distractor. This is because learners
are not choosing between four answer options on this item; they are really
choosing between only three options, as they are not even considering option C.
This makes guessing correctly more likely, which hurts the validity of the item.
The discrimination index can be improved by modifying and improving options B
and C.
SELF-CHECK 9.5
Which British resident was killed by Maharajalela in Pasir Salak?
The answer is B.
(a) Step 1
Arrange the 30 answer sheets from the highest score obtained to the lowest
score obtained.
(b) Step 2
Select the answer sheet that obtained a middle score. Group all answer sheets
above this score as the "high marks" group (mark an "H" on these answer
sheets). Group all answer sheets below this score as the "low marks" group
(mark an "L" on these answer sheets).
(c) Step 3
Divide the class into two groups (high and low) and distribute the „high‰
answer sheets to the high group and the „low‰ answer sheets to the low
group. Assign one learner in each group to be the counter.
(d) Step 4
The teacher then asks the class:
Teacher: The answer for Question #1 is "C"; those who got it correct,
raise your hand.
Counter from "H" group: 14 for group H.
Counter from "L" group: 8 from group L.
(e) Step 5
The teacher records the responses on the whiteboard as follows:
(f) Step 6
Compute the difficulty index for Question #1 as follows:
$$\text{Difficulty index} = \frac{R_H + R_L}{T} = \frac{14 + 8}{30} \approx 0.73$$
(g) Step 7
Compute the discrimination index for Question #1 as follows:
$$\text{Discrimination index} = \frac{R_H - R_L}{\frac{1}{2}T} = \frac{14 - 8}{\frac{1}{2}(30)} = \frac{6}{15} = 0.40$$
Note that earlier, we took 27 per cent of answer sheets in the "high mark"
group and 27 per cent of answer sheets in the "low mark" group from the
total answer sheets. However, in this approach, we divided the total answer
sheets into two groups; there is no middle group. The important thing is to
use a large enough fraction of the group to provide useful information.
Selecting the top and bottom 27 per cent of the group is recommended for
more refined analysis. The method shown in the example may be less
accurate, but it is a "quick and dirty" method.
SELF-CHECK 9.6
Compare the difficulty index and discrimination index obtained using
this rough method with the theoretical model by Stanley and Hopkins in
Figure 9.4. Are the indexes very far out?
ACTIVITY 9.1
(a) From the discussions in the earlier subtopics, it is obvious that the results of
item analysis could provide answers to the following questions:
(iii) Were the items free from irrelevant clues and other defects?
Answers to the above questions can be used to select or revise test items for
future use. This would help to improve the quality of test items and the test
paper for future use. It also saves time for teachers when preparing test items
for future use because good items can be stored in an item bank.
(b) Item analysis data can provide a basis for efficient class discussion of the test
results. Knowing how effectively each test item functions in measuring the
achievement of the intended learning outcome and how learners perform in
each item, teachers can have a more fruitful discussion with the learners as
feedback based on item analysis is more objective and informative. For
example, teachers can highlight the misinformation or misunderstanding
reflected in the choice of particular distractors on multiple-choice questions
or frequently repeated errors on essay-type questions, thereby enhancing the
instructional value of assessment. If, during the discussion, the item analysis
reveals that there are technical defects in the items or the marking scheme,
learners' marks can also be rectified to ensure a fairer test.
(c) Item analysis data can be used for remedial work. The analysis will reveal
the specific areas that the learners are weak in. Teachers can use the
information to focus remedial work directly on the particular areas of
weakness. For example, based on the distractor analysis, it is found that a
specific distractor has a low discrimination with a high number of learners
from both the high mark and low mark groups choosing the option. This
could suggest that there is some misunderstanding of a particular concept.
Remedial lessons can thus be planned to address the problem.
(d) Item analysis data can reveal weaknesses in teaching and provide useful
information to improve teaching. For example, despite the fact that an item
is properly constructed, it has a low difficulty index, suggesting that most
learners fail to answer the item satisfactorily. This might indicate that the
learners have not mastered a particular syllabus content that is being
assessed. This could be due to the weakness in instruction and thus
necessitates the implementation of more effective teaching strategies by the
teachers. Furthermore, if the item is repeatedly difficult for the learners, there
might be a need to revise the curriculum.
(e) Item analysis procedures provide a basis for teachers to improve their skills
in test construction. As teachers analyse learners' responses to items, they
become aware of the defects of the items and what causes them. When
revising the items, they gain experience in rewording the statements so that
they are clearer, rewriting the distractors so that they are more plausible and
modifying the items so that they are at a more appropriate level of difficulty.
As a result, teachers improve their test construction skills.
(a) Item discriminating power does not indicate item validity. A high
discrimination index merely indicates that learners from the high mark
group performed relatively better than the learners from the low mark
group. The division of the high mark and low mark groups is based on the
total test score obtained by each learner, which is an internal criterion. By
using the internal criterion of total test score, item analysis offers evidence
concerning the internal consistency of the test rather than its validity. The
validity of a test needs to be judged using an external criterion, that is, to
what extent the test assesses the learning outcomes intended.
(b) The discrimination index is not always an indicator of item quality. For
example, a low index of discriminating power does not necessarily indicate
a defective item. If an item does not discriminate but it has been found to be
free from ambiguity and other technical defects, the item should be retained
especially in a criterion-referenced test. In such a test, a non-discriminating
item may suggest that all learners have achieved the criterion set by the
teacher. As such, the item does not discriminate between good and weak
learners. Another possible reason why low discrimination occurs for an item
is that the item may be either very easy or very difficult. Sometimes, it is
necessary or desirable to retain such an item in order to measure a representative
sample of learning outcomes and course content. Moreover, an achievement
test is usually designed to measure several different types of learning
outcomes (knowledge, comprehension, application and so on). In such a
case, there will be learning outcomes that are assessed by fewer test items
and these items will have low discrimination because they have less
representation in the total test score. Removing these items from the test is
not advisable as it will affect the validity of the test.
(c) Traditional item analysis data of this type are tentative. They are not fixed,
but are influenced by the type and number of learners being tested and the
instructional procedures employed. The data would thus change with every
administration of the same test items. Therefore, if repeated use of items is
possible, item analysis should be carried out for each administration of each
item. The tentative nature of item analysis should therefore be taken
seriously and the results interpreted cautiously.
An item bank consists of questions that have been analysed and stored because
they are good items. Each stored item will have information on its difficulty index
and discrimination index. Each item is stored according to what it measures
especially in relation to the topics of the curriculum. These items will be stored in
the form of a Table of Specifications indicating the content being measured as well
as the cognitive levels measured. For example, from the item bank, you will be able
to draw items measuring the application of concepts for the topic on "Electricity".
You will also be able to draw items from the bank with different difficulty levels.
Perhaps, you want to arrange easier questions at the beginning of the test so as to
build confidence in learners and gradually introduce questions of increasing
difficulty.
With computerised databases, item banks are easy to access. Teachers will have
hundreds of items at their disposal from which they can draw upon when
developing classroom tests. This would certainly help teachers with the tedious
and time-consuming task of having to construct items or questions from scratch.
Unfortunately, not many educational institutions are equipped with such an item
bank. The more common practice is for teachers to select items or questions from
commercially prepared workbooks, past examination papers and sample items
from textbooks. These sources do not have information about the difficulty index
and discrimination index of items or information about the cognitive levels of
questions or what they aim to measure. Teachers will have to figure out for
themselves the characteristics of the items based on their experience in teaching
the content.
However, there are some issues with regard to the use of an item bank. One of the
major concerns is how to place different test items collected over time on a
common scale. The scale should indicate the difficulty of the items, one scale
per subject matter. Retrieval of items from the bank is made easy when all items
are placed on the same scale.
The person in charge must also make every effort to add only quality items to the
item pool. Developing and maintaining a good item bank requires a great deal of
preparation, planning, expertise and organisation. Although the Item Response
Theory (IRT) approach is not a cure-all for item banking problems, it can solve
many of these issues.
The discrimination index is a basic measure which shows the extent to which
a question discriminates or differentiates between learners in the "high mark"
group and the "low mark" group.
Theoretically, the more difficult or the easier a question (or item) is, the lower
its discrimination index will be.
Generally, a good distractor is able to attract more "low mark" learners to select
that particular response, or distract "low mark" learners towards selecting that
particular response.
An item bank consists of questions that have been analysed and stored because
they are good items.
INTRODUCTION
All the data you have collected on the performance of learners will have to be
analysed. In this topic we will focus on the analysis and interpretation of the data
you have collected about the knowledge, skills and attitudes of your learners.
You analyse and interpret the information you have collected about your learners
quantitatively and qualitatively. For quantitative analysis of data, various
statistical tools are used, which we will be focussing on in this topic. For example,
statistics are used to show the distribution of scores on a Geography test and the
average score obtained by learners.
Even the use of percentages may not be meaningful. For example, getting 64 per
cent in the test may be considered "good" if the test was a difficult test. On the
other hand, if the test was an easy one, then 64 per cent may be considered to be
only "average". In other words, to get a more accurate picture of the scores
obtained by learners in the test, the teacher should:
(a) Find out which learner obtained the highest marks in the class and the
number of questions correctly answered;
(b) Find out which learner obtained the lowest marks in the class and the
number of questions correctly answered; and
(c) Find out the number of questions correctly answered by all learners in the
class.
This illustrates that the marks obtained by learners in a test should be carefully
examined. It is not sufficient to just report the marks obtained. More information
should be given about the marks obtained, and to do this you have to rely on
statistics. Some teachers may be afraid of statistics, while others may regard it as
too time-consuming. In fact, many of us often use statistics without being aware of
it. For example, when we talk about average rainfall, per capita income, interest
rates and percentage increases in our daily lives, we are talking the language of
statistics. What is statistics?
The "frequency" column indicates how many learners obtained each mark shown,
and the corresponding percentage is shown in the "percentage" column. You can
describe these scores using two types of measures, namely central tendency and
dispersion.
(i) Mean
The mean is the most commonly used measure of central tendency.
When we talk about an „average‰, we usually refer to the mean. The
mean is simply the sum of all the values (marks) divided by the total
number of items (learners) in the set. The result is referred to as the
arithmetic mean. Using the data from Figure 10.1 and applying the
following formula, you can calculate the mean.
$$\text{Mean} = \frac{\sum x}{N} = \frac{35 + 41 + 42 + \cdots + 75}{40} = 53.22$$
(ii) Median
The median is determined by sorting the scores obtained from the lowest to the
highest values and taking the score that is in the middle of the sequence.
For the example in Figure 10.1, the median is 52. There are 17 learners
with scores less than 52 and 17 learners whose scores are greater than
52. If there is an even number of learners, there will not be a single point
at the middle. In this case, you calculate the median by taking the mean
of the two middle points, that is, divide the sum of the two scores by 2.
(iii) Mode
The mode is the most frequently occurring score in the data set. Which
score appears most often in your data set? In Figure 10.1, the mode
is 57 because 7 learners obtained that score. However, you can also have
more than one mode; a distribution with two modes is bimodal.
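All three measures are one-liners in Python's statistics module. Below is a sketch with hypothetical marks (Figure 10.1's data are not reproduced here):

```python
# Mean, median and mode of a set of marks. With an even number of scores,
# statistics.median averages the two middle values, as described above.
import statistics

marks = [35, 41, 42, 52, 52, 57, 57, 57, 60, 75]

print(statistics.mean(marks))    # 52.8: sum of marks / number of marks
print(statistics.median(marks))  # 54.5: average of the middle pair (52, 57)
print(statistics.mode(marks))    # 57: the most frequent mark
```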
Figure 10.2 shows a graph with the distribution of Bahasa Malaysia scores.
SELF-CHECK 10.1
(b) Dispersion
Although a mean tells us about the groupÊs average performance, it does not
tell us how close to the average or mean learners scored. For example, did
every learner score 80 per cent in the test or were the scores spread out from
0 to 100 per cent? Dispersion is the distribution of the scores. Among the
measures used to describe spread are range and standard deviation.
(i) Range
The range of scores in a test refers to the difference between the lowest and
highest scores obtained in the test; it is the distance between the extremes of
a distribution.
Based on the raw scores, you can calculate the standard deviation using the
formula given in the following.
$$\text{Standard deviation} = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}} = \sqrt{\frac{153}{9}} = \sqrt{17} \approx 4.12$$
(a) The first step in computing the standard deviation is to find the mean, which
is 390 divided by 10 = 39.
(b) Next, subtract the mean from each score in the column labelled $(x - \bar{x})$.
(c) This is followed by the calculation in the column on the right labelled
$(x - \bar{x})^2$. Note that all numbers in this column are positive. The squared
differences are then summed and the square root calculated.
(d) The standard deviation is 4.12, which is the positive square root of 153
divided by 9.
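The same steps in code, using hypothetical marks that also average 39 (the module's raw scores are in a figure not reproduced here, so the spread differs from the example):

```python
# Standard deviation step by step: mean, deviations, squared deviations,
# then the square root of their sum divided by N - 1.
import math

scores = [33, 35, 36, 38, 39, 39, 41, 42, 43, 44]  # hypothetical marks, N = 10
mean = sum(scores) / len(scores)                   # 39.0
squared_deviations = [(x - mean) ** 2 for x in scores]
sd = math.sqrt(sum(squared_deviations) / (len(scores) - 1))
print(mean, round(sd, 2))  # 39.0 3.59 (the module's data give 39 and 4.12)
```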
To better understand what the standard deviation means, refer to Figure 10.4
which shows the spread of scores with the same mean but different standard
deviations.
Note that the smaller the standard deviation, the more the scores tend to
"bunch" around the mean, and vice versa. Hence, it is not enough to just examine
the mean alone, because the standard deviation tells us a lot about the spread of
the scores around the mean. Which class do you think performed better? The mean
does not tell us which class performed better. Class C performed the best because
approximately two-thirds (2/3) of the learners scored between 38 and 40.
SELF-CHECK 10.2
Skew
Skew refers to the symmetry of a distribution. A distribution is skewed if one of
its tails is longer than the other. Refer to Figure 10.5 which shows the distribution
of the scores obtained by 38 learners on a History test. There is a negative skew
because it has a longer tail in the negative direction. What does it mean? It means
that more learners were getting high scores in the history test which may indicate
that either the test was too easy or the teaching methods and materials were
successful in bringing about the desired learning outcomes.
SELF-CHECK 10.3
A teacher administered an English test to 10 children in her class. The
children earned the following marks: 14, 28, 48, 52, 77, 63, 84, 87, 90 and
98. For the distribution of marks, find the following:
(a) Mean
(b) Median
(c) Range
With just the raw scores, what can you say about Zulinda's performance on these
tests or her standing in the class? Well, actually not very much. Without knowing
how these raw scores compare to the total distribution of raw scores for each
subject, it is difficult to draw any meaningful conclusions regarding her relative
performance in each of these tests.
(a) Assume that the scores of all three tests are approximately normally
distributed.
(b) The mean and standard deviation of the three tests are as follows:
Based on the additional information, what statements can you make regarding
Zulinda's relative performance in each of the three tests? The following are some
conclusions you can make:
(a) Zulinda scored the best in the History test and her raw score of 72 falls at a
point one standard deviation above the mean;
(b) Her next best score is English and her raw score of 40 falls exactly at the mean
of the distribution of the scores; and
(c) Finally, even though her raw score for Science was 80, it falls one standard
deviation below the mean.
Raw scores, like Zulinda's scores, can be converted to two types of standard scores
which are the Z score and T score.
(a) Z Score
Converting Zulinda's raw scores into z scores, we can say that she
achieved a:
The formula used for transforming raw scores into z scores involves
subtracting the mean from the raw score and dividing the result by the
standard deviation:
$$z = \frac{x - \bar{x}}{SD}$$

For example, for a raw score of 52 on a Geography test with a mean of 70 and
a standard deviation of 7.5:

$$z = \frac{x - \bar{x}}{SD} = \frac{52 - 70}{7.5} = \frac{-18}{7.5} = -2.4$$

The z score computed for the raw score of 52 is −2.4, which means that
Kumar's score for the Geography test is located 2.4 standard deviations
below the mean.
Test 1 Test 2
Seng Huat 30 50
Mei Ling 45 35
Mean 42 47
Standard Deviation 7 8
The teacher could use the mean to determine who is better, but both learners
have the same mean. How does the teacher decide? By using the z score, the
teacher can know how far from the mean the scores of the two learners are,
and thus who performed better. Using the formula above, the teacher
computes the z scores shown in the following:
Upon examination of the information in the table, the teacher finds that both
Seng Huat and Mei Ling have negative z scores for the total of both tests.
However, Mei Ling has a higher total z score (−1.07) compared to Seng Huat's
total z score (−1.34). In other words, Mei Ling's total score was closer to the
mean, and therefore the teacher concludes that Mei Ling did better than Seng
Huat.
Z scores are relatively simple to use, but many educators are reluctant to use
them, especially when test scores are reported as negative numbers. How would
you like to have your Mathematics score reported as −4? For these reasons,
alternative standard score methods are used, such as the T score.
(b) T Score
The T score was developed by W. McCall in the 1920s and is one of the many
standard scores currently being used. T scores are widely used in psychology
and education especially when reporting performance in standardised tests.
The T score is a standardised score with a mean of 50 and a standard
deviation of 10. The formula for computing the T score is:
T = 10(z) + 50
Say, for example, a learner has a z score of −1.0. To convert it to a T score:

T = 10(−1.0) + 50 = 40
When converting z scores to T scores, you should be careful not to drop the
negatives. Dropping the negatives will result in a completely different score.
SELF-CHECK 10.4
z score T score
+1.0
−2.4
+1.8
Why would you use T scores rather than z scores when reporting the
performance of students in the classroom?
Figure 10.7 shows a normal distribution curve for IQ based on the Wechsler
Intelligence Scale for Children. In a normal distribution, about two-thirds (⅔) of
individuals will have an IQ of between 85 and 115 with a mean of 100. According
to the American Association of Mental Retardation (2006), individuals who have
an IQ of less than 70 may be classified as mentally retarded or mentally challenged
and those who have an IQ of more than 130 may be considered as gifted.
Figure 10.7: A normal distribution curve of IQ based on the Wechsler Intelligence Scale
for Children
The area under the curve between standard deviations is shown as a percentage
on the diagram. For example, the area between the mean and
standard deviation +1 is 34.13%. Similarly, the area between the mean and
standard deviation ă1 is also 34.13%. Hence, the area between standard deviation
ă1 and standard deviation +1 is 68.26%. It means that in a normal distribution,
68.26% of individuals will score between standard deviations ă1 and +1.
Note that in Figure 10.7, z scores are indicated from +1 to +4 and −1 to −4, with the
mean as 0. Each interval is equal to one standard deviation. Similarly, T scores are
reported from 10 to 90 (interval of 10) with the mean set at 50. Each interval of 10
is equal to one standard deviation.
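The percentages quoted above can be verified with a short computation, since the proportion of a normal distribution within n standard deviations of the mean is erf(n/√2):

```python
# Area under the normal curve within n standard deviations of the mean.
import math

def area_within(n_sd):
    return math.erf(n_sd / math.sqrt(2))

print(f"{area_within(1):.2%}")  # 68.27% (the 68.26% quoted above, to rounding)
print(f"{area_within(2):.2%}")  # 95.45%
print(f"{area_within(3):.2%}")  # 99.73%
```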
• The term "central tendency" refers to the "middle" value and is measured
using the mean, median and mode. It is an indication of the location of the
scores.
• The mean is simply the sum of all the values (marks) divided by the total
number of items (learners) in the set.
• The median is determined by sorting the scores obtained from the lowest to
the highest values and taking the score that is in the middle of the sequence.
• The mode is the most frequently occurring score in the data set.
• The range of scores in a test is the distance between the lowest score and the
highest score obtained in the test.
• Standard deviation refers to how much the scores obtained by learners deviate
or differ from the mean.
• The standard score refers to a raw score that has been converted from one scale
to another scale using the mean and standard deviation.
• Z scores indicate how many standard deviations away from the mean a score
is located.
• The normal curve (also called the "bell curve") is a hypothetical curve that is
used to model many naturally occurring phenomena.