Assessment and Evaluation


Unit 1: Assessment: Concept, Purpose, and Principles

Introduction
This is an introductory unit which is intended to familiarize you with some basic concepts that you will
encounter while studying this course. Specifically, the concepts of test, measurement, assessment, and
evaluation will be elaborated. Following this, the purposes of educational assessment are described.
Next, there is a brief explanation of the role of educational objectives in assessment. This unit also
presents you with the important principles that have to be adhered to when assessing students' learning.
Finally, the importance of involving students in the assessment process is highlighted, followed by
the most important competencies that professional teachers are expected to possess so as to effectively
assess their students.

1.1 Concepts
There is some confusion and disagreement in defining the concepts test, measurement, assessment, and
evaluation.
Test: In the educational context, a test is the presentation of a standard set of questions to be answered
by students. It is one instrument that is used for collecting information about students' behaviors or
performances.
Measurement: In education, measurement is the process by which the attributes of a person are
measured and described in numbers. It is a quantitative description of the behavior or performance of
students. As educators we frequently measure human attributes such as attitudes, academic
achievement, aptitudes, interests, personality and so forth. Measurement permits more objective
description concerning traits and facilitates comparisons.
Assessment: In educational literature the concepts ‘assessment’ and ‘evaluation’ have been used with
some confusion. Some educators have used them interchangeably to mean the same thing. Others have
used them as two different concepts. Assessment, according to Cizek (in Phye, 1997), is
the planned process of gathering and synthesizing information relevant to the purposes of
(a) discovering and documenting students' strengths and weaknesses, (b) planning and
enhancing instruction, or (c) evaluating progress and making decisions about students.
Evaluation: This concept refers to the process of judging the quality of student learning on the basis of
established performance standards and assigning a value to represent the worthiness or quality of that
learning or performance. It is concerned with determining how well students have learned. When we
evaluate, we are saying that something is good, appropriate, valid, positive, and so forth. Evaluation is
based on assessment that provides evidence of student achievement at strategic times throughout the
grade/course, often at the end of a period of learning.
Evaluation includes both quantitative and qualitative descriptions of student behavior plus value
judgment concerning the desirability of that behavior. The following simple mathematical arrangement
shows the relationship between measurement and evaluation.
Evaluation = Quantitative description of students’ behavior (measurement) + qualitative description of
students’ behavior (non-measurement) + value judgment
So, we can describe evaluation as the comparison of what is measured against some defined criteria to
determine whether it has been achieved, whether it is appropriate, whether it is good, whether it
is reasonable, whether it is valid, and so forth. Evaluation accurately summarizes and communicates to
parents, other teachers, employers, institutions of further education, and students themselves what
students know and can do with respect to the overall curriculum expectations.

1.2 Importance and Purposes of Assessment


One of the first things to consider when planning for assessment is its purpose. Who will use the
results? How will they use them? We also need to have a clear idea as to what purposes assessment
serves.
Classroom assessment involves students and teachers in the continuous monitoring of students'
learning. It provides the staff with feedback about their effectiveness as teachers, and it gives students a
measure of their progress as learners. Through close observation of students in the process of learning
and the collection of frequent feedback on students' learning, teachers can learn much about how
students learn and, more specifically, how students respond to particular teaching approaches.
Classroom assessment helps individual teachers obtain useful feedback on what, how much, and how
well their students are learning. The staff can then use this information to refocus their teaching to help
students make their learning more efficient and more effective.
Thus, based on the reasons for assessment described above, it can be summarized that assessment in
education focuses on:
• helping LEARNING, and;
• improving TEACHING.

Concerning the learner, assessment is aimed at providing information that will help us make decisions
concerning remediation, enrichment, selection, exceptionality, progress and certification. With regard
to teaching, assessment provides information about the attainment of objectives, the effectiveness of
teaching methods and learning materials.

Overall, assessment serves the following main purposes.
1) Assessment is used to inform and guide teaching and learning: A good classroom assessment
plan gathers evidence of student learning that informs teachers' instructional decisions. It provides
teachers with information about what students know and can do. To plan effective instruction,
teachers also need to know what the student misunderstands and where the misconceptions lie. In
addition to helping teachers formulate the next teaching steps, a good classroom assessment plan
provides a road map for students. Students should, at all times, have access to the assessment so
they can use it to inform and guide their learning.
2) Assessment is used to help students set learning goals: Students need frequent opportunities to
reflect on where their learning stands and what needs to be done to achieve their learning goals. When
students are actively involved in assessing their own next learning steps and creating goals to
accomplish them, they make major advances in directing their learning and in what they understand
about themselves as learners.
3) Assessment is used to assign report card grades: Grade reports provide parents, schools, and
other stakeholders, including the government, post-secondary institutions and employers, with
summary information about student learning.
4) Assessment is used to motivate students: Research has shown that students will be confident and
motivated when they experience progress and achievement, rather than the failure and defeat
associated with being compared to more successful peers.

1.3. The Role of Educational Objectives in Assessment


The first step in planning any good teaching is to clearly define the learning objectives or outcomes. A
learning objective is an outcome statement that captures specifically what knowledge, skills, and
attitudes learners should be able to exhibit following instruction. Defining learning objectives is also essential to
the assessment of students’ learning. Effective assessment practice requires relating the assessment
procedures as directly as possible to the learning objectives.
Instructional objectives, which are commonly known as learning outcomes, play a key role in both the
instructional process and the assessment process. They serve as guides for both teaching and learning,
communicate the intent of instruction to others, and provide guidelines for assessing students' learning.
Instructional objectives or learning outcomes are stated in terms of what the students are expected to be
able to do at the end of the instruction. For instance, after teaching students how to solve quadratic
equations, we might expect them to have the skill of solving any quadratic equation. A learning
outcome stated in this way clearly indicates the kind of performance students are expected to exhibit as

a result of the instruction. This situation also makes clear the intent of our instruction and sets the stage
for assessing students' learning. Well-stated learning outcomes make clear the types of student
performance we are willing to accept as evidence that the instruction has been successful.
1.4. Principles of Assessment
Assessment principles consist of statements highlighting what are considered as critical elements of a
system designed to assess student progress. These principles are expressed in terms of elements for a
fair (reliable and valid) assessment system. Different educators and school systems have developed
somewhat different sets of assessment principles. Miller, Linn and Gronlund (2009) have identified the
following general principles of assessment.
1. Clearly specifying what is to be assessed has priority in the assessment process.
2. An assessment procedure should be selected because of its relevance to the characteristics or
performance to be measured.
3. Comprehensive assessment requires a variety of procedures.
4. Proper use of assessment procedures requires an awareness of their limitations.
5. Assessment is a means to an end, not an end in itself.
Perhaps the assessment principles developed by the New South Wales Department of Education and
Training (2008) in Australia are more inclusive than those principles listed by other educators. Let us
look at these principles and compare them with those developed by Miller, Linn and Gronlund as
described above.

1. Assessment should be relevant. Assessment needs to provide information about students'
knowledge, skills and understandings of the learning outcomes specified in the syllabus.
2. Assessment should be appropriate. Assessment needs to provide information about the
particular kind of learning in which we are interested. This means that we need to use a variety
of assessment methods because not all methods are capable of providing information about all
kinds of learning. For example, some kinds of learning are best assessed by observing students;
some by having students complete projects or make products; and others by having students
complete paper-and-pen tasks. Conclusions about student achievement in an area of learning are
valid only when the assessment method we use is appropriate and measures what it is supposed
to measure.
3. Assessment should be fair. Assessment needs to provide opportunities for every student to
demonstrate what they know, understand and can do. Assessment must be based on a belief that

all learners are on a path of development and that every learner is capable of making progress.
Students bring a diversity of cultural knowledge, experience, language proficiency and
background, and ability to the classroom. They should not be advantaged or disadvantaged by
such differences that are not relevant to the knowledge, skills and understandings that the
assessment is intended to address. Students have the right to know what is assessed, how it is
assessed and the worth of the assessment. Assessment will be fair or equitable only if it is free
from bias or favoritism.
4. Assessment should be accurate. Assessment needs to provide evidence that accurately reflects
an individual student’s knowledge, skills and understandings. That is, assessments need to be
reliable or dependable in that they consistently measure a student’s knowledge, skills and
understandings. Assessment also needs to be objective so that if a second person assesses a
student’s work, they will come to the same conclusion as the first person. Assessment will be
fair to all students if it is based on reliable, accurate and defensible measures.
5. Assessment should provide useful information. The focus of assessment is to establish where
students are in their learning. This information can be used for both summative purposes, such
as the awarding of a grade, and formative purposes to feed directly into the teaching and
learning cycle.
6. Assessment should be integrated into the teaching and learning cycle. Assessment needs to
be an ongoing, integral part of the teaching and learning cycle. It must allow teachers and
students themselves to monitor learning. From the teacher perspective, it provides the evidence
to guide the next steps in teaching and learning. From the student perspective, it provides the
opportunity to reflect on and review progress, and can provide the motivation and direction for
further learning.
7. Assessment should draw on a wide range of evidence. Assessment needs to draw on a wide
range of evidence. A complete picture of student achievement in an area of learning depends on
evidence that is sampled from the full range of knowledge, skills and understandings that make
up the area of learning. An assessment program that consistently addresses only some outcomes
will provide incomplete feedback to the teacher and student, and can potentially distort teaching
and learning.
8. Assessment should be manageable. Assessment needs to be efficient, manageable and
convenient. It needs to be incorporated easily into usual classroom activities and it needs to be
capable of providing information that justifies the time spent.

1.5. Assessment and Some Basic Assumptions
Angelo and Cross (1993) have listed basic assumptions of classroom assessment, which are
described as follows:
1. If assessment is to improve the quality of students learning, both teachers and students must be
actively involved in the process.
2. To improve their effectiveness, teachers need first to make their goals and objectives explicit and
then to get specific, comprehensible feedback on the extent to which they are achieving those
goals and objectives. Effective assessment begins with clear goals. Before teachers can assess
how well their students are learning, they must identify and clarify what they are trying to teach.
After teachers have identified specific teaching goals they wish to assess, they can better
determine what kind of feedback to collect.
3. To improve their learning, students need to receive appropriate and focused feedback early and
often; they also need to learn how to assess their own learning.
4. The type of assessment most likely to improve teaching and learning is that conducted by
teachers to answer questions they themselves have formulated in response to issues or problems
in their own teaching. To best understand their students’ learning, teachers need specific and
timely information about the particular individuals in their classes. As a result of the different
students’ needs, there is often a gap between assessment and student learning. One goal of
classroom assessment is to reduce this gap.
5. Systematic inquiry and intellectual challenge are powerful sources of motivation, growth, and
renewal for teachers, and classroom assessment can provide such challenge. Classroom
assessment is an effort to encourage and assist those teachers who wish to become more
knowledgeable, involved, and successful.
6. By collaborating with colleagues and actively involving students in classroom assessment
efforts, teachers (and students) enhance learning and personal satisfaction. By working together,
all parties achieve results of greater value than those they can achieve by working separately.

1.6. Assessment, Learning, and the Involvement of Students


There is considerable evidence that assessment is a powerful process for enhancing learning.
Classroom assessment promotes learning when teachers use it in the following ways:
• When they use it to become aware of the knowledge, skills, and beliefs that their students bring
to a learning task, and;

• When they use this knowledge as a starting point for new instruction, and monitor students'
changing perceptions as instruction proceeds.
Learning is also enhanced when students are encouraged to think about their own learning, to review
their experiences of learning and to apply what they have learned to their future learning. Assessment
provides the feedback loop for this process.
Assessment also enhances students’ learning by increasing their motivation. Assessment can enhance
student motivation by:
• emphasizing progress and achievement rather than failure
• providing feedback to move learning forward
• reinforcing the idea that students have control over, and responsibility for, their own learning
• building confidence in students so they can and need to take risks
• being relevant, and appealing to students’ imaginations
• providing the scaffolding that students need to genuinely succeed
Assessment is also an important instrument for implementing differentiated learning. Classes consist of
students with different needs, backgrounds, and skills. Teachers find ways to create a wide range of
learning options and paths, so that all students have the opportunity to learn as much as they can, as
deeply as they can, and as efficiently as they can.
Assessment practices lead to differentiated learning when teachers use them to gather evidence to
support every student’s learning, every day in every class. The learning needs of some students may
require individualized learning plans.
There are two ways in which students can be involved in assessment: self-assessment and peer
assessment. Self-assessment involves students judging their own work. It begins with students
understanding the learning objectives for the particular lesson and the success criteria for the specific
task or activity. It develops into students’ awareness of their own strengths and weaknesses in a
particular subject (and as a learner in general) and the ability to identify their own ‘next steps’ or
targets. Self-assessment allows students to think more carefully about what they know and do not
know, and what they need to know to accomplish certain tasks.
Peer assessment, by contrast, involves students making judgments about other students' work. Students
learn how to make better sense of assessment criteria if they have to give feedback and/or marks
against them.

1.7 Assessment and Teacher Professional Competence in Ethiopia
Assessment requires much of a teacher's professional time, both inside and outside the classroom.
Therefore, a teacher should have some basic competencies in classroom assessment so as to be able to
effectively assess his/her students' learning.
A teacher's professional role and responsibilities for student assessment can be conceptualized as
falling along a time continuum. Assessment activities occur prior to instruction, during instruction, and
after instruction. Assessment prior to instruction provides a teacher with information about individual
differences among students as well as an understanding of the background or prior knowledge of the
class as a whole. These assessment activities provide the basis for planning instruction.
Assessment during instruction provides information about the overall progress of the whole class as
well as specific information about individual students. These assessment activities provide the basis for
monitoring progress during learning.
Following the teaching of a specific unit, semester, academic year, or the like, decisions must be made
about the achievement of short and long-term instructional goals. This is assessment after instruction.

In addition to these activities, communication skills are needed to interpret and report performance
standards or levels of achievement to students and parents.
The standards articulating teacher competence in the educational assessment of students are described
below.
1. Teachers should be skilled in choosing assessment methods appropriate in light of instructional
objectives.
2. Teachers should be skilled in developing assessment methods that support accurate and fair
(valid) instructional decisions.
3. Teachers should be skilled in administering, scoring, and interpreting the results of assessment
methods.
4. Teachers should be skilled in communicating assessment results to students, parents, other lay
audiences, and other educators.
5. Teachers must be well-versed in their own ethical and legal responsibilities in assessment.
In our country, Ethiopia, the Ministry of Education (MoE) has also developed such assessment-related
competencies, which professional teachers are expected to possess.
Unit Summary

• Test, measurement, assessment and evaluation are concepts that are frequently used in the area
of educational assessment and evaluation, often with varying meanings and some confusion.
However, although they overlap, they vary in scope and have different meanings.
• Assessment serves many important purposes including: informing and guiding teaching and
learning; helping students set learning goals; assigning report card grades; motivating students.
• Assessment should be designed in such a way that it will elicit information about students'
progression towards the educational objectives.
• There are some important principles that professional teachers should be aware of that guide the
assessment process of students' learning.
• Any assessment process is based on certain basic assumptions.
• Assessment is an integral part of the teaching and learning process and is an important tool
for enhancing learning.
• In order to maximize the benefits students can get out of assessment, they should be involved in
the assessment process.
UNIT TWO
Assessment Strategies, Methods, and Tools
2.1 Introduction
In the previous unit the major concepts of educational assessment and evaluation were defined. The
purposes and principles of assessment were also discussed. In this unit, various assessment strategies
that can be used in the context of secondary education will be dealt with. In addition, the planning,
construction and administration of classroom tests will be discussed.

2.2 Types of assessment


There are different approaches to conducting assessment in the classroom. Here, we are going to see
three pairs of assessment typologies, namely formal vs. informal, criterion-referenced vs.
norm-referenced, and formative vs. summative assessments.

i. Formative and Summative Assessments


Assessment procedures can be classified according to their functional role during classroom
instruction. One such classification system follows the sequence in which assessment procedures are
likely to be used in the classroom. The most commonly referred to and used categories in this regard
are formative assessment and summative assessment.
a) Formative Assessment: Formative assessments are used to shape and guide classroom instruction.
They can include both informal and formal assessments (which will be discussed later in this
section) and help us to gain a clearer picture of where our students are and what they still need help
with. They can be given before, during, and even after instruction, as long as the goal is to improve
instruction.
Formative assessments are ongoing assessments, reviews, and observations in a classroom. They
serve a diagnostic function for both students and teachers. Students receive feedback that they can
use to adjust and improve their performance or other aspects of their engagement in the unit, such as
study techniques. Teachers can conduct formative assessment at any point in a unit of study.
Formative assessment is also known by the name 'assessment for learning'.
There is still another name which is associated with the concept of formative assessment:
'continuous assessment'. Continuous assessment is a teaching approach as well as a process of
deciding to what extent the educational objectives are actually being realized during instruction. In
schools, continuous assessment of learning is usually carried out by teachers on the basis of
impressions gained as they observe their students at work or by various kinds of tests given
periodically. Therefore, each decision is based on various types of information that are determined
through different assessment methods at different times by teachers.

The following are some of the strategies of assessment you can employ in your classrooms:
o make your students write their understanding of vocabulary or concepts before and after
instruction.
o ask students to summarize the main ideas they've taken away from your presentation,
discussion, or assigned reading.
o make students complete a few problems or questions at the end of instruction and check
answers.
o interview students individually or in groups about their thinking as they solve problems.
o assign brief, in-class writing assignments (e.g., "Why is this person or event
representative of this time period in history?")

Tests and homework can also be used formatively if teachers analyze where students are in their
learning and provide specific, focused feedback regarding performance and ways to improve it.
b) Summative Assessment: Summative assessment typically comes at the end of a course (or unit) of
instruction. It evaluates the quality of students' learning and assigns a mark to the student's work
based on how effectively learners have addressed the performance standards and criteria.

Assessment tasks conducted during the progress of a semester may be regarded as summative in
nature if they only contribute to the final grades of the students.
The techniques used in summative assessment are determined by the instructional goals. Typically,
however, they include teacher made achievement tests, ratings of various types of performance,
and assessment of products (reports, drawings, etc.).

A particular assessment task can be both formative and summative. For example, students could
complete unit 1 of their Module and complete an assessment task for which they earned a mark
that counted towards their final grade. In this sense, the task is summative. They could also receive
extensive feedback on their work. Such feedback would guide learners to achieve higher levels of
performance in subsequent tasks. In this sense, the task is formative – because it helps students
form different approaches and strategies to improve their performance in the future.

ii. Formal and Informal Assessment


Assessment can also be either formal or informal. Let us try to understand their differences from the
following paragraphs.
a) Formal Assessment: This usually implies a written document, such as a test, quiz, or paper. A
formal assessment is given a numerical score or grade based on student performance. We will deal
more with formal assessment strategies, particularly tests, in a later section.
b) Informal Assessment: "Informal" is used here to indicate techniques that can easily be
incorporated into classroom routines and learning activities. Informal assessment techniques can be
used at any time without interfering with instructional time. Their results are indicative of the
student's performance on the skill or subject of interest.
An informal assessment usually occurs in a more casual manner and may include observation,
inventories, checklists, rating scales, rubrics, performance and portfolio assessments, participation,
peer and self evaluation, and discussion. Formal tests assume a single set of expectations for all
students and come with prescribed criteria for scoring and interpretation. Informal assessment, on
the other hand, requires a clear understanding of the levels of ability the students bring with them.
Only then may assessment activities be selected that students can attempt reasonably. Informal
assessment seeks to identify the strengths and needs of individual students without regard to grade
or age norms.
Methods for informal assessment can be divided into two main types: unstructured (e.g., student
work samples, journals) and structured (e.g., checklists, observations). The unstructured methods
frequently are somewhat more difficult to score and evaluate, but they can provide a great deal of
valuable information about the skills of the students. Structured methods can be reliable and valid
techniques when time is spent creating the "scoring" procedures. Another important aspect of
informal assessments is that they actively involve the students in the evaluation process - they are
not just paper-and-pencil tests.
iii. Criterion-referenced and Norm-referenced Assessments
How the results of tests and other assessment procedures are interpreted also provides a method of
classifying these instruments. There are two ways of interpreting student performance – criterion-
referenced and norm-referenced.
a) Criterion-referenced Assessment: This type of assessment allows us to quantify the extent to which
students have achieved the goals of a unit of study and a course. It is carried out against previously
specified criteria and performance standards. Where a grade is assigned, it is assigned on the basis
of the standard the student has achieved on each of the criteria. This type of assessment is most
appropriate for quickly assessing what concepts and skills students have learned from a segment of
instruction. Criterion referenced classrooms are mastery-oriented, informing all students of the
expected standard and teaching them to succeed on related outcome measures. Criterion referenced
assessments help to eliminate competition and may improve cooperation.
b) Norm-referenced Assessment: This type of assessment has as its end point the determination of
student performance based on a position within a cohort of students – the norm group. This type of
assessment is most appropriate when one wishes to make comparisons across large numbers of
students or important decisions regarding student placement and advancement. For example,
students' results in grade 8 national exams in our country are determined based on their relative
standing in comparison to all other students who have taken the exam. Thus, when we say that a
student has scored at the 80th percentile, it does not mean that the student has scored an average of
80%. Rather, it means that the student's score stands above the scores of 80% of the students,
while the remaining 20% of students have scored at or above that particular student's score. The
assignment of ranks to students is another example of norm-referenced interpretation of students' performances.
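To make the idea of a percentile rank concrete, here is a minimal sketch in Python, using a
hypothetical norm group of scores (not taken from the text):

    def percentile_rank(score, all_scores):
        # Percentage of scores in the norm group that fall below the given score.
        below = sum(1 for s in all_scores if s < score)
        return 100 * below / len(all_scores)

    # Hypothetical norm group of ten exam scores.
    norm_group = [35, 42, 48, 51, 55, 58, 62, 67, 73, 88]

    # A student who scored 67 stands above 70% of this group.
    print(percentile_rank(67, norm_group))  # -> 70.0

Note that the percentile rank says nothing about the raw mark itself; it only locates the student within
the cohort.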
To summarize, the criterion-referenced assessment emphasizes description of student’s
performance, and the norm-referenced assessment emphasizes discrimination among individual
students in terms of relative level of learning.

2.3 Assessment Strategies


Assessment strategy refers to those assessment tasks (methods/approaches/activities) in which students
are engaged to ensure that all the learning objectives of a subject, a unit or a lesson have been

adequately addressed. Assessment strategies range from informal, almost unconscious, observation to
formal examinations. Although different subject areas may differ somewhat in the assessment
strategies they use, there is generally a variety of methods that can be used in most
subjects.
When selecting assessment strategies in our subject areas, there are a number of things that we have to
consider. First of all, it is important that we choose the assessment technique appropriate for the
particular behavior being assessed. We have to use a strategy that can give students an opportunity to
demonstrate the kind of behavior that the learning outcome demands. Assessment strategies should also
be related to the course material and relevant to students’ lives. Therefore, we have to provide
assessment strategies that relate to students’ future work.
There are many different ways to categorize learning goals for students. Categorizing helps us to
thoroughly think through what we want students to know and be able to do. One way in which the
different learning outcomes that we want our students to develop can be categorized is presented as
follows:
• Knowledge and understanding: What facts do students know outright? What information
can they retrieve? What do they understand?
• Reasoning proficiency: Can students analyze, categorize, and sort into component parts?
Can they generalize and synthesize what they have learned? Can they evaluate and justify the
worth of a process or decision?
• Skills: We have certain skills that we want students to master such as reading fluently,
working productively in a group, making an oral presentation, speaking a foreign language,
or designing an experiment.
• Ability to create products: Another kind of learning target is student-created products -
tangible evidence that the student has mastered knowledge, reasoning, and specific
production skills. Examples include a research paper, a piece of furniture, or artwork.
• Dispositions: We also frequently care about student attitudes and habits of mind, including
attitudes toward school, persistence, responsibility, flexibility, and desire to learn.
From among the various assessment strategies that can be used by classroom teachers, some are
described below for your consideration as student teachers.
Classroom presentations: A classroom presentation is an assessment strategy that requires students to
verbalize their knowledge, select and present samples of finished work, and organize their thoughts
about a topic in order to present a summary of their learning. It may provide the basis for assessment

upon completion of a student's project or essay. For example, students can be made to present a report
after an educational visit.
Conferences: A conference is a formal or informal meeting between the teacher and a student for the
purpose of exchanging information or sharing ideas. A conference might be held to explore the
student’s thinking and suggest next steps; assess the student’s level of understanding of a particular
concept or procedure; and review, clarify, and extend what the student has already completed.
Exhibitions/Demonstrations: An exhibition/demonstration is a performance in a public setting, during
which a student explains and applies a process, procedure, etc., in concrete ways to show individual
achievement of specific skills and knowledge.
Interviews: You should be familiar with the interviews journalists conduct with different personalities.
An interview can also be used for assessment purposes in educational settings. In such applications, an
interview is a face-to-face conversation in which teacher and student use inquiry to share their
knowledge and understanding of a topic or problem. This form of assessment can be used by the
teacher to:
• explore the student's thinking;
• assess the student's level of understanding of a concept or procedure; and
• gather information, obtain clarification, determine positions, and probe for motivations.
Observation: Observation is a process of systematically viewing and recording students while they
work, for the purpose of making instruction decisions. Observation can take place at any time and in
any setting. It provides information on students' strengths and weaknesses, learning styles, interests,
and attitudes. Observations may be informal or highly structured, and incidental or scheduled over
different periods of time in different learning contexts.
Performance tasks: During a performance task, students create, produce, perform, or present works
on "real world" issues. The performance task may be used to assess a skill or proficiency, and provides
useful information on the process as well as the product.
Portfolios: A portfolio is a collection of samples of a student’s work over time. It offers a visual
demonstration of a student’s achievement, capabilities, strengths, weaknesses, knowledge, and specific
skills, over time and in a variety of contexts. For a portfolio to serve as an effective assessment
instrument, it has to be focused, selective, reflective, and collaborative. Portfolios can be prepared for
different subjects in any educational level.
Questions and answers: Perhaps this is the most widely used strategy among teachers, intended to
involve their students in the learning and teaching process. In this strategy, the teacher poses a

question and the student answers verbally, rather than in writing. This strategy helps the teacher to
determine whether students understand what is being, or has been, presented; it also helps students to
extend their thinking, generate ideas, or solve problems. Strategies for effective question and answer
assessment include:
• Apply a wait time or 'no hands-up rule' to provide students with time to think after a question
before they are called upon randomly to respond.
• Ask a variety of questions, including open-ended questions and those that require more than a
right or wrong answer.
Students’ self-assessments: Self-assessment is a process by which the student gathers information
about, and reflects on, his or her own learning. It is the student’s own assessment of personal progress
in terms of knowledge, skills, processes, or attitudes. Self-assessment leads students to a greater
awareness and understanding of themselves as learners.
Checklists, Rating Scales and Rubrics: These are tools that state specific criteria and allow teachers
and students to gather information and to make judgments about what students know and can do in
relation to the outcomes. They offer systematic ways of collecting data about specific behaviors,
knowledge and skills.
Checklists usually offer a yes/no format in relation to student demonstration of specific criteria. They
may be used to record observations of an individual, a group or a whole class.
Rating Scales allow teachers to indicate the degree or frequency of the behaviors, skills and strategies
displayed by the learner. Rating scales state the criteria and provide three or four response selections to
describe the quality or frequency of student work.
Rubrics use a set of criteria to evaluate a student's performance. They consist of a fixed measurement
scale and detailed description of the characteristics for each level of performance. These descriptions
focus on the quality of the product or performance and not the quantity. Rubrics use a set of specific
criteria to evaluate student performance. They may be used to assess individuals or groups and, as with
rating scales, may be compared over time.
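To illustrate that structure, here is a minimal sketch in Python, with hypothetical criteria and level
descriptions (not taken from the text), of a rubric as a fixed scale plus a quality description for each
level:

    # Each criterion carries a fixed scale; every level has a quality description.
    rubric = {
        "organization": {1: "ideas disconnected", 2: "some logical structure", 3: "clear, logical flow"},
        "evidence":     {1: "claims unsupported", 2: "partially supported",    3: "well-supported claims"},
    }

    # Scoring a piece of work means choosing one level per criterion.
    scores = {"organization": 3, "evidence": 2}
    maximum = sum(max(levels) for levels in rubric.values())  # max over a dict takes the highest level number
    print(sum(scores.values()), "/", maximum)  # -> 5 / 6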

The purpose of checklists, rating scales and rubrics is to:


• provide tools for systematic recording of observations
• provide tools for self-assessment
• provide samples of criteria for students prior to collecting and evaluating data on their work
• record the development of specific skills, strategies, attitudes and behaviors necessary for
demonstrating learning
• clarify students' instructional needs by presenting a record of current accomplishments.

One-Minute Paper: During the last few minutes of the class period, you may ask students to answer
on a half-sheet of paper: "What is the most important point you learned today?" and "What point
remains least clear to you?" The purpose is to obtain data about students' comprehension of a particular
class session. Then you can review responses and note any useful comments. During the next class
period you can emphasize the issues illuminated by your students' comments.

Muddiest Point: This is similar to the 'One-Minute Paper' but only asks students to describe what they
didn't understand and what they think might help. It is an important technique that will help you to
determine which key points of the lesson were missed by the students. Here also, you have to review
the responses before the next class meeting and use them to clarify, correct, or elaborate.

Student-generated test questions: You may allow students to write test questions and model answers
for specified topics, in a format consistent with course exams. This will give students the opportunity to
evaluate the course topics, reflect on what they understand, and learn what good test items are. You may
evaluate the questions and use the good ones as prompts for discussion.

Tests: This is the type of assessment that you are mostly familiar with. A test requires students to
respond to prompts in order to demonstrate their knowledge (orally or in writing) or their skills (e.g.,
through performance). We will learn much more about tests later in this section.

2.4 Assessment in large classes


It is quite obvious that student numbers in a class limit the teaching methods available to teachers.
Similarly, assessment methods are restricted by class size. Due to time and resource constraints,
teachers often use less time-demanding assessment methods, which, however, may not always optimize
student learning.
The existing educational literature has identified various assessment issues associated with large
classes. They include:
a) Surface Learning Approach

b) Feedback is often inadequate

c) Inconsistency in marking

d) Difficulty in monitoring cheating and plagiarism

e) Lack of interaction and engagement

Although these issues can be problems in assessment for any class size, they are worse in large classes
because of the additional limitation and strain on resources. They are problems that are applicable
whether the function of the assessment is to facilitate learning via feedback, or to classify students via
grading.

There are a number of ways to make the assessment of large numbers of students more effective whilst
still supporting effective student learning. These include:
1. Front ending: The basic idea of this strategy is that by putting in an increased effort at the
beginning in setting up the students for the work they are going to do, the work submitted can be
improved. Therefore the time needed to mark it is reduced (as well as time being saved through
fewer requests for tutorial guidance).
2. Making use of in-class assignments: In-class assignments are usually quick and therefore
relatively easy to mark and provide feedback on, but help you to identify gaps in understanding.
Students could be asked to complete a task within the timeframe of a scheduled lecture, field
exercise or practical class. This might be a very quick task, for example, completing a graph, doing
some calculations, answering some quick questions, making brief notes on a piece of text etc. In
some cases it might be possible to merge the in-class assignment with peer assessment.
3. Self- and peer-assessment: Students can perform a variety of assessment tasks in ways which both
save the tutor's time and bring educational benefits, especially the development of their own
judgment skills. These include self-assessment and peer-assessment strategies.
i. Self-assessment reduces the marking load because it ensures a higher quality of work is
submitted, thereby minimizing the amount of time expended on marking and feedback. The
emphasis on student self- assessment represents a fundamental shift in the teacher-student
relationship, placing the primary responsibility for learning with the student. However, there are
problems involved in self-assessment for grading purposes pertaining to their validity and
reliability.

ii. In a similar fashion to self-assessment, peer-assessment can provide useful learning experiences
for students at the same time as reducing the marking load of staff. The use of peer-assessment
can be an effective way of ensuring students get individual feedback that staff may be too busy to
provide in a timely manner given the class numbers involved.

However, as with any form of peer-assessment it needs to be carefully designed. Students need to know
what to do and there needs to be a transparent system by which students can appeal their marks
(especially if used in a summative rather than formative context). The benefits of this approach are that:
• students can get to see how their peers have tackled a particular piece of work,
• they can see how you would assess the work (e.g. from the model answers/answer sheets you've
provided) and;
• they are put in the position of being an assessor, thereby giving them an opportunity to
internalize the assessment criteria.
4. Group Assessments: The most obvious advantage of group-based assessment is that it
significantly reduces the marking load if the group submits only one piece of assessable work. The
major problem of course is that group members may not contribute equally, so how are they to be
rewarded fairly? There is probably no easy solution to this but there is a range of possible strategies
which may go at least some way to addressing the problem.

5. Changing the assessment method, or at least shortening it: Being faced with large numbers of
students will present challenges but may also provide opportunities to either modify existing
assessments or to explore new methods of assessment. You might, for example, be able to reduce
the length of the assessment task you are currently using without detracting from your module's
learning outcomes. Alternatively a large class may provide a new opportunity to make use of peer
and self-assessment.

2.5 Selecting and developing assessment methods and tools


A wide variety of tools are available for assessing student performance and there are approaches that
are suitable for essentially any educational objective you want to test. Examples include objective
exams, short answer and essay exams, portfolios, projects, practical exams, presentations, and
combinations of these. Appropriate tools or combinations of tools must be selected and used if the
assessment process is to successfully provide information relevant to stated educational outcomes.

2.5.1 Constructing Tests


There are a wide variety of styles & formats for writing test items. Miller, Linn, & Gronlund (2009)
make distinctions between classroom tests that consist of objective test items and performance
assessments that require students to construct responses (e.g. write an essay) or perform a particular
task (e.g., measure air pressure). Objective tests are highly structured and require the test taker to select
the correct answer from several alternatives or to supply a word or short phrase to answer a question or

complete a statement. They are called objective because they have a single right or best answer that can
be determined in advance. Performance assessment tasks permit the student to organize and construct
the answer in essay form. Other types of performance assessment tasks may require the student to use
equipment, generate hypotheses, make observations, construct something, or perform for an audience.
For most performance assessment tasks, there is not a single best or right response. Expert judgment is
required to score the performances.

2.5.1.1 Constructing Objective Test Items


There are various types of objective test items. These can be classified into those that require the
student to supply the answer (supply type items) and those that require the student to select the answer
from a given set of alternatives (selection type items). Supply type items include completion items and
short answer questions. Selection type test items include True/False, multiple choice and matching.

Each type of test has its unique characteristics, uses, advantages, limitations, and rules for construction.

True/False Test Items


The chief advantage of true/false items is that they do not require much of the student's time to answer.
This allows a teacher to cover a wide range of content by using a large number of such items. In
addition, true/false test items can be scored quickly, reliably, and objectively by anybody using an
answer key. If carefully constructed, true/false test items also have the advantage of measuring higher
mental processes of understanding, application and interpretation.
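As an illustration of this objective scoring, here is a minimal sketch in Python, with a hypothetical
answer key and set of responses (not taken from the text); anyone who applies the same key arrives at
the same score:

    # Hypothetical answer key and one student's responses.
    answer_key = {1: True, 2: False, 3: True, 4: False, 5: True}
    responses  = {1: True, 2: True,  3: True, 4: False, 5: True}

    # Count the items where the response matches the key.
    score = sum(responses[q] == answer_key[q] for q in answer_key)
    print(f"{score}/{len(answer_key)}")  # -> 4/5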

The major disadvantage of true/false items is that when they are used exclusively, they tend to promote
memorization of factual information: names, dates, definitions, and so on. Some argue that another
weakness of true/false items is that they encourage students to guess. This is because any student
who takes such a test has a 50 percent chance of getting the right answer on each item (a short
calculation below illustrates this). In addition, true/false items:
• Can often lead a teacher to write ambiguous statements due to the difficulty of writing
statements which are clearly true or false
• Do not discriminate between students of varying ability as well as other item types do
• Can often include more irrelevant clues than do other item types
• Can often lead a teacher to favor testing of trivial knowledge
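To make the guessing concern concrete, here is a minimal sketch in Python, using hypothetical numbers
(not taken from the text), of what blind guessing yields on a true/false test:

    from math import comb

    def p_at_least(k, n, p=0.5):
        # Binomial probability of answering at least k of n items correctly by blind guessing.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    # On a 10-item true/false test, a student who guesses every item expects
    # 5 correct answers, and still has about a 17% chance of scoring 70% or better.
    print(round(p_at_least(7, 10), 3))  # -> 0.172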

The following suggestions may help teachers to construct good-quality true/false test items.

• Avoid negative statements, and never use double negatives. In Right-Wrong or True-False
items, negatively phrased statements make it needlessly difficult for students to decide whether that
statement is accurate or inaccurate.
• Restrict single-item statements to single concepts. If you double-up two concepts in a single item
statement, how does a student respond if one concept is accurate and the other isn't?
• Use an approximately equal number of items, reflecting the two categories tested. If you
typically overbook on false items in your True-False tests, students who are totally at sea about an
item will be apt to opt for a false answer and will probably be correct.
• Make statements representing both categories equal in length. Again, to avoid giving away the
correct answers, don't make all your false statements brief and (in an effort to include necessary
qualifiers) make all your true statements long. Students catch on quickly to this kind of test-making
tendency.

Matching Items
A matching item consists of two lists of words or phrases. The test-taker must match components in
one list (the premises, typically presented on the left) with components in the other list (the responses,
typically presented on the right), according to a particular kind of association indicated in the item’s
directions.

Matching items can cover a good deal of content in an efficient fashion. Matching items sometimes
can work well if you want your students to cross-reference and integrate their knowledge regarding the
listed premises and responses.

The major advantage of matching items is their compact form, which makes it possible to measure a
large amount of related factual material in a relatively short time.

The main limitation of matching test items is that they are restricted to the measurement of factual
information based on rote learning. Another limitation is the difficulty of finding homogeneous material
that is significant from the perspective of the learning outcomes. As a result, test constructors tend to
include in their matching items material which is less significant.

The following suggestions are important guidelines for the construction of good matching items.
• Use fairly brief lists, placing the shorter entries on the right. If the premises and responses in a
matching item are too long, students tend to lose track of what they originally set out to look for.
The words and phrases that make up the premises should be short, and those that make up the
responses should be shorter still.
• Employ homogeneous lists. Both the list of premises and the list of responses must be composed
of similar sorts of things. If not, an alert student will be able to come up with the correct associations
simply by "elimination", because some entries in the premises or responses may be clearly
distinguishable from the others.
• Include more responses than premises. If you use the exact same number of responses as
premises in a matching item, then a student who knows half or more of the correct associations is in
a position to guess the rest of the associations with a very good chance of success. For example, with
five premises and five responses each used once, a student who knows four of the associations gets
the fifth one free.
• List responses in a logical order. This rule is designed to make sure you don't accidentally give
away hints about which responses connect with which premises. Choose a logical ordering scheme
for your responses (say, alphabetical or chronological) and stick with it.
• Describe the basis for matching and the number of times a response can be used. To satisfy
this rule, you need to make sure your test's directions clarify the nature of the associations you want
students to use when they identify matches. Regarding the student's use of responses, a phrase such
as the following is often employed: "Each response in the list at the right may be used once, more
than once, or not at all."
• Try to place all premises and responses for any matching item on a single page. This rule's
intent is to free your students from lots of potentially confusing flipping back and forth in order to
accurately link responses to premises.

Short Answer/Completion Test Items


The short-answer item and the completion test item are essentially the same: both can be answered by a
word, phrase, number or formula. They differ in the way the problem is presented. The short-answer
type uses a direct question, whereas the completion test item consists of an incomplete statement
that the student is required to complete. This can be demonstrated by the following examples:
Short answer item: In which year did the Ethiopians defeat the Italian invaders at Adwa?

Completion item: The Ethiopian forces defeated the Italian invaders at Adwa in the year _____.

The short-answer test item is one of the easiest to construct, partly because of the relatively simple
learning outcomes it usually measures. Except for the problem-solving outcomes measured in
Mathematics and Science, it is used almost exclusively to measure the recall of memorized
information.

A more important advantage of the short-answer item is that the students must supply the answer. This
reduces the possibility that students will obtain the correct answer by guessing. They must either recall
the information requested or make the necessary computations to solve the problem presented to them.
Partial knowledge, which might enable them to choose the correct answer on a selection item, is
insufficient for answering a short answer test item correctly.

There are two limitations cited in the use of short-answer test items. One is that they are unsuitable for
assessing complex learning outcomes. The other is the difficulty of scoring. This is especially true
where the item is not phrased clearly enough to require a single definitely correct answer, and where
students' spelling errors complicate the scoring.

The following suggestions will help to make short-answer type test items function as intended.
• Word the item so that the required answer is both brief and specific.
Example: An animal that eats the flesh of other animals is _____. (Poorly stated)
An animal that eats the flesh of other animals is classified as _____. (Better item)
• Do not take statements directly from textbooks to use as a basis for short-answer items. When
taken out of context, such statements are frequently too general and ambiguous to serve as good
short-answer items.
• A direct question is generally more desirable than an incomplete statement.
• If the answer is to be expressed in numerical units, indicate the type of answer wanted. For
computational problems, it is usually preferable to indicate the units in which the answer is to
be expressed.

Multiple-Choice Items
This is the most popular type of selected-response item. It can effectively measure many of the simple
learning outcomes measured by the short-answer item, the true-false item, and the matching item types.
In addition, it can measure a variety of complex cognitive learning outcomes.

A multiple-choice item consists of a problem and a list of suggested solutions. A student is first given
either a question or a partially complete statement. This part of the item is referred to as the item’s
stem. Then three or more potential answer-options are presented. These are usually called alternatives,
choices or options.

There are two important variants in a multiple-choice item:

(1) whether the stem consists of a direct question or an incomplete statement, and
(2) whether the student's choice of alternatives is supposed to be a correct answer or a best answer.

The advantage of the multiple-choice item is its widespread applicability to the assessment of cognitive
skills and knowledge, as well as to the measurement of students’ affect. Another advantage of multiple-
choice items is that it’s possible to make them quite varied in the levels of difficulty they possess.
Cleverly constructed multiple-choice items can present very high-level cognitive challenges to
students. And, of course, as with all selected-response items, multiple-choice items are fairly easy to
score.

The weakness of multiple-choice items is that when students review a set of alternatives for an item,
they may be able to recognize a correct answer that they would never have been able to generate on
their own. In that sense, multiple-choice items can present an exaggerated picture of a student’s
understanding or competence, which might lead teachers to invalid inferences.

Another weakness, one shared by all selected-response items, is that multiple-choice items can never
measure a student’s ability to creatively synthesize content of any sort. Finally, in an effort to come up
with the necessary number of plausible alternatives, novice item-writers sometimes toss in some
alternatives that are obviously incorrect.

Well-constructed multiple-choice items, when deployed along with other types of items, can make a
genuine contribution to a teacher’s assessment arsenal. Here are some useful rules for you to follow.
 The question or problem in the stem must be self-contained. The stem should contain as
much of the item’s content as possible, thereby rendering the alternatives much shorter than
would otherwise be the case.
 Avoid negatively stated stems. Just as with the True/False items, negatively stated stems can
create genuine confusion in students.
 Each alternative must be grammatically consistent with the item’s stem. Grammatical
inconsistency in one or more answer-options supplies students with an unintended clue to the
correct answer.
 Make all alternatives plausible, but be sure that one of them is indisputably the correct or
best answer. As I indicated when describing the weaknesses of multiple-choice items, teachers
sometimes toss in one or more implausible alternatives, thereby diminishing the item
substantially. Although avoiding that problem is important, it’s even more important to make

certain that you really do have one valid correct answer in any item’s list of alternatives, rather
than two similar answers, either of which could be arguably correct.
 Randomly use all answer positions in approximately equal numbers. If you use four-option
items, make sure that roughly one-fourth of the correct answers turn out to be A, one fourth B,
and so on.
 Never use “all of the above” as an answer choice, but use “none of the above” to make
items more demanding.

Students often become confused when confronted with items that have more than one correct answer.
Usually, they will see one correct alternative and instantly opt for it without recognizing that there are
other correct options later in the list. In addition, students will often choose the “all of the above”
option as soon as they realize that two of the alternatives are correct, without considering the
remaining one. However, we can increase the difficulty level of a test item by presenting three or four
answer options, none of which is correct, followed by a correct “none-of-the-above” option.
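One of the rules above, balancing answer positions, is easy to hand off to a script. The following is a
minimal Python sketch, not a prescribed procedure; the (stem, options) item format, with the correct
answer listed first, is a hypothetical convention used only for illustration.

import random

def shuffle_options(items, seed=None):
    # `items` is a hypothetical format: a list of (stem, options) pairs
    # in which options[0] is the correct answer.
    rng = random.Random(seed)
    result = []
    for stem, options in items:
        key = options[0]
        opts = list(options)              # copy so the original is untouched
        rng.shuffle(opts)
        position = "ABCDE"[opts.index(key)]
        result.append((stem, opts, position))
    return result

items = [("An animal that eats the flesh of other animals is classified as _____.",
          ["carnivore", "herbivore", "omnivore", "decomposer"])]
for stem, opts, pos in shuffle_options(items, seed=1):
    print(pos, opts)

After shuffling a whole test this way, a quick tally of the recorded key positions shows whether each
answer position is used in roughly equal numbers.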

2.5.1.2 Constructing Performance Assessments


In the previous paragraphs you have been learning how objective test items should be constructed.
You have learned that well-constructed objective tests can measure a variety of learning outcomes,
from simple to complex. Despite this wide applicability of objective-item types, there remain
significant learning outcomes for which no satisfactory objective measurements have been developed.
These include such outcomes as the ability to recall, organize, and integrate ideas; the ability to express
oneself in writing; and the ability to create rather than merely identify interpretations and applications
of data. Such outcomes require less structuring of responses than objective test items, and it is in the
measurement of these outcomes that written essays and other performance-based assessments are of
great value.
In this section, you will be presented with the most familiar form of performance-based assessment –
essay question. The distinctive feature of essay questions is that students are free to construct, relate,
and present ideas in their own words. Learning outcomes concerned with the ability to conceptualize,
construct, organize, relate, and evaluate ideas require the freedom of response and the originality
provided by essay questions.

Essay questions can be classified into two types – restricted-response essay questions and extended-
response essay questions. Now let us briefly look at these types of questions.

Restricted-response essay questions: These types of questions usually limit both the content and the
response. The content is usually restricted by the scope of the topic to be discussed. Limitations on the
form of response are generally indicated in the question.

Extended-response essays: These types of questions allow students:


 to select any factual information that they think is relevant,
 to organize the answer in accordance with their best judgment, and;
 to integrate and evaluate ideas as they deem appropriate.
This freedom enables them to demonstrate their ability to analyze problems, organize their ideas,
describe in their own words, and/or develop a coherent argument.

In addition to their capacity for measuring higher-order thinking skills, described above, essay
questions have some further advantages, which include the following:
 Extended-response essays focus on the integration and application of thinking and problem
solving skills.
 Essay assessments enable the direct evaluation of writing skills.
 Essay questions, as compared to objective tests, are easy to construct.
 Essay questions have a positive effect on students’ learning.

On the other hand, essay questions also have some limitations which you need to be aware of. Perhaps
the most commonly cited problem with these questions is the unreliability of scoring. Thus, the
same paper may be scored differently by different teachers, and even the same teacher may give
different scores for the same paper at different times. Another limitation is the amount of time required
for scoring the responses. Still another problem with essay tests is the limited sampling of content they
provide.

The improvement of the essay question requires attention to two problems:


a. How to construct essay questions that call forth the desired student response, and
b. How to score the answers so that achievement is reliably measured.
There are some guidelines for improving the reliability and validity of essay scores. The following are
suggestions for the construction of good essay questions:
 Restrict the use of essay questions to those learning outcomes that cannot be measured
satisfactorily by objective items. As we have seen earlier, objective measures have the
advantage of efficiency and reliability. When objective items are inadequate for measuring

learning outcomes, however, the use of essay questions becomes necessary despite their
limitations.
 Structure items so that the student’s task is explicitly bounded. Phrase your essay items so
that students will have no doubt about the response you’re seeking. Don’t hesitate to add details
to eliminate ambiguity.
 For each question, specify the point value, an acceptable response-length, and a
recommended time allocation. What this rule tries to do is give students the
information they need to respond appropriately to an essay item. The less guessing that your
students are obliged to do about how they’re supposed to respond, the less likely it is that you’ll
get lots of off-the-wall essays that don’t give you the evidence you need.
 Employ more questions requiring shorter answers rather than fewer questions requiring
longer answers. This rule is intended to foster better content sampling in a test’s essay items.
With only one or two items on a test, chances are awfully good that your items may miss your
students’ areas of content mastery or non-mastery.
 Don’t employ optional questions. When students are made to choose their essay items from
several options, you really end up with different tests, unsuitable for comparison.
 Test a question’s quality by creating a trial response to the item. A great way to determine
if your essay items are really going to get at the responses you want is to actually try writing a
response to the item, much as a student might do.

As we have seen earlier the most serious limitation with essay questions is related to scoring.
Therefore, the following guidelines would be helpful in making the scoring of essay items easier and
more reliable.
1. Make sure that you are emotionally and mentally composed before you begin scoring.
2. Score all responses to one item before moving to the next item.
3. Write out a model answer in advance to guide yourself in grading the students’ answers.
4. Shuffle the exam papers after scoring each question, before moving to the next.
5. Score without knowing the names of the test takers, to avoid bias.

2.5.2 Table of Specification and Arrangement of Items


Tests are among the most important and commonly used assessment instruments in education. If tests
are to be valid and reliable, they have to be developed from carefully designed plans. They also have
to be arranged according to sound principles of test construction.

Table of Specification
The development of valid, reliable and usable questions involves proper planning. The plan entails
designing a framework that can guide test developers in the item-development process. This is
necessary because classroom tests are a key factor in the evaluation of learning outcomes. The
validity, reliability and usability of such tests depend on the care with which they are planned and
prepared. Planning helps to ensure that the test covers the pre-specified instructional objectives and the
subject matter (content) under consideration. Hence, planning a classroom test involves identifying the
instructional objectives stated earlier and the subject matter (content) covered during the
teaching/learning process. This leads to the preparation of a table of specification (the test blueprint)
for the test, while bearing in mind the type of test that would be relevant for the purpose of testing.
To plan a classroom test that will be both practical and effective in providing evidence of mastery of
the instructional objectives and content covered requires several considerations. Hence, the following
steps serve as a guide in planning a classroom test.
i. Determine the purpose of the test;
ii. Describe the instructional objectives and content to be measured.
iii. Determine the relative emphasis to be given to each learning outcome;
iv. Select the most appropriate item formats (essay or objective);
v. Develop the test blue print to guide the test construction;
vi. Prepare test items that are relevant to the learning outcomes specified in the test plan;
vii. Decide on the pattern of scoring and the interpretation of result;
viii. Decide on the length and duration of the test, and
ix. Assemble the items into a test, prepare directions, and administer the test.

The instructional objectives of the course are critically considered while developing the test items. This
is because the instructional objectives are the intended behavioural changes or intended learning
outcomes of instructional programs which students are expected to possess at the end of the
instructional process. The instructional objectives usually stated for the assessment of behavior in the
cognitive domain of educational objectives are classified by Bloom (1956) in his taxonomy of
educational objectives into knowledge, comprehension, application, analysis, synthesis and evaluation.
The objectives are also given relative weights according to the level of importance and emphasis
placed on them. Educational objectives and the content of a course are the basis on which test
development rests.

A table of specification is a two-way table that matches the objectives and content you have taught
with the level at which you expect your students to perform. It contains an estimate of the percentage
of the test to be allocated to each topic at each level at which it is to be measured. In effect, we
establish how much emphasis to give to each objective or content area. A table of specification guides
the selection of test items, which in effect ensures that the test measures a representative sample of
instructionally relevant tasks.

Developing a table of specification involves:


1. Preparing a list of learning outcomes, i.e. the type of performance students are expected to
demonstrate
2. Outlining the contents of instruction, i.e. the area in which each type of performance is to be shown,
and
3. Preparing the two way chart that relates the learning outcomes to the instructional content.
Now, let us try to understand how a test blue print is developed using the following table of
specification developed for a Geography test as an example.

                             Instructional Objectives
Contents       Knowledge  Comprehension  Application  Analysis  Synthesis  Evaluation  Total  Percent
Air pressure       2            2             1           1         -          -          6      24%
Wind               1            1             1           1         -          -          4      16%
Temperature        2            2             1           1         -          1          7      28%
Rainfall           1            2             1           -         1          -          5      20%
Clouds             1            1             -           1         -          -          3      12%
Total              7            8             4           4         1          1         25
Percent           28%          32%           16%         16%        4%         4%               100%

As can be observed from the table, the rows show the content areas from which the test is to be
sampled, and the columns indicate the level of thinking students are required to demonstrate in each of
the content areas. Thus, the test items are distributed among each of the five content areas, with their
corresponding representation among the six levels of the cognitive domain. The percentage row and
column also show the degree of representation of both the contents and the levels of the cognitive
domain in this particular test. Thus, objectives you consider more important should receive more
representation in the test items. Similarly, content areas on which you have spent more instructional
time should be allotted more test items.
There are also other ways of developing a test blueprint. One of these shows the distribution of test
items among the content areas together with the type of test items to be developed from each content
area. For example, the table of specification that we saw earlier can be prepared in the following way.

                              Item Types
Contents       True/False  Matching  Short Answer  Multiple Choice  Total  Percent
Air pressure       1           1           1              3            6      24%
Wind               1           1           1              1            4      16%
Temperature        1           2           1              3            7      28%
Rainfall           1           1           1              2            5      20%
Clouds             1           -           1              1            3      12%
Total              5           5           5             10           25
Percent           20%         20%         20%            40%                 100%
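Because the totals and percentages in a test blueprint are simple proportions of the whole test, they are
easy to compute or double-check with a short script. Here is a rough Python sketch that recomputes the
row and column figures of the first table above; the dictionary layout is a hypothetical representation,
not a standard format.

# Rows are content areas; the six cells per row are item counts for
# knowledge, comprehension, application, analysis, synthesis, evaluation.
blueprint = {
    "Air pressure": [2, 2, 1, 1, 0, 0],
    "Wind":         [1, 1, 1, 1, 0, 0],
    "Temperature":  [2, 2, 1, 1, 0, 1],
    "Rainfall":     [1, 2, 1, 0, 1, 0],
    "Clouds":       [1, 1, 0, 1, 0, 0],
}
levels = ["Knowledge", "Comprehension", "Application",
          "Analysis", "Synthesis", "Evaluation"]

total = sum(sum(row) for row in blueprint.values())    # 25 items in all

for content, row in blueprint.items():                 # row totals and percents
    print(f"{content:13s} {sum(row):2d}  {100 * sum(row) / total:.0f}%")

for i, level in enumerate(levels):                     # column totals and percents
    col = sum(row[i] for row in blueprint.values())
    print(f"{level:13s} {col:2d}  {100 * col / total:.0f}%")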

Arrangement of test items


There are various methods of grouping items in an achievement test, depending on their purposes. For
most purposes the items can be arranged by a systematic consideration of:
 The type of items used
 The learning outcomes measured
 The difficulty of the items, and
 The subject matter measured
First, the items should be arranged in sections by item type. That is, all true-false items should be
grouped together, then all matching items, then all short-answer or completion items, and then all
multiple-choice items. Extended-response essay questions and performance tasks usually take so
much time that they are best administered alone. If combined with some of the other types of items
and tasks, the extended-response tasks should come last.

Arranging the sections of a test in this order produces a sequence that roughly approximates the
complexity of the outcomes measured, ranging from the simple to the complex. It is then merely a
matter of grouping the items within each item type. For this purpose, items that measure similar
outcomes should be placed together and then arranged in order of ascending difficulty. For example,
the items under the multiple-choice section might be arranged in the following order: knowledge of
terms, knowledge of specific facts, knowledge of principles, and application of principles. Keeping
together items that measure similar learning outcomes is especially helpful in determining the type of
learning outcomes causing students the greatest difficulty.

If, for any reason, it is not feasible to group the items by the learning outcomes measured, then it is still
desirable to arrange them in order of increasing difficulty. Beginning with the easiest items and
proceeding gradually to the most difficult has a motivating effect on students. Also, encountering
difficult items early in the test often causes students to spend a disproportionate amount of time on such
items. If the test is long, they may be forced to omit later questions that they could easily have
answered. With the items classified by item type, the sections of the test and the items within each
section can be arranged in order of increasing difficulty.

To summarize, the most effective method for organizing items in the typical classroom test is to:
a) Form sections by item type;
b) Group the items within each section by the learning outcomes measured; and
c) Arrange both the sections and the items within sections in ascending order of difficulty.
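As a rough illustration of this three-step arrangement, the Python sketch below sorts a small pool of
items by type, then by learning outcome, then by ascending difficulty. The field names and the
difficulty rating (1 = easiest) are hypothetical conventions, not part of any standard.

TYPE_ORDER = {"true-false": 0, "matching": 1, "short-answer": 2,
              "multiple-choice": 3, "essay": 4}
BLOOM_ORDER = {"knowledge": 0, "comprehension": 1, "application": 2,
               "analysis": 3, "synthesis": 4, "evaluation": 5}

def arrange(items):
    # Sections by item type, grouped by outcome, in ascending difficulty.
    return sorted(items, key=lambda it: (TYPE_ORDER[it["type"]],
                                         BLOOM_ORDER[it["outcome"]],
                                         it["difficulty"]))

items = [
    {"id": 3, "type": "multiple-choice", "outcome": "application", "difficulty": 2},
    {"id": 1, "type": "true-false", "outcome": "knowledge", "difficulty": 1},
    {"id": 2, "type": "multiple-choice", "outcome": "knowledge", "difficulty": 1},
]
print([it["id"] for it in arrange(items)])   # -> [1, 2, 3]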

2.6. Administration of Tests


Test Administration refers to the procedure of actually presenting the learning task that the examinees
are required to perform in order to ascertain the degree of learning that has taken place during the
teaching-learning process. This procedure is as important as the process of preparing the test, because
the validity and reliability of test scores can be greatly reduced when a test is poorly administered.
While administering a test, all examinees must be given a fair chance to demonstrate their
achievement of the learning outcomes being measured. This requires providing a physical and
psychological environment conducive to their best efforts, and controlling factors such as malpractice
and unnecessary threats from test administrators that may interfere with valid measurement. Test
administration is also concerned with selecting convenient and accurate procedures for scoring the
results.

There are a number of conditions that may create test anxiety in students and therefore should be
guarded against during test administration. These include:
 Threatening students with tests if they do not behave
 Warning students to do their best “because the test is important”
 Telling students they must work fast in order to finish on time.

 Threatening dire consequences if they fail.

i) Ensuring Quality in Test Administration


Quality and good control are necessary components of test administration. The following guidelines
and steps help to ensure quality in test administration.
 Collect the question papers from the custodian in time to be able to start the test at the
stipulated time.
 Ensure compliance with the stipulated sitting arrangements to prevent collusion between or
among the test takers.
 Ensure orderly and proper distribution of question papers to the test takers.
 Do not talk unnecessarily before the test. Test takers’ time should not be wasted at the beginning
of the test with unnecessary remarks, instructions or threats that may create test anxiety.
 It is necessary to remind the test takers of the need to avoid malpractices before they start and
make it clear that cheating will be penalized.
 Stick to the instructions regarding the conduct of the test and avoid giving hints to test takers
who ask about particular items. But make corrections or clarifications to the test takers whenever
necessary.
 Keep interruptions during the test to a minimum.
ii) Credibility and Civility in Test Administration
Credibility and civility are characteristics of assessment that have day-to-day relevance for developing
educational communities. Credibility deals with the value that the eventual recipients and users of
assessment results place on those results, with respect to the grades obtained, the certificates issued, or
the issuing institution. Civility, on the other hand, asks whether the persons being assessed are in a
condition to give their best, without hindrances and burdens in the attributes being assessed, and
whether the exercise is seen as integral to or external to the learning process.
Hence, in test administration, effort should be made to see that the test takers are given a fair and
unaided chance to demonstrate what they have learnt with respect to:
a) Instructions: A test should contain a set of instructions, which are usually of two types: one for the
test administrator and the other for the test taker. The instructions to the test administrator should
explain how the test is to be administered, the arrangements to be made for proper administration
of the test, and the handling of the scripts and other materials. The instructions to the
administrator should be clear for effective compliance. For the test takers, the instructions should
indicate the amount of work to be done or the tasks to be accomplished.
The instructions should explain how the test is to be performed. Examples may be used to illustrate
and clarify what the test takers should do. The language used in the instructions should be
appropriate to the level of the test takers. When necessary, administrators should explain the
instructions to the test takers to ensure proper understanding, especially when the ability to
understand and follow instructions is not itself part of the test.
b) Duration of the Test: The time for accomplishing the test is technically important in test
administration and should be clearly stated for both the test administrators and test takers. Ample
time should be provided for candidates to demonstrate what they know and what they can do. The
duration of test should reflect the age and attention span of the test takers and the purpose of the
test.
c) Venue and Sitting Arrangement: The test environment should be learner friendly with adequate
physical conditions such as work space, good and comfortable writing desks, proper lighting, good
ventilation, moderate temperature, conveniences within reasonable distance and serenity necessary
for maximum concentration. It is important to provide enough comfortable seats, with an adequate
sitting arrangement, for the test takers’ comfort and to reduce collusion between them.
Adequate lighting, good ventilation and moderate temperature reduce test anxiety and the loss of
concentration which invariably affects performance on the test. Noise is another undesirable factor
that has to be adequately controlled, both within and outside the immediate test environment, since
it affects concentration and test scores.

d) Other necessary conditions: The questions and the question paper should be reader-friendly, with
bold characters, neat, clear and appealing, and not such that they intimidate the test taker into
mistakes. All relevant materials for carrying out the demands of the test should be provided in
reasonable number and quality, and on time.
All these are necessary to enhance the test administration and to make assessment civil in
manifestation.
Unit Summary
In this unit you were introduced to different types of assessment approaches, namely formal vs.
informal, criterion referenced vs. norm referenced, formative vs. summative assessments. You also
learned about various assessment strategies. These include: classroom presentations,
exhibitions/demonstrations, conferences, interviews, observations, performance tasks, portfolios,
question and answer, students’ self assessment, checklists, rating scales and rubrics, one-minute paper,
muddiest point, students-generated questions and tests.
You also learned about the challenges of assessing large classes and their consequences, and some of
the strategies that we can use to minimize those challenges. These strategies include: front-ending,
making use of in-class assignments, self- and peer-assessment, group assessment, and changing the
assessment method, or at least shortening it.
Much of this unit was devoted to the construction of the most widely used assessment technique, that
is, tests. In this regard, tests were classified into two broad categories: objective tests and performance
assessment tasks (essay tests). Objective tests were further divided into supply-type items and
selection-type items. Supply-type items include short-answer and completion items, whereas
selection-type items include true/false items, matching items and multiple-choice items. Essay items
were also classified into restricted-response essay items and extended-response essay items. Here you
have learned about the strengths and limitations of these different test item types. You were also
introduced to the major guidelines you should follow in constructing each of them.

This unit also covered the planning of tests, particularly the preparation of the table of specification,
or test blueprint. You were also familiarized with how test item types should be arranged. Finally, you
learned about the techniques and procedures to follow during test administration.

Unit 3: Item Analysis


3.2. Sections and sub-sections
Item analysis is an important phase in the construction of tests. It is the process of examining or
analyzing testees’ responses to each item on a test, with the basic intent of judging the quality of each
item.
Item analysis helps to determine the adequacy of the items within a test as well as the adequacy of the
test itself. There are several reasons for analyzing questions and tests that students have completed and
that have already been graded. Some of the reasons that have been cited include the following:
1. Identify content that has not been adequately covered and should be re-taught,
2. Provide feedback to students,
3. Determine if any items need to be revised in the event they are to be used again or become part
of an item file or bank,
4. Identify items that may not have functioned as they were intended,
5. Direct the teacher's attention to individual student weaknesses.
The results of an item analysis provide information about the difficulty of the items and the ability of
the items to discriminate between better and poorer students. If an item is too easy, too difficult, failing
to show a difference between skilled and unskilled examinees, or even scored incorrectly, an item
analysis will reveal it. The two most common statistics reported in an item analysis are the item
difficulty and the item discrimination. An additional analysis that is often reported is the distractor
analysis. Once the item analysis information is available, an item review is often conducted. In the
following sections you are going to learn the statistical techniques used to analyse responses to test
items.

3.2.1. Item difficulty level index


Item difficulty index is one of the most useful, and most frequently reported, item analysis statistics. It
is a measure of the proportion of examinees who answered the item correctly; for this reason it is
frequently called the p-value. If scores from all students in a group are included the difficulty index is
simply the total percent correct. When there is a sufficient number of scores available (i.e., 100 or
more) difficulty indexes are calculated using scores from the top and bottom 27 percent of the group.
Item analysis procedures
1. Rank the papers in order from the highest to the lowest score
2. Select one-third of the papers with the highest total score and another one-third of the papers with
lowest total scores
3. For each test item, tabulate the number of students in the upper & lower groups who selected each
option
4. Compute the difficulty of each item (the percentage of students who answered the item correctly)
Item difficulty index can be calculated using the following formula:

P = (HSG + LSG) / N

Where, HSG = the number of students in the high-scoring group who answered the item correctly,
LSG = the number of students in the low-scoring group who answered the item correctly, and
N = the total number of students in the HSG and LSG combined.
The difficulty indexes can range between 0.0 and 1.0 and are usually expressed as a percentage. A
higher value indicates that a greater proportion of examinees responded to the item correctly, and it was
thus an easier item. The average difficulty of a test is the average of the individual item difficulties. For
maximum discrimination among students, an average difficulty of .60 is ideal. For example: If 243
students answered item no. 1 correctly and 9 students answered incorrectly, the difficulty level of the
item would be 243/252 or .96.

In the example below, five true-false questions were part of a larger test administered to a class of 20
students. For each question, the number of students answering correctly was determined, and then
converted to the percentage of students answering correctly.
Question   Correct responses   Item difficulty
1                 15            75% (15/20)
2                 17            85% (17/20)
3                  6            30% (6/20)
4                 13            65% (13/20)
5                 20            100% (20/20)
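The computation is simple enough to script. Here is a minimal Python sketch, assuming responses to
one item are coded 1 for correct and 0 for incorrect:

def difficulty_index(responses):
    # p-value: the proportion of examinees who answered the item correctly.
    return sum(responses) / len(responses)

# Question 3 above: 6 of 20 students answered correctly.
print(difficulty_index([1] * 6 + [0] * 14))   # 0.3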

Activity: Calculate the item difficulty level for the following four-option multiple-choice test item.
(The sign (*) shows the correct answer.)

                Response Options
Groups          A    B    C    D*   Total
High Scorers    0    1    1    8     10
Low Scorers     1    1    5    3     10
Total           1    2    6   11     20

Item difficulty interpretation

P-Value              Percent Range   Interpretation
>= 0.75              75-100          Easy
<= 0.25              0-25            Difficult
between .25 and .75  26-74           Average

For criterion-referenced tests (CRTs), with their emphasis on mastery-testing, many items on an exam
form will have p-values of .9 or above. Norm-referenced tests (NRTs), on the other hand, are designed
to be harder overall and to spread out the examinees’ scores. Thus, many of the items on an NRT will
have difficulty indexes between .4 and .6.

3.2.2. Item discrimination index


The index of discrimination is a numerical indicator that enables us to determine whether the question
discriminates appropriately between lower scoring and higher scoring students. When students who
earn high scores are compared with those who earn low scores, we would expect to find more students
in the high scoring group answering a question correctly than students from the low scoring group. In
the case of very difficult items which no one in either group answered correctly or fairly easy questions
which even the students in the low group answered correctly, the numbers of correct answers might be
equal for the two groups. What we would not expect to find is a case in which the low scoring students
answered correctly more frequently than students in the high group.
Item discrimination index can be calculated using the following formula:

D = (HSG − LSG) / (N/2)

Where, HSG = the number of students in the high-scoring group who answered the item correctly,
LSG = the number of students in the low-scoring group who answered the item correctly, and
N = the total number of students in both groups.
In the example below, there are 8 students in the high scoring group and 8 in the low scoring group
(with 12 students between the two groups who are not represented). For question 1, all 8 in the high
scoring group answered correctly, while only 4 in the low scoring group did so. Thus, success in the
HSG minus success in the LSG gives 8 − 4 = +4. The last step is to divide the +4 by half of the total
number of students in both groups (16/2 = 8), which gives us 4/8 = +.5, the D-value.

Question   Success in the HSG   Success in the LSG   Difference   D value
1                  8                    4             8 − 4 = 4     .5
2                  7                    2
3                  5                    6
Activity 2: Calculate the item discrimination index for the questions 2 & 3 on the table above.

The item discrimination index can vary from -1.00 to +1.00. A negative discrimination index (between
-1.00 and zero) results when more students in the low group answered correctly than students in the
high group. A discrimination index of zero means equal numbers of high and low students answered
correctly, so the item did not discriminate between groups. A positive index occurs when more students
in the high group answer correctly than the low group. If the students in the class are fairly
homogeneous in ability and achievement, their test performance is also likely to be similar, resulting in
little discrimination between high and low groups.
Questions that have an item difficulty index (NOT item discrimination) of 1.00 or 0.00 need not be
included when calculating item discrimination indices. An item difficulty of 1.00 indicates that
everyone answered correctly, while 0.00 means no one answered correctly. We already know that
neither type of item discriminates between students.
When computing the discrimination index, the scores are divided into three groups with the top 27% of
the scores in the upper group and the bottom 27% in the lower group. The number of correct responses
for an item by the lower group is subtracted from the number of correct responses for the item in the
upper group. The difference is divided by the number of students in either group. The process is
repeated for each item.
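Here is a minimal Python sketch of this upper-group/lower-group computation. The
(total_test_score, item_correct) pair format is a hypothetical convention, and the caller supplies the
group size (for example, the top and bottom thirds of a small class, or 27 percent of a large group).

def discrimination_index(papers, group_size):
    # `papers` is a list of (total_test_score, item_correct) pairs, where
    # item_correct is 1 or 0 for the item under analysis.
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    upper = sum(item for _, item in ranked[:group_size])    # correct in upper group
    lower = sum(item for _, item in ranked[-group_size:])   # correct in lower group
    return (upper - lower) / group_size

# Question 1 above: 8 of 8 correct in the upper group, 4 of 8 in the lower.
papers = [(90 - i, 1) for i in range(8)] + [(40 - i, int(i < 4)) for i in range(8)]
print(discrimination_index(papers, group_size=8))   # 0.5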

The value is interpreted in terms of both:

• direction (positive or negative), and
• strength (non-discriminating to strongly discriminating).
These values can range from -1.00 to +1.00.

Item discrimination interpretation

D-Value         Direction   Strength
> +.40          positive    strong
+.20 to +.40    positive    moderate
-.20 to +.20    none        ---
< -.20          negative    moderate to strong

For a small group of students, an index of discrimination for an item that exceeds .20 is considered
satisfactory. For larger groups, the index should be higher because more difference between groups
would be expected. The guidelines for an acceptable level of discrimination depend upon item
difficulty. For very easy or very difficult items, low discrimination levels would be expected; most
students, regardless of ability, would get the item correct or incorrect as the case may be. For items
with a difficulty level of about 70 percent, the discrimination should be at least .30.
When an item is discriminating negatively, overall the most knowledgeable examinees are getting the
item wrong and the least knowledgeable examinees are getting the item right. A negative
discrimination index may indicate that the item is measuring something other than what the rest of the
test is measuring. More often, it is a sign that the item has been mis-keyed.

3.2.3. Distractor Analysis


One important element in the quality of a multiple choice item is the quality of the item’s distractors.
However, neither the item difficulty nor the item discrimination index considers the performance of the
incorrect response options, or distractors. A distractor analysis evaluates the effectiveness of the
distractors in each item by comparing the number of students in the upper and lower groups who
selected each incorrect alternative (a good distractor will attract more students from the lower group
than the upper group).
Just as the key, or correct response option, must be definitively correct, the distracters must be clearly
incorrect (or clearly not the "best" option). In addition to being clearly incorrect, the distractors must
also be plausible. That is, the distractors should seem likely or reasonable to an examinee who is not
sufficiently knowledgeable in the content area.
If a distractor appears so unlikely that almost no examinee will select it, it is not contributing to the
performance of the item. In fact, the presence of one or more implausible distractors in a multiple-
choice item can make the item artificially far easier than it ought to be. Let us try to explain this using
the following table as an example, which shows the responses of eight students to five multiple-choice
questions.

                  A     B     C     D
TEST ITEM NO 1   5**    1     1     1
TEST ITEM NO 2    0     2    6**    0
TEST ITEM NO 3   2**    2     2     2
TEST ITEM NO 4    0    3**    0     5
TEST ITEM NO 5    2     1     0    5**
** Denotes Correct Answer
Over 50% of the students answered question number 1 correctly, and each of the distractors was
selected. The distractors have functioned as they should. The teacher may be less than satisfied with
only 5 of 8 students answering correctly, but a class would generally have more than eight students and
could well have a higher percentage of correct answers while still having effective distractors.
It is not desirable to have one of the distractors chosen more often than the correct answer, as occurred
with question 4. This result indicates a potential problem with the question. Distractor D may be too
similar to the correct answer and/or there may be something in either the stem or the alternatives that is
misleading.
If students do not know the correct answer and are purely guessing, their answers would be expected to
be distributed among the distractors as well as the correct answer, much like question 3. If one or more
distractors are not chosen, as occurs in questions 2, 4, and 5, the unselected distractors probably are not
plausible. If the teacher wants to make the test more difficult, those distractors should be replaced in
subsequent tests.

In a simple approach to distractor analysis, the proportion of examinees who selected each of the
response options is examined. The proportion of examinees who select each of the distractors can be
very informative. For example, it can reveal an item mis-key. Whenever the proportion of examinees
who selected a distractor is greater than the proportion of examinees who selected the key, the item
should be examined to determine if it has been mis-keyed or double-keyed. A distractor analysis can
also reveal an implausible distractor. In criterion referenced tests, where the item p-values are typically
high, the proportions of examinees selecting all the distractors are, as a result, low. Nevertheless, if
examinees consistently fail to select a given distractor, this may be evidence that the distractor is
implausible or simply too easy.
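The simple approach just described is easy to automate. Below is a rough Python sketch that tallies the
share of examinees choosing each option and flags the two warning signs discussed above; the five
percent "rarely chosen" threshold is an illustrative assumption, not a standard value.

from collections import Counter

def distractor_analysis(choices, key, options="ABCD", floor=0.05):
    # Tally each option's share; flag options chosen more often than the
    # key (possible mis-key) and distractors almost never chosen.
    counts = Counter(choices)
    n = len(choices)
    for option in options:
        share = counts[option] / n
        note = ""
        if option != key and counts[option] > counts[key]:
            note = "  <- chosen more often than the key: check for a mis-key"
        elif option != key and share < floor:
            note = "  <- rarely chosen: possibly implausible"
        marker = "*" if option == key else " "
        print(f"{option}{marker} {share:4.0%}{note}")

# Test item 4 from the table above (key = B): 3 chose B, 5 chose D.
distractor_analysis(["B"] * 3 + ["D"] * 5, key="B")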

3.2.4 Item Banking


Building a file of effective test items and assessment tasks involves recording the items or tasks,
adding information from analyses of students’ responses, and filing the records by both the content
area and the objective that the item or task measures. Thus, items and tasks are recorded as they are
constructed; information from the analysis of students’ responses is added after the items and tasks
have been used; and then the effective items and tasks are deposited in the file. In a few years, it is
possible to start using some of the items and tasks from the file and to supplement these with new
items and tasks. As the file grows, it becomes possible to select the majority of the items and tasks for
any given test or assessment from the file without repeating them frequently. Such a file is especially
valuable in areas of complex achievement, where the construction of test items and assessment tasks
is difficult and time consuming. When enough high-quality items and tasks have been assembled, the
burden of preparing tests and assessments is considerably lightened. Computer item banking makes
these tasks even easier.

Unit Four
Interpretation of Scores
Sections and Sub-sections
Test interpretation is the process of assigning meaning and usefulness to the scores obtained from a
classroom test. This is necessary because a raw score obtained from a test rarely has meaning standing
on its own. For instance, a score of 60% on one Assessment and Evaluation of Learning test cannot be
said to be better than a score of 50% obtained by the same test taker on another test in the same
subject. Test scores on their own lack a true zero point and equal units. Moreover, they are not based
on the same standard of measurement, and as such, meaning cannot be read into the scores as a basis
for academic and psychological decisions.

4.1 Kinds of scores
Data differ in terms of what properties of the real number series (order, distance, or origin) we can
attribute to the scores. The most common kinds of scores include nominal, ordinal, interval, and ratio
scales.
A nominal scale involves the assignment of different numerals to categories that are qualitatively
different. For example, we may assign the numeral 1 to males and 2 to females. These symbols do not
have any of the three characteristics (order, distance, or origin) that we attribute to the real number
series: the 2 does not indicate more of something than the 1.
An ordinal scale has the order property of a real number series and gives an indication of rank order.
For example, ranking students based on their performance in a certain athletic event would involve an
ordinal scale. We know who is best, second best, third best, and so on, but the ranks do not tell us
anything about the differences between the scores.

With interval data we can interpret the distances between scores. If, on a test with interval data,
Almaz has a score of 60, Abebe a score of 50, and Beshadu a score of 30, we could say that the
distance between Abebe’s and Beshadu’s scores (50 to 30) is twice the distance between Almaz’s and
Abebe’s scores (60 to 50).

If one measures with a ratio scale, the ratio of the scores has meaning. Thus, a person whose height is
2 meters is twice as tall as a person whose height is 1 meter. We can make this statement because a
measurement of 0 actually indicates no height; that is, there is a meaningful zero point. However, if a
student scored 0 on a spelling test, we would not interpret the score to mean that the student had no
spelling ability.

4.2 Methods of Interpreting test scores


If a student responds correctly to 65 items on an objective test in which each correct item counts one
point, the raw score will be 65. Thus, a raw score is simply the number of points received on a test
when the test has been scored according to the directions. We are all familiar with raw scores from our
many years of taking classroom tests.
years of taking classroom tests. Although a raw score is a numerical summary of a student’s test
performance, it is not meaningful without further information. In general we can provide meaning to a
raw score either by converting it into a description of the specific tasks the student can perform
(criterion referenced interpretation) or converting it into some type of derived score that indicates the
student’s relative position in a clearly defined referenced group (norm referenced interpretation). In
some cases both types of interpretation may be appropriate and useful.

Criterion referenced interpretation
Criterion - referenced interpretation is the interpretation of test raw score based on the conversion of
the raw score into a description of the specific tasks that the learner can perform. That is, a score is
given meaning by comparing it with the standard of performance that is set before the test is given. It
permits the description of a learner’s test performance without referring to the performance of others.
Thus, we might describe a pupil’s performance in terms of the speed with which a task is performed,
the precision with which a task is performed, or the percentage of items correct on some clearly defined
set of learning tasks. The percentage-correct score is widely used in criterion-referenced test
interpretation.
Criterion referenced interpretation of test results is most meaningful when the test has been specifically
designed for this purpose. This typically involves designing a test that measures a set of clearly stated
learning tasks. Enough items are used for each interpretation to make it possible to describe test
performance in terms of students’ mastery or non-mastery of learning tasks.
Norm referenced test interpretation
Norm – referenced interpretation is the interpretation of raw score based on the conversion of the raw
score into some type of derived score that indicates the learner’s relative position in a clearly defined
referenced group. This type of interpretation tells us how an individual compares with other persons
who have taken the same test.
Norm-referenced interpretation is usually applied to classroom tests by ranking the test takers’ raw
scores from highest to lowest. A score is then interpreted by noting the position of an individual’s
score relative to those of the other test takers. An interpretation such as “third from the top” or “about
average in the class” provides a meaningful report for the teacher and the test takers on which to base
decisions. In this type of test score interpretation, what is important is a sufficient spread of test scores
to provide a reliable ranking. The percentage score, or the relative ease or difficulty of the test, is not
necessarily important in interpreting test scores in terms of relative performance.

4.2.1 Measures of Central Tendency


It is often important to summarize characteristics of a distribution of test scores. One characteristic of
particular interest is a measure of central tendency. The goal of the measures of central tendency is to
come up with the one single score that best describes a distribution of scores. They let us know if the
distribution of scores tends to be composed of high scores or low scores.

There are three basic measures of central tendency – the mean, the mode and the median - and
choosing one over another depends on two different things:
1. The scale of measurement used, so that a summary makes sense given the nature of the scores.
2. The shape of the frequency distribution, so that the measure accurately summarizes the
distribution.

The Mean
The mean, or arithmetic average, is the most widely used measure of central tendency. It is the average
of a set of scores computed simply by adding together all scores and dividing by the number of scores.
The mean takes into account the value of each score, and so one extremely high or low score could
have a considerable effect on it. It is helpful to know the mean because then you can see which
numbers are above and below the mean.
Here is an example of test scores for a Math class: 82, 93, 86, 97, 82. To find the mean, first you must
add up all of the numbers (82 + 93 + 86 + 97 + 82 = 440). Now, since there are 5 test scores, we next
divide the sum by 5 (440 ÷ 5 = 88). Thus, the mean is 88. The formula used to compute the mean is as
follows:

X̄ = ∑X / N

Where, X̄ = the mean
∑ = the sum of
X = any score
N = the number of scores

The Median
In some circumstances, the mean may not be the best indicator of student performance. If there are
one or a few students who score considerably lower (or higher) than the other students, their scores
tend to pull the mean in their direction. In this case the median is usually considered a better
indicator of student performance. There are also some types of scores that are reported for
standardized tests for which the mean is not appropriate (percentile scores), so the median is used.
The median is a counting average. It is the number that divides a distribution of scores exactly in half.
It is determined by arranging the scores in order of size and counting up to (or down to) the midpoint
of the set scores. The median will usually be around where most scores fall. When the number of
scores is odd, the median is the middle score. If the number of scores is even, the median will be

halfway between the two middle most scores. In this case the median is not an actual score earned by
one of the students.

Example 1 Example 2 Example 3 Example 4


Scores Scores Scores Scores
50 50 49 50
48 49 48 49
48 48 48 47
47 46 47 47
45 46 45 45
44 43 44 45
43 43 43 45
42 42 42 44
42 41 42 42
41 41 41 41
38 41

In example 1, our line would be between 44 and 45, so the median would be halfway between them at
44.5. In this case the median is not an actual score earned by one of the students. In example 2, the
distance between the two middle scores (43 and 46) is more than one, so we again find the point
halfway between them for our median of 44.5. If the number of students is uneven, the median is the
one score that is the middle score in the frequency distribution, having equal numbers of scores above
and below it. Thus, the median is 44 in example 3, and 45 in example 4. It does not matter if more than
one student earns that score, as in example 4.

The Mode
This is the score (or scores) that occur most frequently and is determined by inspection. It is the least
reliable type of statistical average and is frequently used merely as a preliminary estimate of central
tendency. A set of scores may sometimes have two or more modes; such sets are called bimodal or
multimodal, respectively.
If the data is categorical (measured on the nominal scale), then only the mode can be calculated. The
mode can also be calculated with ordinal and higher data, but it often is not appropriate. If other
measures can be calculated, the mode would never be the first choice. For example, the following test
scores, 7, 7, 7, 20, 23, 23, 24, 25, 26 have a mode of 7, but obviously it doesn’t make much sense.

Remember, measures of central tendency look for the one number which best describes all of the
numbers.
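For what it is worth, Python's standard library computes all three averages directly
(statistics.multimode requires Python 3.8 or later). A quick check using the Math class scores from the
mean example above:

import statistics

scores = [82, 93, 86, 97, 82]
print(statistics.mean(scores))       # 88
print(statistics.median(scores))     # 86 (sorted: 82, 82, 86, 93, 97)
print(statistics.multimode(scores))  # [82], the most frequent score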

Shape of Distributions: Skewness


There is one important situation in which all three measures of central tendency are identical. This
occurs when a distribution is symmetrical, that is, when the right half of the distribution is the mirror
image of the left half. In this case the mean falls exactly at the middle of the distribution (the median
position), and the value at this central point is the most frequently observed data value, the mode.
Thus, in a symmetrical distribution, the values of the mean, the median and the mode are identical.

Figure 1: Shape of distribution of scores

To the extent that differences are observed among these three measures, the distribution is
asymmetrical, or “skewed”. Skewed distributions include positively skewed and negatively skewed
distributions. In a positively skewed distribution (see Figure 1 above), most of the scores concentrate
at the low end of the distribution. This might occur, for example, if the test was extremely difficult for
the students. In a negatively skewed distribution, as shown in Figure 1 above, the majority of scores
are toward the high end of the distribution. This could occur if we gave a test that was easy for most of
the students.
Points to note
 With perfectly bell shaped distributions, the mean, median, and mode are identical.
 With positively skewed data, the mode is lowest, followed by the median and mean.
 With negatively skewed data, the mean is lowest, followed by the median and mode.
4.2.2 Measures of Variability/Dispersion
The measures of central tendency focus on what is typical, average or in the middle of a distribution.
The information provided by these measures is not sufficient to convey all we need to know about a
distribution. Knowing the mean, the median or the mode (or all of these) of a distribution does not
allow us to differentiate between distributions. We need additional information about the distributions.
A set of scores can be more adequately described if we know how much they spread out above and
below the measure of central tendency. For example, we might have two groups of students with a
mean score of 70, but in one group the span of scores is from 60 to 80 and in the other group the span is
from 50 to 100. These represent quite different spreads of performance. We can identify such
differences by numbers that indicate how much scores spread out in a group. These are called measures
of variability or dispersion. The three most commonly used measures of variability are the range, the
quartile deviation, and the standard deviation.
The Range
It is the simplest and crudest measure of variability calculated by subtracting the lowest score from the
highest score. For example, if the score of 10 students in a certain test is: 5, 7, 8, 10, 12, 13, 14, 15, 17,
19, then the range will be 19 -5 = 14. The range provides a quick estimate of variability but is
undependable because it is based on the position of the two extreme scores. The addition or
subtraction of a single score can change the range significantly.
Inter-quartile range
The inter-quartile range (IQR) is another range measure, but one that looks at the data in terms of
quarters, or percentiles. The IQR is the distance between the 25th and 75th percentiles, or the first and
third quartiles. The range of the data is divided into four equal parts (25% each), and the IQR is the
range of the middle 50% of the data. Because it uses only the middle 50%, it is not affected by outliers
or extreme values. The IQR is therefore often used with skewed data, as it is insensitive to extreme
scores.
The Standard Deviation
Let us say that two classes took a quiz. There were 10 students in each class, and each class had an
average score of 81.5. Since the averages are the same, can we assume that the students in both classes
have the same performance on the exam?
The answer is… No. The average (mean) does not tell us anything about the distribution or variation in
the grades. So, we need to come up with some way of measuring not just the average, but also the
spread of the distribution of our data. The most useful measure of variability, or spread of scores, is the
standard deviation. It is essentially an average of the degree to which a set of scores deviates from the
mean. If the Standard Deviation is large, it means the numbers are spread out from their mean.
If the Standard Deviation is small, it means the numbers are close to their mean. Because it takes into
account the amount that each score deviates from the mean, it is a more stable measure of variability
than either the range or quartile deviation.
The procedure for calculating a standard deviation involves the following steps:
1. Compute the mean.
2. Subtract the mean from each individual’s score.
3. Square each of these deviations.
4. Find the sum of the squared deviations, ∑(X − X̄)².
5. Divide the sum obtained in step 4 by N, the number of students, to get the variance.
6. Find the square root of the result of step 5. This number is the standard deviation (SD) of the
scores.

Thus the formula for the standard deviation (SD) is: SD = √( ∑(X − X̄)² / N )
Now let us take the previous scenario of the two groups of students who took a Math quiz with a mean
score of 81.5, and calculate and compare their standard deviations. The individual scores of group A
are: 72, 76, 80, 80, 81, 83, 84, 85, 85, and 89. The individual scores of group B are: 57, 63, 65, 71, 83,
93, 94, 95, 96, 98. Let us start with group A. The first step in finding the standard deviation is to find
all the distances from the mean. This is followed by squaring each distance, which gives us the
following results.
Scores of Group A          Distances from the Mean          Distances squared
72 - 9.5 90.25
76 - 5.5 30.25
80 - 1.5 2.25
80 - 1.5 2.25
81 - 0.5 0.25
83 1.5 2.25
84 2.5 6.25
85 3.5 12.25
85 3.5 12.25
89 7.5 56.25
Then we add up all of the squared distances, which gives us 214.5. This is then divided by the total
number of scores in the group: 214.5 / 10 = 21.45. This is the variance of the data set.
Variance is the average squared deviation from the mean of a set of data. It is used to find the
standard deviation. Finally, we calculate the square root of the variance. This gives us 4.63,
which is the standard deviation.
The standard deviation, like other measures of variability, represents a distance. If we move the
distance equal to one SD above and below the mean, we will find that somewhere between 60% and
75% of the scores fall in that region of most distributions of scores. In a normal distribution, 68% of the
scores are included between the mean minus one SD and the mean plus one SD.
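The six-step procedure maps directly onto a few lines of Python. The sketch below reproduces the
Group A computation (the population formula, dividing by N):

import math

def std_dev(scores):
    mean = sum(scores) / len(scores)               # step 1
    squared = [(x - mean) ** 2 for x in scores]    # steps 2 and 3
    variance = sum(squared) / len(scores)          # steps 4 and 5
    return math.sqrt(variance)                     # step 6

group_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
print(round(std_dev(group_a), 2))                  # 4.63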
Which measure of dispersion to use?
The quartile deviation is used with the median and is satisfactory for analyzing a small number of
scores. Because these scores are obtained by counting and thus are not affected by the value of each
score, they are especially useful when one or more scores deviate markedly from the others in the set.
The standard deviation is used with the mean. It is the most reliable measure of variability, and is
especially useful in testing. In addition to describing the spread of scores in a group, it serves as a basis
for computing standard scores, the standard error of measurement, and other statistics used in analyzing
and interpreting test scores.

4.2.3. Measures of Relative Position


Percentiles
A percentile is a score that indicates the rank of the student compared to others (same age or same
grade), using a hypothetical group of 100 students. . It tells you what percentage of people you did
better than. A percentile of 25 (25 th percentile), for example, indicates that the student's test
performance equals or exceeds 25 out of 100 students on the same measure. A percentile of 87
indicates that the student equals or surpasses 87 out of 100 (or 87% of) students. A percentile must
always refer to a student’s percentile rank as relative to a particular norm group. If you scored at the
80th percentile, what does that mean?
Converting Data Value to Percentile
1. Arrange the data in ascending order
2. Count how many values are below your value. If, for example, your score is 85 and there are
multiple 85’s, then count how many scores fall below the first 85.
For example, in the students’ scores 76, 77, 80, 83, 85, 85, 85, 90, 96, 97 there are 4 values below 85.

Percentile = ((number of items below your data value + 0.5) / total number of values) × 100
So in our example: Percentile = ((4 + 0.5) / 10) × 100 = 45, i.e. the 45th percentile.
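The conversion can likewise be sketched in Python. This is an illustrative addition; the function name percentile_rank is our own.

def percentile_rank(scores, value):
    ordered = sorted(scores)                        # step 1: ascending order
    below = sum(1 for s in ordered if s < value)    # step 2: items below the value
    return (below + 0.5) / len(ordered) * 100       # formula above

data = [76, 77, 80, 83, 85, 85, 85, 90, 96, 97]
print(percentile_rank(data, 85))                    # 45.0, i.e. the 45th percentile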
Quartiles
Quartile is another term used in connection with percentiles. The total of 100% is broken into four equal
parts: 25%, 50%, 75%, and 100%.
 Lower Quartile is the 25th percentile. (0.25)
 Median Quartile is the 50th percentile. (0.50)
 Upper Quartile is the 75th percentile. (0.75)
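As an illustrative aside, Python's standard library can compute the three quartile cut points directly. Note that several conventions exist for locating quartiles, so other methods or tools may give slightly different values for the same data.

import statistics

data = [76, 77, 80, 83, 85, 85, 85, 90, 96, 97]
q1, median, q3 = statistics.quantiles(data, n=4)   # 25th, 50th, 75th percentile cut points
print(q1, median, q3)                              # 79.25 85.0 91.5 with this method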

Standard Scores
Another method of indicating a pupil's relative position in a group is by showing how far the raw score
is above or below the average. This is the approach used with standard scores. Basically, standard scores
express test performance in terms of standard-deviation units from the mean; that is, they are scores
based on the mean and the standard deviation.
Types of standard scores
Z-score: For data distributions that are approximately symmetric, a commonly used measure of relative
position is the z-score. A z-score tells us how many standard deviations a particular score lies from
the mean.
We define the z-score as z = (X – X̄) / s,
where X = the data value in question,
X̄ = the sample mean, and
s = the sample standard deviation.
For instance, if a person scored a 70 on a test with a mean of 50 and a standard deviation of 10, then
they scored 2 standard deviations above the mean. So, a z score of 2 means the original score was 2
standard deviations above the mean.
If the z-score is 0, your data value equals the mean.
If the z-score is positive (> 0), your data value is above the mean.
If the z-score is negative (< 0), your data value is below the mean.

Example: Almaz scored a 25 on her math test. Suppose the mean for this exam is 21, with a standard
deviation of 4. Dawit scored 60 on an English test which had a mean of 50 with a standard deviation of
5. Who did relatively better?
Since standardized tests typically have score distributions which are approximately symmetric, we will
find the respective z-scores for Almaz and Dawit.

Almaz's z-score: (25 – 21) / 4 = 1
Dawit's z-score: (60 – 50) / 5 = 2
Since Dawit had a higher z-score, we say Dawit did relatively better.
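The comparison can be checked with a small Python sketch (an illustrative addition; the function name z_score is our own):

def z_score(x, mean, sd):
    return (x - mean) / sd      # standard deviations above (+) or below (-) the mean

almaz = z_score(25, 21, 4)      # 1.0
dawit = z_score(60, 50, 5)      # 2.0
print(almaz, dawit)             # the higher z-score (Dawit's) means he did relatively better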
T Scores: This refers to any set of normally distributed standard scores that has a mean score of 50 and
a standard deviation of 10. The T – score is obtained by multiplying the Z-score by 10 and adding the
product to 50. That is, T – Score = 50 + 10(z). A score of 60 is one standard deviation above the mean,
while a score of 30 is two standard deviations below the mean.
Example
A test has a mean score of 40 and a standard deviation of 4. What are the T – scores of two test takers
who obtained raw scores of 30 and 45 respectively in the test?
Solution
The first step in finding the T-scores is to obtain the z-scores for the test takers; the z-scores are
then converted to T-scores.
For the test taker with a raw score of 30: z = (X – M) / SD, where X = 30, M = 40, and SD = 4.
Thus, z = (30 – 40) / 4 = –10 / 4 = –2.5
T-score = 50 + 10(z) = 50 + 10(–2.5) = 50 – 25 = 25
For the test taker with a raw score of 45: z = (45 – 40) / 4 = 1.25
T-score = 50 + 10(1.25) = 50 + 12.5 = 62.5
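The raw-score-to-z-to-T conversion can also be sketched in a few lines of Python (an illustrative addition; the function name t_score is our own):

def t_score(x, mean, sd):
    z = (x - mean) / sd         # convert the raw score to a z-score first
    return 50 + 10 * z          # T = 50 + 10(z)

print(t_score(30, 40, 4))       # 25.0
print(t_score(45, 40, 4))       # 62.5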

4.2.4 Measures of Relationship


If we have two sets of scores from the same group of people, it is often desirable to know the degree to
which the scores are related. For example, we may be interested in the relationship between students'
test scores in English and their overall scores in other subjects. The degree of
relationship is expressed in terms of coefficient of correlation. The value ranges from -1.00 to +1.00. A
perfect positive correlation is indicated by a coefficient of +1.00 and a perfect negative correlation by a
coefficient of -1.00. A correlation of .00 indicates no relationship between the two sets of scores.
Obviously, the larger the coefficient (positive or negative), the higher the degree of relationship
expressed.
There are several different measures of relationship expressed as correlation coefficients. One of these
is the product-moment correlation coefficient, which is by far the most commonly used and most useful
correlation coefficient. It is indicated by the symbol r.

The formula for obtaining the coefficient of correlation is:
r = Σ(X – X̄)(Y – Ȳ) / (N · Sx · Sy)
Where, X = score of a person on one variable
Y = score of the same person on the other variable
X̄ = mean of the X distribution
Ȳ = mean of the Y distribution
Sx = standard deviation of the X scores
Sy = standard deviation of the Y scores
N = number of pairs of scores
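As an illustrative sketch (not part of the original module), the formula can be implemented in Python using population standard deviations, exactly as defined above; the data below are hypothetical scores invented for the demonstration.

import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)   # SD of the X scores
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)   # SD of the Y scores
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n * sx * sy)

english = [72, 76, 80, 84, 88]      # hypothetical English scores
overall = [70, 75, 82, 85, 90]      # hypothetical overall scores
print(round(pearson_r(english, overall), 3))   # about 0.994: a strong positive relationship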

UNIT 5: Ethical Standards of Assessment


5.2.1 Ethical and Professional Standards of Assessment and its Use
Ethical standards guide teachers in fulfilling their obligation to provide and use tests that are fair to all
test takers regardless of age, gender, disability, ethnicity, religion, linguistic background, or other
personal characteristics.
Fairness is a primary consideration in all aspects of testing. It:
 Helps to ensure that all test takers are given a comparable opportunity to demonstrate what
they know and how they can perform in the area being tested.
 Implies that every test taker has the opportunity to prepare for the test and is informed about
the general nature and content of the test.
 Also extends to the accurate reporting of individual and group test results.
The following are some ethical standards that teachers may consider in their assessment practices.
1. Teachers should be skilled in choosing assessment methods appropriate for instructional decisions.
Skills in choosing appropriate, useful, administratively convenient, technically adequate, and fair
assessment methods are prerequisite to good use of information to support instructional decisions.
Teachers need to be well-acquainted with the kinds of information provided by a broad range of
assessment alternatives and their strengths and weaknesses. In particular, they should be familiar
with criteria for evaluating and selecting assessment methods in light of instructional plans.
2. Teachers should develop tests that meet the intended purpose and that are appropriate for the
intended test takers. This requires teachers to:
 Define the purpose for testing, the content and skills to be tested, and the intended test takers.
 Develop tests that are appropriate in content, skills tested, and content coverage for the
intended purpose of testing.
 Develop tests that have clear, accurate, and complete information.
 Develop tests with appropriately modified forms or administration procedures for test takers with
disabilities who need special accommodations.
3. The teacher should be skilled in administering, scoring and interpreting the results from diverse
assessment methods. It is not enough that teachers are able to select and develop good assessment
methods; they must also be able to apply them properly. This requires teachers to:
 Follow established procedures for administering tests in a standardized manner.
 Provide and document appropriate procedures for test takers with disabilities who need
special accommodations or those with diverse linguistic backgrounds.
 Protect the security of test materials, including eliminating opportunities for test takers to
obtain scores by fraudulent means.
 Develop and implement procedures for ensuring the confidentiality of scores.

4. Teachers should be skilled in using assessment results when making decisions about individual
students, planning teaching, developing curriculum, and school improvement. Assessment results
are used to make educational decisions at several levels: in the classroom about students, in the
community about a school and a school district, and in society, generally, about the purposes and
outcomes of the educational enterprise. Teachers play a vital role when participating in decision-
making at each of these levels and must be able to use assessment results effectively.
5. Teachers should be skilled in developing valid pupil grading procedures which use pupil
assessments. Grading students is an important part of professional practice for teachers. Grading is
defined as indicating both a student's level of performance and a teacher's valuing of that
performance. The principles for using assessments to obtain valid grades are known and teachers
should employ them.
6. Teachers should be skilled in communicating assessment results to students, parents, other lay
audiences, and other educators. Teachers must routinely report assessment results to students and to
parents or guardians. In addition, they are frequently asked to report or to discuss assessment results
with other educators and with diverse lay audiences. If the results are not communicated
effectively, they may be misused or not used. To communicate effectively with others on matters of
student assessment, teachers must be able to use assessment terminology appropriately and must be
able to articulate the meaning, limitations, and implications of assessment results. Furthermore,
teachers will sometimes be in a position that will require them to defend their own assessment
procedures and their interpretations of them. At other times, teachers may need to help the public to
interpret assessment results appropriately.
7. Teachers should be skilled in recognizing unethical, illegal, and otherwise inappropriate assessment
methods and uses of assessment information. Fairness, the rights of all concerned, and professional
ethical behavior must undergird all student assessment activities, from the initial planning for and
gathering of information to the interpretation, use, and communication of the results. Teachers must
be well-versed in their own ethical and legal responsibilities in assessment. In addition, they should
also attempt to have the inappropriate assessment practices of others discontinued whenever they
are encountered. Teachers should also participate with the wider educational community in defining
the limits of appropriate professional behavior in assessment.
In addition, the following are principles of grading that can guide the development of a grading system.
1. The system of grading should be clear and understandable (to parents, other stakeholders, and
most especially students).
2. The system of grading should be communicated to all stakeholders (e.g., students, parents,
administrators).
3. Grading should be fair for all students regardless of gender, socioeconomic status or any other
personal characteristics.
4. Grading should support, enhance, and inform the instructional process.

5.2.2. Ethnicity and Culture in Tests and Assessments


In the previous section you have learned that fairness is the fundamental principle that has to be
followed in teachers’ assessment practices. It has been said that all students have to be provided with
equal opportunity to demonstrate the skills and knowledge being assessed. Fairness is fundamentally a
socio-cultural, rather than a technical, issue. Thus, in this section we are going to see how culture and
ethnicity may influence teachers’ assessment practices and what precautions we have to take in order
to avoid bias and be accommodating to students from all cultural groups.
Students represent a variety of cultural and linguistic backgrounds. If the cultural and linguistic
backgrounds are ignored, students may become alienated or disengaged from the learning and
assessment process. Teachers need to be aware of how such backgrounds may influence student
performance and the potential impact on learning. Teachers should be ready to provide
accommodations where needed.
Classroom assessment practices should be sensitive to the cultural and linguistic diversity of students in
order to obtain accurate information about their learning. Assessment practices that attend to issues of
cultural diversity include those that
 Acknowledge students’ cultural backgrounds.
 Are sensitive to those aspects of an assessment that may hamper students’ ability to demonstrate
their knowledge and understanding.
 Use that knowledge to adjust or scaffold assessment practices if necessary.
Assessment practices that attend to issues of linguistic diversity include those that
 Acknowledge students’ differing linguistic abilities.
 Use that knowledge to adjust or scaffold assessment practices if necessary.
 Use assessment practices in which the language demands do not unfairly prevent the students
from understanding what is expected of them.
 Use assessment practices that allow students to accurately demonstrate their understanding by
responding in ways that accommodate their linguistic abilities, if the response method is not
relevant to the concept being assessed (e.g., allow a student to respond orally rather than in
writing).
Teachers must make every effort to address and minimize the effect of bias in classroom assessment
practices. Bias occurs when irrelevant or arbitrary factors systematically influence interpretations and
results in ways that affect the performance of an individual student or a subgroup of students. For
example, bias may occur when variables—such as cultural and language differences and
socioeconomic status—are not fairly accounted for when interpreting results from an assessment.
Assessment should be culturally and linguistically appropriate, fair and bias-free. It may not be possible
to totally eliminate all forms of bias from classroom assessments. However, teachers and others who
assess students’ learning should recognize that bias is an ever-present concern in student assessment
and should remain vigilant about its sources, including having plans for identifying and addressing bias.
For an assessment task to be fair, its content, context, and performance expectations should:
 reflect knowledge, values, and experiences that are equally familiar and appropriate to all
students;
 tap knowledge and skills that all students have had adequate time to acquire;
 Be as free as possible of cultural and ethnic stereotypes.

5.2.3. Disability and Assessment Practices


It is quite obvious that our education system has been exclusionary, failing to fully accommodate the
educational needs of disabled students. This has been true not only in our country but in the rest of the world as
well, although the magnitude might differ from country to country. It was in response to this situation
that UNESCO has been promoting the principle of inclusive education to guide the educational policies
and practice of all governments. Different world conventions were held and documents signed towards
the implementation of inclusive education. Our country, Ethiopia, has been a signatory of these
documents and therefore has accepted inclusive education as a basic principle to guide its policy and
practice in relation to the education of disabled students.
Inclusive education is based on the idea that all students, including those with disabilities, should be
provided with the best possible education to develop themselves. This calls for the provision of all
possible accommodations to address the educational needs of disabled students. Accommodations
should not be limited to the teaching and learning process; they should also extend to assessment
mechanisms and procedures.
There are different strategies that can be considered to make assessment practices accessible to students
with disabilities depending on the type of disability. In general terms, however, the following strategies
could be considered in summative assessments:
 Modifying assessments: - This should enable disabled students to have full access to the
assessment without giving them any unfair advantage.
 Others’ support: - Disabled students may need the support of others in certain assessment
activities which they cannot do independently. For instance, they may require readers and
scribes in written exams; they may also need others’ assistance in practical activities, such as
using equipment, locating materials, drawing, and measuring.
 Time allowances: - Disabled students should be given additional time to complete their
assessments; how much extra time is appropriate should be decided by the individual instructor
based on the purpose and nature of the assessment.
 Rest breaks: Some students may need rest breaks during the examination. This may be to
relieve pain or to attend to personal needs.
 Flexible schedules: In some cases disabled students may require flexibility in the scheduling of
examinations. For example, some students may find it difficult to manage a number of
examinations in quick succession and need to have examinations scheduled over a period of
days.
 Alternative methods of assessment: - In certain situations where formal methods of assessment
may not be appropriate for disabled students, the instructor should assess them using informal
methods such as class work, portfolios, oral presentations, etc.
 Assistive Technology: Specific equipment may need to be available to the student in an
examination. Such arrangements often include the use of personal computers, voice activated
software and screen readers.

5.2.4 Gender Issues in Assessment


Teachers’ assessment practices can also be affected by gender stereotypes. The issues of gender bias
and fairness in assessment are concerned with differences in opportunities for boys and girls. A test is
biased if boys and girls with the same ability levels tend to obtain different scores.
Test questions should be checked for:
 material or references that may be offensive to members of one gender,
 references to objects and ideas that are likely to be more familiar to men or to women,
 Unequal representation of men and women as actors in test items or representation of members
of each gender only in stereotyped roles.
If the questions involve objects and ideas that are more familiar or less offensive to members of one
gender, then the test may be easier for individuals of that gender. Standards for achievement on such a
test may be unfair to individuals of the gender that is less familiar with or more offended by the objects
and ideas discussed, because it may be more difficult for such individuals to demonstrate their abilities
or their knowledge of the material.
