
EPSC 228/311/0228

MEASUREMENT AND EVALUATION


LECTURE NOTES-REVISED MARCH 2021

Prof. Ezra Maritim


RECORD OF CLASS REPRESENTATIVES


ACADEMIC YEAR: ............  SEMESTER: ............
Name: ............  Class: ............
Mobile: ............  Venue: ............
Time 1: ............  Time 2: ............

(This block is repeated for five class representatives.)


RECORD OF LECTURES COVERED


S/N | TOPIC COVERED | DATE | NEXT TOPIC


TABLE OF CONTENTS
LECTURE 1: INTRODUCTION
LECTURE 2: MEASUREMENT CONCEPTS
LECTURE 3: THE USES OF TESTS
LECTURE 4: EXAMINATION SYSTEM IN KENYA
LECTURE 5: EVALUATION AND PERFORMANCE STANDARDS
LECTURE 6: TESTING CODE OF ETHICS
LECTURE 7: PSYCHOMETRIC CHARACTERISTICS OF A GOOD TEST
LECTURE 8: BLOOM’S DOMAINS OF LEARNING
LECTURE 9: LEARNING OBJECTIVES AND LEARNING OUTCOMES
LECTURE 10: TEST PLANNING
LECTURE 11: TEST SPECIFICATIONS AND TEST CONSTRUCTION
LECTURE 12: ITEM ANALYSIS
LECTURE 13: DEFICIENCIES IN TEACHER-MADE TESTS
LECTURE 14: TEST ADMINISTRATION, SCORING AND INTERPRETATION
LECTURE 15: STATISTICAL ANALYSIS OF TEST SCORES


LECTURE 1: INTRODUCTION

Welcome to lecture 1

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Discuss how testing developed in China, Europe and America, and its relationship to the African context.
 Explain the concept of education with special reference to the United Nations, the World Trade Organization and the Kenyan Constitution.

BRIEF HISTORY OF TESTING AND MEASUREMENT


The use of tests for identifying potential employees and for the selection, readiness and placement of professional trainees is a long-established practice in human history.

African Context
Did the pre-modern African society have tests? Yes. Wrestling, jumping, dancing, tooth extraction/removal, and circumcision were among the tests used by women for the choice and selection of suitors, for the show of bravery, and for readiness for the society's defence.

In China, Europe and the USA, the increased use of testing was attributed to three major areas of development: the civil service, the school, and the study of individual differences.

Chinese Context
 Civil-service testing began in China about 3000 years ago when an emperor decided to assess the competency of his officials. It has been reported that by the year 2200 B.C. the emperor of China examined his officials every third year and, after three examinations, he either promoted them or dismissed them from the Civil Service.
 Later, the Chinese government positions were filled by persons who
scored well on examinations that covered topics such as music,
horsemanship, civil law, and writing.
 Such examinations were eliminated in 1905 and were replaced by
formal educational requirements.
European Context
 Students in European schools were given oral examinations until
after the 12th century.
 In the 16th century the Jesuits (a Catholic order/organization whose members vow poverty and obedience) started using tests for the evaluation and placement of their students. By this time, the Jesuits were running schools across Europe. The Jesuits' standardized curriculum and teaching methods became the basis of many education systems today.
 In the 19th century the study of individual differences, led by Sir Francis Galton (1822-1911), began in Great Britain. He was the first experimental psychologist to look into psychological differences (nature and nurture) in sensory and motor skills between people and to apply statistical methods (correlation and regression) to the analysis and quantification of individual differences.
 By 1905, when China was phasing out the use of examinations in the public service, civil-service examinations were being developed in Britain and the United States as a way of selecting applicants for government jobs.
 In 1905 Alfred Binet (1857-1911) developed the first individual test of intelligence as part of his work on the study of individual differences (the ratio later expressed as IQ = M.A/C.A x 100). He was asked by the French government to come up with a test for identifying children who were not benefitting from school instruction.
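As an illustrative aside (my addition, not part of the original notes), the ratio IQ implied by the M.A/C.A formula can be computed as in the short Python sketch below; the ages used are invented for illustration.

# Illustrative sketch: ratio IQ = (Mental Age / Chronological Age) x 100.
mental_age = 10        # invented example: the child performs like a typical 10-year-old
chronological_age = 8  # invented example: the child is actually 8 years old

iq = (mental_age / chronological_age) * 100
print(iq)  # 125.0 -> above 100, because mental age exceeds chronological age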
American Context
 During World War I (1914-1918), the United States Army used tests for the selection of army recruits.

WHAT IS EDUCATION?
Though education is seen as both an investment and a social service to
the community and the nation, it means different things to different
people or organizations.

1. Generic (general) Definition


From a generic perspective, education simply means:
 Transmission of knowledge, skills and attitudes from one person to
another or the transmission of knowledge from a teacher to a pupil.
 The transmission of one‘s culture from generation to generation or
process by which people are prepared to live effectively in their
environment. [Refer to recent Masaai Moran graduation]
Education can therefore be seen as:
 The process of developing the capacities and potentials of the individual so as to prepare that individual to be successful in a specific society or culture. Education in this context is a continuous process which begins at birth and continues throughout life.
 It is a process of bringing about a permanent change in human
behavior.
From these perspectives, education has two dimensions:
(i) The content (what is taught).
(ii) The process / method of imparting knowledge or molding individuals
in society in order to develop their potentials (how it is taught).
This is the process of imparting knowledge e.g. through
imitations, storytelling etc.


Education as a process involves teaching and learning, which may occur in a formal setting as in a school, in non-formal settings as in artisan training centres, or in an informal setting as in families, bush schools and peer groups.
Ref: The hidden curriculum, using the iceberg example: formal education is the visible tip of the iceberg above the waterline; non-formal education and informal education (the hidden curriculum) lie below the waterline.
LEARNING AND TESTING IN THE AFRICAN INDIGENOUS EDUCATION CONTEXT

1. Learning Through Productive Work


Children learn the right type of masculine or feminine roles through
doing and working hand in hand with adults.
i. Boys with their fathers
ii. Girls with their mothers

Their roles were given according to the age of the child –from
simple to complex duties.

The learning process involved:

a) Seeing (Observation)
b) Imitation
2. Learning Through Oral Literature

Though writing was unknown in pre-colonial society, learning took place through:
i. Myths = tales giving imaginary descriptions of things.
ii. Legends = tales of fabricated accounts.
iii. Folk-songs and dances = songs and dances during rites of passage, funerals, etc.

3. Learning Through Ceremonies


i. Rite of passage
ii. During marriage, etc

4. Learning Through Hunting


 Hunting: mockery and real hunting
 Trapping animals

5. Learning Through the Inculcation of Fear


 Discouraging females from eating eggs.
 Denial of food for allowing bulls to fight.
 Being punished for misconduct and carelessness.
6. Learning Through Formal Teaching
 Though there was no structured curriculum, learning through apprenticeship was formal and direct, e.g. a child being sent by the parents to work with craftsmen, potters, blacksmiths, and stool makers. The same was true with the acquisition of hereditary occupations: e.g. a herbalist, in handing over his "secrets" about what medicine to use for which disease and how, would instruct his child from time to time until he became knowledgeable and proficient in its practice.

So long as learning was taking place, testing of the same, though not in writing or by the use of paper and pen, was also in place. Punishment and corrections were indicators of failure to acquire the expected knowledge, skills and values. Praises and selection for leadership positions were indicators of one having passed, or having acquired the expected knowledge, skills and values of the society.

[Source: African Indigenous Education: As Practised by the Acholi of Uganda, by J.P. Ocitti].

2. United Nations Definition


According to the United Nations, education is a common good. It is a common good in the sense that it is beneficial for all members of a given community. Education helps a community to escape poverty and create wealth. Education is a stepping stone through which students/learners can make something better for their future and their communities. It is a human right. This is why countries are encouraged to provide free basic education (primary and secondary education). Comment on the World Bank funding primary rather than higher education, on the grounds that returns on investment are higher in primary education.
The late Nelson Rolihlahla Mandela described education as “the most powerful weapon which you can use to change the world”.
The late Mwalimu Nyerere saw education as a weapon for liberation and
as a weapon to fight poverty and not as a way to escape poverty.

3. World Trade Organization (WTO)


According to the World Trade Organization (WTO), education is a saleable/tradable commodity. It is a commodity, like hamburgers, goats, etc., that is put on the market for sale.

4. Our Constitution
Our Constitution conceptualizes education as a fundamental human right. Article 43(1)(f) states that every person has the right to education. Article 53 states that every child has the right to free and compulsory basic education.

Classification of Education
1. By Levels
Three levels of education in Kenya are:
 Primary   } UNESCO defines these two levels as Basic Education
 Secondary }
 Tertiary (middle level and university level)

2. By Methods
 Formal – School settings
 Non-formal – By attachment
 Informal – Family settings

3. By Programs include:
 Special education
 Science and technology education
 Arts / humanities education
 Vocational education

Economic Rationale for Classifying Education

The classification of modern education by levels, programs and methods enables an investor (parents, ministry, etc.) to:
i. set clear objectives;
ii. make a clear assessment of the investment requirements in terms of:
a. material and human resources, as well as
b. the demand for any particular type or level of education. This is where the assessment or measurement of objectives comes in.

LECTURE 2: MEASUREMENT CONCEPTS

Welcome to lecture 2.

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Explain the concept of measurement.
 List the attributes of scale of measurement.
 Discuss the relationship between the measurement
scales.
 Differentiate between qualitative and quantitative data.

In this course there are certain terms that are frequently used. These
include:

TEST AND TESTING

What is a test?
A test is a device which we can use to sample the candidate's/student's behavior. In this case behavior is performance.

Population → Sample (a test draws only a sample from the whole population of possible questions).


 Testing is always a matter of sampling. We can never ask all the questions we would like to ask in a test.
 The common kind of test that the teacher is used to is the paper
and pencil (paper and pen) test, i.e. that in which the pupil is
required to write or mark his / her answers on a paper. However,
tests may take various other forms. In some cases, the pupil may
indicate his answer orally (oral tests); in others he/she may be
required to carry out certain activities during which he / she is
observed and scored by an observer.
 A classroom test must be in harmony with instructional objectives and subject content. What are the instructional objectives? These are the learning objectives, or what the learner should be able to do as a result of the course. For each topic, the teacher provides what the learners are expected/able to learn by the end of the topic presentation. These learning objectives are stated in action verbs (measurable and observable), e.g. to identify, describe, discuss, explain, demonstrate, analyze, etc. To be sure that these learning objectives are achieved, the preparation of a test should follow a systematic procedure. We will discuss this procedure later in this course.

TYPES OF TESTS
Which is your favorite quotation in the Bible? The Bible is a document with so many tests. Our earthly tests include, but are not limited to:

Achievement Tests (Example: KCPE and KCSE)


Achievement tests measure the current status of learners with respect to proficiency in a given area of knowledge or skill and hence can be used for selection. They are developed by experts. Achievement tests can be:

 Standardized tests (which are designed to cover content common to many classes of the same type of learners).
 Locally-developed tests (designed to measure a particular set of learning outcomes set by a specific teacher). These are also called teacher-made tests, intended by teachers to test the areas they covered in class.

Aptitude Tests
Aptitude tests are measures of potential. They are used to predict how well someone is likely to perform in future. Tests of general aptitude are variously referred to as scholastic aptitude tests, intelligence tests, and tests of general mental ability. Aptitude tests are also available to predict a person's likely level of performance after following some specific future instruction or training.
Aptitude tests are standardized.
Example: When Kenya Science Teachers College was established as the only institution then for training secondary school science teachers, it used an aptitude test to select the trainees.

Personality Tests
Personality tests are used for the diagnosis of behavior problems. They are designed to measure characteristics of individuals along a number of dimensions including:
 Attitudes towards self and others
 Team building
Most of these tests are self-report measures in which an individual is asked to respond to a series of questions or statements, and they are available online.


Standardized Tests (Example: KCPE and KCSE)

A standardized test is one that is developed by experts or subject-matter specialists and administered and interpreted using uniform procedures and performance standards (norm-referenced and criterion-referenced standards). Standardization here implies:

 Uniform setting standards

 Tests developed by subject matter specialists

 Uniform administration standards

 Uniform scoring procedure

Most standardized tests are objective, written tests requiring written


responses.

Locally-Developed Test (Teacher-Made Tests)

The opposite of a standardized test is a non-standardized test. Such tests are usually developed locally for a specific purpose. The tests used by teachers in the classroom are examples of locally developed tests. Such tests may:

 Be as good as a standardized test, but not often;
 Be more practical and more appropriate;
 Be more likely to reflect what was taught in class, to a greater degree than standardized tests.

Assessment

Assessment implies/involves getting information. It is getting information on a learner's ability to demonstrate acquisition and application of the outcomes of a course/program of learning, leading to the award of a qualification or transition to the next grade, e.g. reading ability, writing ability, and numeracy.

Assessors are expected to answer questions such as "Is John making progress?" and "Is Jane a good learner?" In school settings or classroom situations, we get information about the learners through the following modes/techniques/methods:

Formal vs Informal testing

Formative vs Summative testing

Coursework vs Examination

In summary assessment is the systematic collection and


analysis of information to improve student learning.

Evaluation

In terms of measurement, evaluation is defined as the ability to make a judgment on the basis of given information: judging the worthiness of a programme, project or course. It is the use of measurement in making decisions.

Examination

A written, oral or practical assessment, as the case may be, in accordance


with general education policy.

MEASUREMENT AND DATA TYPES


Measurement is part of our lives and daily activities. We measure great
variety of things (e.g. weight, temperature, time, distance etc). We are
also being measured by doctors, teachers, supervisors etc.

What is measurement?


There are two common definitions of measurement, namely:

 Measurement is the act of assigning numbers or symbols to characteristics of objects (people, events, etc.) according to rules.
 Measurement is the process of quantifying the degree to which someone or something possesses a given trait, i.e. a quality, characteristic or feature, e.g. aggression, sociability.

Benefits of measurement
A great advantage of using measurement is that one may quantify and apply the powerful tools of statistics/mathematics to study and describe a number of relationships:
i. The relation between mental ability and achievement. For example, mental ability and achievement can be measured by assigning numbers and calculating an index of the relation between the two variables (the correlation coefficient). This cannot be done through observations or descriptions alone.

Studies show that children who are carried on the back while they are infants are slow in language and emotional development. They have no opportunity to observe the emotional expressions of their mothers, unlike children who are placed in front or carried facing their mothers' faces.


One can compute a correlation coefficient between the number of months the child was carried on the back and the scores achieved on a verbal test (see the hypothetical example below, and the computational sketch after these tables).
No. of Months Carried | Score on Verbal/Emotional Test
10 | 1
9 | 2
8 | 3
7 | 4
6 | 5
5 | 7
4 | 8
3 | 9
2 | 9
1 | 10
ii. The relationship between the quality of food eaten and obesity; or no exercise and obesity; or the number of hours spent sitting watching TV and obesity. Both variables involve quantification or quantifiable measures.

Number of Hours Watching TV per Day | Weight Gain per Month
12 | 10 kg
10 | 9 kg
9 | 8 kg
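The following is a minimal computational sketch (my addition, not part of the original notes) of how the correlation coefficient for the hypothetical back-carrying data in the first table above could be computed in Python; the variable names are illustrative assumptions.

import numpy as np

months_carried = np.array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
verbal_scores = np.array([1, 2, 3, 4, 5, 7, 8, 9, 9, 10])

# np.corrcoef returns a 2 x 2 correlation matrix; the off-diagonal entry
# is the Pearson correlation coefficient r between the two variables.
r = np.corrcoef(months_carried, verbal_scores)[0, 1]
print(round(r, 2))  # strongly negative: the more months carried, the lower the verbal score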

What is Scale of Measurement /Levels of Measurement?


Scales of measurement (sometimes called "levels" of measurement) refer to the units in which a variable is measured: centimeters, seconds, IQ points, etc.
There are four types or basic measurement scales which can be arranged
in order of information provided about the values along the scale. These
measurement scales are:
 Nominal
 Ordinal
 Interval
 Ratio

What are variables?


In measurement we measure variables. What are variables?
 Variables are indicators, attributes or characteristics that take on different numerical values among a sample of people, objects, etc.
 Variables are therefore measurable entities, using specific units and any of the nominal, ordinal, interval and ratio scales.

A variable can be changed or can by itself cause change. That which can be changed is called the dependent variable (e.g. performance in an examination). The one that causes change is called the independent variable (e.g. school type, parental income, parental level of education). The one that intervenes or comes between the IV and the DV is called the intervening variable [e.g. home chores].
Examples of variables are age (5 yrs, 15 yrs, 30 yrs, 60 yrs, etc.), sex (male or female), scores (20, 40, 50, etc.), weight (45 kg, 65 kg, etc.), income (high, medium, low), and educational level (primary, secondary, university).

Independent Variable (e.g. school type) → Intervening Variable (e.g. home chores) → Dependent Variable (e.g. school performance)
We measure these variables by assigning numbers.

Nominal Scale
Which marriage option do you prefer? (Tick one choice).
1. Church wedding 27%
2. Civil wedding 4%
3. Traditional wedding 37%
4. Come-we-stay 33%
Source: Daily Nation, Saturday, July 8, 2017, p.3
 Nominal scales are the simplest form of measurement.
 A nominal scale entails the assignment of numbers or labels to
objects or classes of objects. In other words, numbers are used to
substitute for names.


 This scale involves identification, classification or categorization based on one or more distinguishing characteristics, where all objects must be placed into mutually exclusive and exhaustive categories.

Examples of Nominal
- We classify and label people objects and places according to:
- Sex: male vs female and can be assigned numbers 1 = male, 2 =
female.
- Place of residence: Nakuru or Nairobi– assign Nakuru =1, Nairobi
=2
- Jobs: teacher, lawyer, driver, nurse; 1 = teacher, 2= lawyer, 3 =
driver, 4 = nurse.
- Size: 1 = big, 2 = medium, 3=small
- Colour: 1 = Red, 2 = Black, 3 = Yellow.
- Shape: 1=round; 2=circular, 3=triangular, 4=rectangular
- Numerals on sports uniforms: The player represented by 45 is not "better" or "more than" the player represented by 32.
Use of symbols:
A=Nakuru
B=Nairobi
C= Kisumu

M= Male
F= Female


-Political parties: KANU, NASA, JUBILEE, KADU, SDP, KENDA


 A nominal scale is the simplest kind of scale because its rule for
assigning numbers (or other labels) to objects or events is the
simplest. The rule is that objects or events of the same kind get
the same number and objects or events of a different kind get a
different number.
 People sometimes think that a nominal scale is too primitive to be considered a proper scale.
 It has no "zero value".

Ordinal Scale
 Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort, birth order, etc.
 Like nominal, ordinal scale permits classification. However, in
addition to classification, rank-ordering on some characteristics is
also permissible with ordinal scales.
 It ranks objects or events in order of their magnitude.
 Nominal + Rank ordering = ordinal scale
 No absolute zero point
 Measures quality rather than quantity.

Examples of Ordinal Scale


How do you feel today?
- 1. Very unhappy.
- 2. Unhappy
- 3. Ok.
- 4. Happy
- 5. Very happy
How much difference is there between very unhappy and unhappy?
UNKNOWN. We cannot quantify.


How satisfied were you with EPSC 223?


- 1. Very unsatisfied
- 2. Somewhat unsatisfied
- 3. Neutral
- 4. Somewhat satisfied
- 5. Very satisfied.
How proud are you to be a Kenyan?
1. Very proud.
2. Quite proud
3. Proud
4. Somehow proud
5. Not proud at all.

Can abortion be justified?


1. Always justified.
2. Sometimes justified.
3. Never justified.


Age (when measured using the value categories):


- Elderly
- Middle aged
- Young adult
- Adolescent
- Pre-adolescent
- Child.
Educational attainment (when measured by the highest degree):
- University degree
- Diploma
- Secondary certificate
- Primary certificate
- Illiterate.
Social class standing (when measured) by
- Rich
- Poor.
Self-esteem
Students' self-esteem may be scored along an ordinal scale as:
- low,
- moderate, or
- high.

Ranking textbooks (when measured by clarity of writing style):


Students may rank – order textbooks with respect to clarity of writing
style.
 Textbook (Nominal) + Rank order
 Book 1
 Book 2
 Book 3
 Book 4
 Book 5


 If for example, students were asked to judge 5 textbooks on clarity


of writing style, they might assign:

 The best textbook 1


 The second textbook 2
 The third textbook 3
 The fourth textbook 4
 The least favourite 5
You cannot say that textbook 1 is 5 times clearer than textbook 5. (Similarly, in ranking people by kindness, the kindest person is assigned 1 and the next kindest person is assigned 2.)
 A Likert scale is strictly an ordinal scale, although in practice it is often treated as an interval scale. E.g.:
 Strongly agree
 Agree
 Fairly agree
 Disagree
 Strongly disagree

 Another characteristic of ordinal scales is that they have no


absolute zero point. When using ordinal scales, no implications
are made about how much greater one ranking is than another.
 Intelligence, aptitude and aggression test scores are basically
ordinal. They do not indicate the amount of intelligence, aptitude
and personality traits of individuals, but rather the rank-order
position of individuals.

Interval Scale
The interval scale meets all the criteria for ordinal-level measurement and one additional one: the exact distances between categories of the variable are known and are equal to each other. The distance between points on the scale is fixed.


 An interval scale contains equal intervals between numbers. "Equal interval" means that the distance between the things represented by 2 and 3 is the same as the distance between the things represented by 3 and 4.
 The distances between each interval on the scale are equal.

Examples of Interval Scale


i. Measure of temperature
ii. Calendar Year

 Two points next to each other on the scale, no matter whether they are high or low, are separated by the same distance.
i. Distance between 0°C and 10°C = 10°C
ii. Distance between 90°C and 100°C = 10°C
Compare 10°C and 100°C: this does not mean that 100°C is 10 times hotter than something measuring 10°C. This is because there is no absolute zero; the zero is arbitrary.

Why an arbitrary zero?

The zero value is taken as the point at which water freezes and the 100°C value as the point at which water begins to boil, and between these extreme values the scale is divided into 100 equal divisions.

0°C |———— 100 equal divisions ————| 100°C

 Nominal + ordinal + equal intervals = Interval scale. Each unit on the scale is exactly equal to any other unit on the scale. We know the exact difference between the values.


 As with ordinal scales, interval scales contain no absolute zero point, only an arbitrary zero.
 This is a numeric scale.
10°C + 10°C = 20°C, BUT 20°C is not twice as hot as 10°C.

The difference between 20°C and 30°C is the same as the difference between 30°C and 40°C. Each interval is 10°C.
 While the difference between 20°C and 30°C and between 30°C and 40°C is the same, we cannot say that 40°C is twice as hot as 20°C.
 The scale allows us to interpret how warm or cold it is on a given day:
- Monday = 75°F
- Tuesday = 65°F
Monday was warmer than Tuesday.
 An interval scale has a zero point that does not indicate the absence of the quality. We say it has an arbitrary zero, not an absolute zero. That is, 0°C is not indicative of the complete absence of temperature; 0°C does not mean that temperature does not exist at that point. Example: the Kenyan athlete whose legs were amputated after they were frozen in the snow (see the case study below).
 Fallacy: "80°C is twice as hot as 40°C." The statement is fallacious because the zero point on the Celsius scale has been arbitrarily set at the freezing point of water (a small numerical sketch of this point follows).
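As an added numerical sketch (my addition, with assumed figures, not part of the original notes), the fallacy can be checked in Python by converting the Celsius readings to the Kelvin scale, which does have an absolute zero:

# Why "80 °C is twice as hot as 40 °C" is a fallacy: Celsius has an arbitrary zero,
# while the Kelvin scale has an absolute zero (0 K = -273.15 °C).

def celsius_to_kelvin(c):
    return c + 273.15

print(80 / 40)                                        # 2.0 -> looks like "twice as hot"
print(celsius_to_kelvin(80) / celsius_to_kelvin(40))  # about 1.13 -> the true ratio on an absolute scale

# Differences, however, are meaningful on an interval scale:
print(80 - 40)                                        # 40-degree difference in Celsius
print(celsius_to_kelvin(80) - celsius_to_kelvin(40))  # the same 40-degree difference in kelvins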
The case study of Kenyan athlete Cheseto's amputated legs



On one freezing November weekend in 2011, Marko Cheseto, having finished his day's classes, decided to go for a run at around 7 p.m.

He set off on a familiar trail, and, initially all was well as he jogged into
the woods. He was not wearing a jacket or gloves to beat the chilling cold.
What happened next remains a mystery to date.

"I remember the first 10 minutes of my run and can't recall anything else," he told Lifestyle.

It was soon all over the news that Cheseto had gone missing for three
days.

"Missing runner sparks massive search", the Anchorage Daily News headline screamed as Alaska police mounted a huge search aided by the Alaska Mountain Rescue Group's helicopters, dogs, and volunteers, who included Cheseto's Kenyan colleagues and fellow students.

But after 50 hours of trawling through 15 inches of snow in sub-zero temperatures, police called off the search. Cheseto's colleagues on campus were devastated. They cried. Cheseto said:

"I gained consciousness at night covered deep in the snow and I did not know where I was," Cheseto, already a household name in Alaska for his running exploits, recalls. "I tried to get up but couldn't because my legs were frozen... I tried to stretch but could not."

(b) Calendar Year

Calendar years are another example of interval measurement. An arbitrary starting point (year 0, or year 1 in calendars with no year zero) was assigned at the birth of Christ, and time before this is given the prefix BC.


3000 BC ............ 0 (when Christ was born: the starting counting point) ............ 2019 AD

AD (Latin: Anno Domini = in the year of our Lord).

Other than these examples of temperature and calendar years, interval measurements are rare.

Ratio Scale
 Ratio scale has all the properties of nominal, ordinal and interval
measurement + true zero point.
 Ratio = Nominal + Ordinal + Interval + Zero or absolute
Point.
 The ratio scale has a true or absolute zero point. A zero point means that none of the attribute measured is present.
 When you are measuring the number of responses in an experiment or in an observation, you are using a ratio scale. Zero responses means literally that there are no responses.
 This scale has a true or absolute zero point. Zero indicates an absence of the quality measured. Zero means "none". Zero height, zero weight and zero time mean that no amount of these variables is present.

Examples of Ratio:
i. Measures of weight
On a measurement of weight, it is meaningful to say that an object
weighing 30 kgs is three times as heavy as one weighing 10 kgs.
ii. Measures of height


A person who is 180 cm tall is 1.2 times taller than one who is 150 cm tall.
iii. Measure of time
Employees are expected to report to their work stations at 8.00 am.
Employee A reported at 9.00 am = 1 hour late
Employee B reported at 10.00 am = 2 hours late
Employee C reported at 11.00 am = 3 hours late
Employee D reported at 12.00 noon = 4 hours late

Employee D was four times as late as Employee A. You can also count how many employees came four hours late.

iv. Income
Person A= earns kshs. 60,000/=
Person B= earns kshs. 20,000/=
Person A earns three times as much as Person B.
The value of zero income means no earned income.
v. Length

vi. No of children in the family


- Family A =No child
- Family B =2 children;
- Family C =3 children.
vii. How many times a politician has been married
- None
- Once
- Twice
- Thrice
viii. How many organizations have supported needy students at the university?
- None
- One
- Two
- Three
 Summary: A ratio scale is one that has a meaningful zero point as
well as meaningful differences between the numbers on the scales.

SUMMARY OF MEASUREMENT SCALES


 The 4 types of measurement scales are nested within one another.
o The properties of the former are retained and additional properties
are gained.
o One can always go back to a lower measurement level from a
higher one. The reverse is not generally true.
 If feasible, always aim for the highest measurement level possible

Nominal: identity
Ordinal: identity + order
Interval: identity + order + equal intervals
Ratio: identity + order + equal intervals + absolute zero

SUMMARY OF DATA TYPES


From the measurement scales we collect two types of data, namely:
i. Quantitative data (numeric data)
ii. Qualitative data (non-numeric data)

Data Types:
- Quantitative: ratio, interval
- Qualitative: nominal, ordinal

The statistics that can be computed differ by measurement scale ("aka" means "also known as").

Precision increases from nominal (lowest), through ordinal and interval, to ratio (highest).
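The sketch below is my own illustrative addition (it is not part of the original notes): it shows, in Python, the summary statistics conventionally regarded as meaningful at each scale — mode for nominal, median for ordinal, mean for interval, and ratios only for ratio data. The sample values are invented.

import statistics

# Nominal: categories only -> the mode / frequency counts are meaningful.
marriage_pref = ["church", "traditional", "come-we-stay", "traditional", "civil"]
print(statistics.mode(marriage_pref))      # the most frequent category

# Ordinal: ranked categories -> the median (middle rank) is meaningful.
satisfaction = [1, 2, 2, 3, 5]             # 1 = very unsatisfied ... 5 = very satisfied
print(statistics.median(satisfaction))     # distances between ranks remain unknown

# Interval: equal intervals but an arbitrary zero -> means and differences are
# meaningful, but ratios are not (20 C is not "twice as hot" as 10 C).
temperatures_c = [20, 25, 30]
print(statistics.mean(temperatures_c))

# Ratio: true zero -> means, differences AND ratios are all meaningful.
weights_kg = [10, 20, 30]
print(weights_kg[2] / weights_kg[0])       # 3.0 -> "three times as heavy" is valid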


LECTURE 3: THE USES OF TESTS: FUNCTIONS OF MEASUREMENT AND EVALUATION
Welcome to lecture 3.

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Discuss the various uses of tests in educational settings.
 Explain the side effects of school-based testing.

The ultimate purpose of testing is to enable the decision-making process


so that improvement can be made. Testing can have many purposes.
General purposes that facilitate the decision-making process that are
applicable to pre-school to secondary teachers include:
1. Placement
Tests can be used to place students in homogenous classes or groups or
streams according to their abilities. This is used to judge the capacity of
students to learn and to plan for instruction that fit the learner and his
capacity. That is, placement is meant for maximum teaching efficiency.
Hence tests are used to identify exceptional children. This was the
primary purpose of testing when school testing was introduced in 1905 in
France (Ref. Binet IQ test- for ability grouping/streaming). – below
average; average; bright
 In physical education adult fitness tests are used to determine
current status so that an individualized programme can be
prescribed.


 Tests are used to determine the grade or year in which the pupil should be enrolled.
 Putting learners in either homogeneous or heterogeneous groups for purposes of instruction.

2. Diagnosis
Tests can be used to diagnose weaknesses or the learning difficulties of pupils. While placement usually involves the status of the individual relative to others, diagnosis is used to isolate specific deficiencies that make for low or undesirable status. In pre-school to secondary school settings, tests can identify areas where students need to make improvements.
 For purposes of inclusive education and the provision of a Least Restrictive Environment, diagnosis with respect to hearing difficulties, visual difficulties, epilepsy and other challenges is necessary. With effect from January 2018, Grade 1 admission is accompanied by medical reports.
 In physical education / exercise or health setting, test results are
used to diagnose a health problem. In PE a treadmill stress test is
used to screen for heart disease. (Cite 2016 business men climbing
Longonot). One died.
 In case of reading and writing, a child who suffers from dyslexia
writes in mirror images:
18 for 81
81 for 18
F for 7
b for d
5 for 2
A child who suffers from dyslexia speaks one thing and writes another
thing.


3. Evaluation of Achievement/Instruction
One goal of testing is to determine whether individuals have achieved the
course objectives.
Placement, diagnosis and the evaluation of achievement together form
the basis of individualized instruction. In pre-school to secondary school
settings, this can be the achievement of instructional objectives. Testing
is therefore used to assess and improve teaching.
4. Prediction
The test results can be used to predict the pupil's level of achievement in future activities, or to predict one measure from another, e.g. from KCPE to KCSE performance. Prediction often seeks information on future achievement from a measure of present status, and it may help students to select the activities they are most likely to master.
5. Readiness
Pre-school tests are used to measure the child's readiness for Standard 1
tasks.
6. Personal/Guidance and Counselling
 The test results can be used in making decisions about the future
e.g. subject choice/ career choice.
 Assist individual to make wise decision for themselves (personality
tests, aptitude tests, etc.)
7. Grading Classification
The test results can be used to assign pupils to a particular achievement
classification e.g. first class, second class and pass.

8. Programme Evaluation
Test results of participants can be used as one bit of evidence to evaluate
the programme. By comparing the KCSE results of a County school
against national norms or standards, important decisions can be made.
Comparing changes in class performance between tests can provide
evidence of the effectiveness of teaching.


X1 (Test 1) → X2 (Test 2)

 In this regard, tests are used to evaluate the effectiveness of the


teaching which again can lead to specific action. How effective is the
teaching at Kilimo Sec. School?

Kilimo Sec. School KCSE Results cf. National KCSE Results

 Used to check the progress of learning.


 How effective is the teaching at Kilimo Sec. School?

9. Motivation
Test scores can be motivating. Achievement of important standards can
encourage one to achieve higher levels of performance or to participate
regularly in physical activity. When children are given feedback or positive
remarks on their learning progress, they are motivated.

10. Selection and Award of Scholarships


The primary purpose of national examination in developing countries
including Kenya is selection-to determine [who are] the most suitable
candidates for a course, a class or university. Equity Bank uses KCPE
results to award scholarships to the best candidates.
 Tests are given to determine students' ability/suitability to enter a particular school/college. What does this mean with respect to the recently introduced 100% transition for KCPE candidates?


11. Evaluation of the Effectiveness of Teaching Methods

Which method is more effective? If the County or the country has to choose one of the methods for implementation, learners must be subjected on an experimental basis to the two methods. The test results will then be used to make a decision.
Teaching Method 1 (audio only): X1 (pretest scores) → X2 (post-test scores), Mean = 50

Teaching Method 2 (audio and visual): X1 (pretest scores) → X2 (post-test scores), Mean = 80

The question is which alternative teaching technique is more effective. From the learners' post-test scores, it appears teaching method 2 is better and hence can be adopted (a short computational sketch follows).
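Below is a minimal computational sketch (my addition, using invented score lists) of how the post-test means of the two methods could be compared in Python; in a real evaluation the difference would also be tested for statistical significance.

from statistics import mean

method_1_posttest = [45, 52, 48, 55, 50]   # audio only (invented scores)
method_2_posttest = [78, 82, 79, 85, 76]   # audio and visual (invented scores)

m1, m2 = mean(method_1_posttest), mean(method_2_posttest)
print(m1, m2)   # 50.0 and 80.0, matching the means quoted above

# The method with the higher post-test mean appears more effective, subject to
# a proper significance test (e.g. a t-test) in a real study.
print("Method 2 better" if m2 > m1 else "Method 1 better")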

SIDE EFFECTS OF TESTING


While testing has individual benefits, it also has side effects, including:
(i) The results may adversely influence the teacher's subsequent judgments. For example, even where a bright student makes mistakes, the teacher may ignore them.
(ii) Teachers may allow their knowledge of a child's ability in one area to affect their judgment in another area (the "halo effect"). See the article below that appeared in the Daily Nation of Sept 13, 2014. In a country with a high level of ethnicity, the use of random numbers for both the candidate and the school may be the remedy for the "halo effect" in national examinations.
In short, the "halo effect" is a consistent error/bias that occurs when the examiner's general impression biases his/her rating/scoring of the candidates/ratees. This can be on the basis of gender, ethnicity, class ranks or knowledge of the candidate's previous performance.
The case of “Halo Effect”


Mitigation by KNEC
i. With effect from 2019, KNEC no longer requires candidates to use Index Numbers based on mock or other school examinations, but rather uses the Registration Numbers given when they entered Form 1.
ii. One other way is to randomize the candidates' numbers.

(iii) Labeling/disabling/stereotyping: Labeling students as a result of testing leads to disabling them and creating stereotypes, e.g. the former KCE grading into Division I, Division II, Division III and Division IV. No matter how good Division IV candidates were in some subjects, they were disabled by their label "Div. IV".
(iv) Racism justification. Tests can be used to discriminate against people on the basis of race. Americans and the Boers of South Africa used test results to discriminate against blacks [see the Arthur Jensen article attached].
(v) Destiny determination. Tests can bring about stress on the learners and may lead to suicide in extreme cases, particularly where results are of high stakes in society, e.g. Japan, where pupils commit suicide because of poor results, or getting a D in KCSE in this country.
(vi) Political misinterpretation of results:
a. Kenya is notorious for accusing senior officials of KNEC of manipulating the examinations in favour of their home areas whenever the areas where they come from do well in national examinations. The same accusation is now shifting to the IEBC. Why? Everything in this country is seen along ethnic lines.
b. 1983 CPE History Paper. The following question was set in
1983 CPE History Paper:


“I was an African leader. I was overthrown by a


military coup. I went to exile. I died while in exile. My
remains were brought home for re-burial. Who am I?”
A. Gamal Nasser
B. Kabaka Mutesa II
C. Haile Selassie
D. Idi Amin Dada
Though the question was well set and good in terms of being analytical and drawn from the syllabus, it was misinterpreted by KANU politicians as having been designed in bad faith, that is, that KNEC was inculcating the idea of overthrowing the government in children's minds.

(vii) "Self-fulfilling prophecy". The teacher's knowledge or expectation of a child's ability may affect the achievement of that child; the child who is expected to be bright makes more progress than the child of whom there are lower expectations of success. This has come to be called the "self-fulfilling prophecy".
Teachers' expectations can create self-fulfilling prophecies. In general, self-fulfilling prophecies occur when false beliefs create their own reality. In the classroom, a self-fulfilling prophecy occurs when a teacher holds an initially erroneous expectation about a student and, through social interaction, causes the student to behave in such a manner as to confirm the originally false (but now true) expectation.

 In other words, this is a situation where a student becomes


what the teacher expects him/her to become.

 A situation where the teacher’s knowledge/expectations of


the learner’s ability affects the future achievement of the
learner.

In the Pygmalion study, class teachers were told that the intelligence test that had been administered to the children had identified 20% of the children as "late bloomers". This was not true. The 20% of the children identified to the teachers had actually been selected at random.

Results after 1 year: teacher expectations created a self-fulfilling prophecy. The 20% of the children actually "bloomed" and gained more IQ points than the rest. Why? Possibly the teachers interacted more with those pupils and paid more attention to them in class work.

(viii). The cost of internal testing in Kenya is passed on to parents even


though it should be covered under free primary education. Even under
free primary education parents are paying for these tests.
(ix) "Leniency/severity errors" are common in essay tests. This is where examiners/raters give ratings that are consistently too high or too low.
(x) "Error of central tendency". Some examiners tend to avoid the extreme categories of the scale, concentrating instead on categories around the midpoint or average. This is called the error of central tendency. It is also common in essay tests.

How are some of these errors minimized?


1. Training on marking or application of the scale.
2. Adequate coordination of examiners.
3. Clear marking scheme.
4. Retiring erratic/generous/hard markers.
5. Randomizing candidates' Index Numbers.
6. Blind marking: candidates' names are deleted or hidden.
7. Using school registration numbers for candidates, which do not reflect the ability of learners.


LECTURE 4: EXAMINATION SYSTEM IN KENYA

Welcome to lecture 4.

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Discuss the current place of national examinations in
Kenya.
 Discuss public criticisms of national examinations in
Kenya.
 Explain non-normality in examinations.
 Discuss the proposed Competency Based Education.

Since the colonial era, examinations have been part and parcel of
education in Kenya. During the colonial period secondary school
examinations for Form IV and Form VI were conducted by Cambridge
Examinations Syndicate.

Why do we need national assessments?


1. We need to know how well students are learning especially in
regard to the:
(a) National education goals; (what are our national education
goals).
(b) Aims of the curriculum.
(c) Preparation for further learning and lifelong learning.

In 2016, only 15% achieved between A and C+. This is the group that demonstrated mastery of the subject content and readiness for further learning. The implication of the 15% is that the affirmative action policy will not be applied this year in university admission; if it were applied, candidates with grade C would be admitted.

2. The examination should act as an assessment of students' knowledge and skills: a tool for testing that a certain level of education has been achieved by the learner. This is an important aspect for diagnosing learning competencies, placing students in the right career and/or academic tracks, and identifying students who need remedial teaching.
3. Assessment can help us understand inadequacies of learning outcomes by subgroups such as sex, geographical region and social class. It would be of grave concern to the public and the Ministry of Education if, for example, the 6% who achieved grade E and the 45% between D and D- belonged to certain geographical regions, such as the poor ASAL areas.
4. The examination results could be used to interrogate what factors
are associated with performance such as student facilities, student-
teacher ratio, student-textbook ratio, school leadership, etc. The
2016 KCSE results clearly indicated that some of these factors
require scrutiny.

Colonial Period
1925: Nairobi European School did worse than Indian schools in KCE. The results were suppressed. The white community did not want to hear that Indian students had done better than the whites.
1940: Alliance High School and Mangu High School did better than the Prince of Wales School. The white community was unhappy.
Post-Colonial
After independence, Kenya, Uganda and Tanzania under the umbrella of
East African Community established the East African Examinations Council
to take over from Cambridge Examinations Syndicate.


Due to ideological differences between the three countries, particularly between Kenya and Tanzania (the capitalism of Kenya and the socialism of Tanzania), the East African Community was disbanded in 1977 and each country established its own examination board.
Briefly, between 1977 and 1980, examinations were conducted by the Examinations Department in the Ministry of Education.

KNEC Legal Mandate


In 1980, KNEC was legally established. KNEC is legally mandated inter
alia to:
(i) Conduct all school examinations.
(ii) Authorize other foreign examining boards to conduct their
examinations in the country.
(iii) Issue certificates and diplomas to successful candidates.
(iv) Ensure security of examinations
(v) Take legal and disciplinary actions against those involved in
examination malpractices, including prohibiting a candidate who
cheats in examination from taking an examination for a period
not exceeding three years. Universities should also consider this
option.
(vi) The 1980 KNEC Act was repealed in 2012 and further amendments were made in 2015. The 2015 Amendment created the National Examinations Appeals Tribunal to deal with appeals from aggrieved parties on the withdrawal, nullification and cancellation of results.
Since independence, the various education commissions that have been set up to look into educational reforms have all made recommendations on the examination system. These recommendations included:
(i) Change of examination name
 Kenya African Preliminary Examinations (Colonial name)


 Kenya Preliminary Examination (post-colonial name)


 Certificate of Primary Education (post-colonial name)
 Kenya Certificate of Education (For Form IV)
 Kenya Advanced Certificate of Education (For Form VI)
 Kenya Certificate of Secondary Education from 1984.
 Kenya Certificate of Basic Education (KCBE) for 2-6-3-3
system. To be administered in Form IV.
(ii) Number and combinations of subjects to be offered in the
examination.
The current main examinations that are conducted by KNEC are:
1. KCPE
2. KCSE
3. Primary Teachers Examinations (PTE), Technical and Business at
craft level, certificate, diploma and higher diploma.

Discuss triangular relationship between MoE (policy issues), KICD


(curriculum development and review) and KNEC (examination of
the achievement of the curriculum).
MoE (policy issues) — KICD (curriculum development) — KNEC (examining the achievement of the curriculum)


Organizational Structure of KNEC


On policy decisions, KNEC has a Council. The Council members are appointed by the Cabinet Secretary in accordance with the KNEC Act. Besides general administration, the main departments at KNEC are:
 Test development department.
 Test administration department.
 Research department.
The secretariat is headed by Council Secretary (CEO) who is responsible
for day-to-day running of KNEC. The key officers in KNEC are subject
officers/subject experts. The subject officers are people who have been
teachers. The responsibilities of the subject officers include:
(i) Being in charge of specific subject, e.g. CRE, biology, English,
mathematics
(ii) Participating in curriculum development at Kenya Institute of
Curriculum Development
(iii) Identifying test developers.
(iv) Development of the table of specifications.
(v) Organizing moderation and Awards Sessions.

KNEC's Role in Conducting School Examinations


With respect to examinations, KNEC has four primary roles in ensuring
that the national examinations are fair to all candidates, namely:
 Examinations development
 Examination Administration
 Examination processing and issuing certificates and diplomas
 Preparation of post-mortem reports

Examination Development
Examination development is a process that runs from test planning to the moderation of draft questions. The subject experts are guided through the table of specifications by the subject officer. The subject experts use the syllabi to prepare the test items. Once the test items have been prepared, they are subjected to moderation by another team of subject experts. Moderation is a process of ensuring that test items are not ambiguous and are free of errors.

Item Banking
Discuss the concept of Item Banking and its importance and
challenges in a corrupt society like Kenya.
 Globally used by examination boards in Europe
 KNEC used item banking until 2016 when CS ignorantly directed
KNEC to abolish it.
 Used at the Open University of Tanzania
 Useful when the curriculum has stabilized
 In case of a major disaster, like a plane crash while bringing exams from the UK where they are printed, or a mass leakage that may lead to cancellation of the entire examination, item banking is useful.

Examination Administration
This is a process of field administration of the examinations to ensure that
no candidates receive unfair advantage. This involves:
(i) Recruitment of examiners, supervisors and invigilators.
(ii) Briefing of supervisors and invigilators.
(iii) Distribution of question papers to all examination centres
(iv) Ensuring that transport for examinations and personnel, including security during the examination period, is in place.
(v) Training of examiners on marking skills and co-ordination of the
entire examination processing.
(vi) Compiling records from field reports of examination irregularities
encountered.


(vii) Security of examination for the purpose of integrity of


examinations.

Examination Processing
Examination processing is an exercise that involves:
(i) Marking being the main activity. KNEC examinations
marking/scoring is in two parts, namely:
(a) Manual marking for non-objective test items (essay)
(b) Mechanical marking for objective test items using
Optical Mark Readers (OMR)/ scanners. Also referred to as
Optical Mark Recognition.
Advantages of Optical Mark Readers (OMR)
 Speed
 Accuracy
 Cost –efficiency - no markers

(ii) Grade awards. Determining grade levels for various subjects.

Post-mortem Reports
After the examination, KNEC carries out a post facto analysis of the candidates' performance per subject. The purposes of this exercise are:
 To identify the general performance trend of the candidates.
 To identify the test items that attracted poor performance and that reflect poor teaching.
 To assist teachers to know the areas where candidates performed poorly.
What will happen to the 2016 KCSE results, since the government is hiding the subject-by-subject performance?


Criticisms and Effects of Examinations


The criticisms and the effects of KNEC examinations, particularly in the
case of KCPE and KCSE include:
 Teachers' use of examinations to teach, hence corrupting teaching.
 Promotion of teachers on the basis of good results.
 Demotion of Head teachers on poor results.
 Transfer of teachers and Head teachers on poor results.
 Blocking of Head teachers' offices by parents over poor results.
 That the examinations promote rote learning.
 Elitism: that examinations favour children from privileged backgrounds.
 That the examination setters come from high-performing schools.
 Makes education examination-oriented. Since independence this is the criticism that has led to education reform.
 Brings unhealthy competition between schools, whereby private schools use examination results for commercial purposes. Poor learners are directed to do examinations outside their own schools. Cheating has also been associated with the pressure to get good grades.
 The whole destiny of the child's life is determined by the national examination results.
 Increased coaching and drilling of students.
 Motivation for subject teachers: teachers are rewarded in the form of trips or money when their students perform well in their subjects.
 That the examinations are manipulated to favour areas in regions where powerful politicians come from.
 That ethnic bias may enter into marking through the examiners' knowledge of the candidates' names or school.


 School ranking: Schools are not the same in terms of teaching


and learning resources and hence ranking is unrealistic.
 Testing primarily the cognitive domain: skills, values and attitudes are rarely tested.
 Seat-time education/time-based education. There is a perception that:
- 8.4.4 measures "seat time" – the amount of time students have sat in a classroom – rather than what students have actually learned or failed to learn.
- In 8.4.4, students' advancement is based on their age, regardless of what they have learned.
NOTE: The proposed CBE is not tied to seat time.
 Item Banking: Up to 2016 KNEC used an item banking system. This is an international practice. In 2016, the CS directed KNEC to abolish item banking on the assumption that it was the source of examination leakages.

A number of these criticisms are misconceptions and reflect ignorance of KNEC's strategies to achieve fair examinations.

RECENT DEBATE ON KCSE PERFORMANCE


The Concept of a Normal Curve in Examination Results

In the 19th century, while studying individual differences, a psychologist by the name of Sir Francis Galton (1822-1911) came up with the concept of the normal distribution of human behaviors. The normal curve is a statistical model that is used throughout the physical and social sciences. The graph below shows the scores of the Scholastic Aptitude Test (SAT) in 2010. The SAT is used by USA universities for selection/admission to undergraduate programmes.


The 2016, 2017 and 2018 KCSE examination results did not reflect a
normal distribution. Abilities or behaviors of a large population assume a
normal distribution, unless there are biases. When a distribution is
normal, it should be like the bell-shaped figure given below.

[Figure: bell-shaped normal curve marked at -3, -2, -1, 0, +1, +2 and +3 standard deviations, with about 0.13% of cases below -3 and about 0.13% above +3; low scores on the left, high scores on the right, the average at the centre.]

In a normal distribution, the proportion of students below -3 standard deviations is the same as the proportion of students above +3 standard deviations.
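The following small sketch (my addition, not part of the original notes) verifies numerically, using Python and scipy, the approximately 0.13% figure quoted for each tail beyond three standard deviations of a normal distribution:

from scipy.stats import norm

below_minus_3 = norm.cdf(-3)      # proportion of cases below -3 standard deviations
above_plus_3 = 1 - norm.cdf(3)    # proportion of cases above +3 standard deviations
print(below_minus_3, above_plus_3)  # both about 0.0013, i.e. roughly 0.13% in each tail

# About 68%, 95% and 99.7% of cases fall within 1, 2 and 3 SDs of the mean.
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))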


KCSE PERFORMANCE TRENDS 2014-2019

Year | No. A | No. C+
BEFORE MILITARIZATION OF EXAMINATION ADMINISTRATION
2014 | 3,073 | 149,717
2015 | 2,636 | 165,766
MILITARIZATION ERA
2016 | 141 | 88,929
2017 | 142 | 70,073
2018 | 315 | 90,377
2019 | 627 | 125,746

Source: Daily Nation, Thursday, December 19, 2019, p. 4.

2019 KCSE

2019 KCSE candidates=697,222

Overall national grade summary for the 2019 KCSE Examination

Sex | A | A- | B+ | B | B- | C+ | C | C- | D+ | D | D- | E
F | 269 | 2172 | 5145 | 9803 | 14961 | 21425 | 32084 | 43083 | 51813 | 69809 | 76198 | 12936
M | 358 | 3624 | 8221


A- to A = 6,423 = 0.92%
A- = 5,796
C+ and above = 125,746 = 18%
C- and below = 541,476 = 78.66%

Conclusion: the distribution is skewed.

Gender: Boys = 355,782 (51%); Girls = 341,440 (49%).
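As a small added sketch (not in the original notes), the percentages quoted above can be reproduced from the counts with a few lines of Python:

total_candidates = 697222

a_minus_to_a = 6423    # candidates scoring A- or A
c_plus = 125746        # the C+ count quoted above (18%)
boys, girls = 355782, 341440

print(round(100 * a_minus_to_a / total_candidates, 2))  # about 0.92 %
print(round(100 * c_plus / total_candidates, 1))        # about 18.0 %
print(round(100 * boys / total_candidates), round(100 * girls / total_candidates))  # 51 and 49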

2016 KCSE

The 2016 KCSE examination results reflected an extremely positively skewed distribution (Source: Economic Survey 2017, pp. 51-52).

The extreme ends of the curve are not comparable: 141 (0.02%) vs 33,399 (5.85%). A and A- = 4,786 (0.93%), and D- and E = 183,328 (32.10%).


Causes of Non-Normality
1. Poorly set examination. This is possible in the 2016 examination, where the setters copied test items from some commercial revision texts. Newspapers printed out the exact papers from the commercial texts; hence those schools that used those books had an unfair advantage. To hide this flaw, no extra examination papers were left in schools. Because KNEC was directed to disregard the item banking practice, the examinations were developed in a hurry.
2. Lack of content validity. This means the examination items were
drawn from outside the content taught.
3. Difficult examination - leading to the majority of students
performing poorly. A difficult examination does not discriminate
bright students from dull students.
4. Poor marking conditions. This is a possibility. Reports indicate
that markers worked for long hours (6.00 am - 1.00 am) and
experienced fatigue. Under these hostile conditions most markers
are not willing to participate in 2017.
5. Poor coordination of examiners and hence failure to adhere to the
marking scheme. Evidence gathered by KNUT shows that there was
no standardization and moderation of the marking schemes.
6. The ignoring of the award stage - where the chief examiners of
every paper look at the performance in their subjects across the
country and then propose the grading system. This is meant to
normalize the grades. In 2016, this stage was ignored in the hurry to
release the results for political mileage. It is not possible to apply
the grading system for Mathematics to English. Performance varies
and therefore it must be moderated and standardized.
Ignoring the award stage means that one grading system was applied,
leading to the erroneous conclusion that the poor performance is
attributable to the elimination of cheating.
7. The lack of any A scores in English, for the first time in KCSE history,
is an indication of flawed marking. English is not one of those subjects
where cheating has been rampant in the past, so this cannot be
attributed to past cheating.
8. KCSE is not a criterion-referenced examination and hence
predetermined grades cannot be used to declare mastery or non-mastery
of skills. It is a norm-referenced examination, and this is how KCSE has
been treated in the past. It appears that it was treated as a
criterion-referenced test in 2016.

MASS FAILURE IN 2016, 2017 AND 2018 KCSE EXAMINATIONS

THE ABNORMALITY OF THE 2016, 2017 AND 2018 KCSE EXAMINATION RESULTS

The issues surrounding the debate and the politics of the 2016 and 2017
KCSE are:

1. Absence of a normal curve distribution of scores
2. Impact of examination cheating
3. Impact of absence of examination cheating
4. Norm-referenced testing
5. Criterion referenced testing
6. Non- standard practices

Normal Curve

In presenting the 2017 results to the President on December 20, 2017,
the Education Cabinet Secretary was reported by the print media as
having told him that the results were normally distributed. This statement
is false. A normal curve must take a bell-shaped distribution as shown
below.

In a normal distribution:

a) The proportion of those achieving A and A- and those achieving
D- and E should have been equal - approximately 0.13% on
each side of the curve. This was not the case in the 2017 KCSE:
A and A- = 0.5%
D- and E = 35%
b) Assuming that students were well taught and well tested, they
should get at least 50% of the required knowledge and skills,
and that would translate to a C average. This was not the case
in the 2016 and 2017 KCSE, where the average was D+.

2016 KCSE Distribution and 2017 KCSE Distribution

a. 2016 KCSE


b. 2017 KCSE

[Figures: score distributions for the 2016 and 2017 KCSE; the horizontal
axis runs from low scores to high scores, and the bulk of candidates sit at
the low-score end.]

Why Skewed Distribution?

The Cabinet Secretary for Education described the skewed KCSE performance
as the reality of the ability of the candidates. There are many factors that
can bring about this trend, including, as discussed earlier in this course:

 Difficult questions
 Poor teaching
 Poor content validity
 Rushed marking for political expedience.
 Poor marking
 Use of commercial test items that give advantage to
some schools. This was the case in 2016.
 Fatigued examiners
 Lack of standardization and moderation of test items
 Non-adherence to standard practices/procedures
 Treating KCSE as a criterion-referenced test rather than as
a norm-referenced test.

Impact of Poor Grades Vs Future Opportunities


Labels that individuals achieve in examinations are either enablers or
disablers. In a society that suffers from what has been referred to as
"diploma disease", achievement of poor grades denies individuals future
training opportunities. For 2016 and 2017, approximately 1 million form
four leavers achieved less than a C+ grade. These individuals have a bleak
future, as their low grades deny them opportunities for admission into
training institutions. This is one indicator of the poor efficiency of an
education system, and a very high wastage rate. In the last two years,
1,062,000 candidates achieved a D grade.

Where is the Future of this D Group?

The government is promising them opportunities in polytechnics and
progression from certificates through diploma to degree level. This is an
unrealistic and unachievable promise:

i. Polytechnics do not have the absorption capacity. In 2016,
all polytechnics admitted only 28,000.
ii. Not all students have the interest and the aptitude for
vocational and technical training.
iii. Progression from certificate level to degree is both
expensive and time consuming.
iv. Unrealistic to assume that opportunities for further
education and training for these large numbers will be
available in future.
v. Erratic policy changes on progression qualifications, where
earlier grades are used as the basis for admission for further
education and training - meaning those with poor grades at
early stages will permanently be consigned to oblivion. Those
who achieved Division Four under the KCE system of education
have found it difficult to advance their education and training.

vi. The country has a poor record of recognizing experiential and
prior learning for distance learners.

Impact of Examination Cheating

When cheating is a factor in an examination, the results cannot be
normally distributed. The results will be negatively skewed - a
situation where more students will have higher scores.

[Figure: negatively skewed distribution, with scores bunched at the high
end; the axis runs from low scores to high scores.]

Impact of Absence of Examination Cheating


In 2016 and 2017 cheating was minimized to a non-significant level. In
the absence of cheating, a true measure of the behavior of learners
should be normally distributed. But this was not reflected in the
2016 and 2017 KCSE examinations. Instead, the results reflected positive
skewness, a situation where more candidates achieved low scores - only
10% of the candidates (70,000 out of 651,000) had good enough grades
for university. In a non-cheating examination situation, the results
should yield a normal distribution, yet this was not the case in the 2016
and 2017 KCSE. Is there something wrong with the whole assessment?
Read Ollows' article, "Is it the teaching and learning that is inadequate?"
(Daily Nation, Sunday, December 24, 2017, p. 30).

But for the KCSE we have:

[Figure: positively skewed distribution, with scores bunched at the
low-score end; the axis runs from low scores to high scores.]

Treating KCSE as a Criterion-Referenced Test rather than as a
Norm-Referenced Test

The KCSE examination is a norm-referenced test and the results are
expected to be distributed according to the bell curve. The test aims at
comparing the students with each other and not against specified/pre-set
standards. There are no specified or declared standards for KCSE to
qualify as a criterion-referenced test. Read Ollows' article given below.

(i) Norm-Referenced

[Histogram: percentage of pupils (0-100%) plotted against marks
(10%-100%); the scores are spread across the whole range of marks.]

(ii) Criterion-Referenced

[Histogram: percentage of pupils (0-100%) plotted against marks
(10%-100%); the scores are bunched towards the upper (mastery) end.]

The criterion referenced assessor would be reasonably pleased
with the skewed distribution of the scores because his intention
was to show whether pupils had learned well enough to attain the
objectives. The norm referenced assessor prefers to set his test
to ensure a wider spread of scores over a broader band of
objectives (normal curve), such as is necessary for grading results
in public examinations. He will be able to tell whether one student
is either more or less expert than another, but cannot in the end
say what aspect of the subject matter has been mastered or
expertise acquired. In other words, he will not be able to say
whether this or that objective has been attained.

NON-STANDARD PRACTICES/PROCEDURES

The non-standard procedures adopted may have affected the 2016 and 2017
KCSE examination results. Read the Daily Nation, Saturday, December 23,
2017, p. 1 on the teachers' pain in marking the KCSE. These non-standard
procedures include, inter alia:

1. Fatigue of the Examiners
i. Standard procedure: Examiners as human beings
require time for breaks and relaxation. In the past,
marking would start at 8.00a.m and end between
4.00p.m. and 7.00p.m.

ii. Abnormal procedure: It was reported that examiners who
handled English Paper 3, one of the most taxing examinations,
worked from 4.00 a.m. to 10.00 p.m. (16 hours each day).

a) This led to widespread fatigue. Fatigue allows mistakes
to pass and marks to be wrongly awarded.
b) Examiners who required medical attention were attended
to within the marking centre. The death of one examiner
was attributed to marking pressure (Daily Nation,
Saturday, December 23, 2017, pp. 4-5).

2. Lack of established marking standards or uniformity
i. Standard procedure: Traditionally, the examiners
are required to take two days to familiarize
themselves with the marking scheme by going
through dummies. Marking scheme must be
congruent with the questions.

ii. Abnormal Procedure: In the 2017 KCSE the familiarization
took only half a day. Therefore inconsistencies in marking
were bound to happen.

3. Mismatch between some test items (questions) and expected
responses (answers)
i. Standard Procedure: Validity and reliability can
only be achieved if the question set and the answer
match. One examiner reported that in 2017 KCSE
there were errors in the marking schemes but these
could not be corrected because there was no time to
do it. The emphasis appears to have been on
the markers to finish marking and have the
results released in record time (Daily Nation,
Saturday, December 23, 2017, p.4)

ii. Abnormal procedures: It was reported that at Moi Girls
School in Nairobi, where History Paper 2 was being marked,
examiners realized that one question had the wrong answer.
However, nobody cared to do the correction, as the push was
to finish marking as quickly as possible. The implication is that
candidates were disadvantaged. Moderation of marking would
have taken care of this error through compensation. The
Ministry has removed this key process. The Cabinet Secretary
(CS) unfortunately called the moderation process "massaging"
of results.

4. Lack of Inter-examiners‟ Reliability Process


i. Standard Procedure:
During marking, examiners are put in a pool of seven with a
team leader. For every 10 scripts marked, the team leader has
to review (re-mark) at least two scripts, picked randomly, to
verify that they have been marked well. The marking error
allowed is -2 or +2, and if an examiner goes beyond this
margin, he/she is forced to re-mark the entire batch of scripts.
If he/she consistently makes the same mistake, he/she is
retired (expelled) from the marking exercise.

ii. Abnormal Procedure:

In the 2017 KCSE, it was reported by the examiners that the
inter-examiners' process was not done. Skipping this process
enabled many examiners to go for volume of scripts, since
payment is based on the number of scripts marked.

COMPETENCE-BASED EDUCATION/CURRICULUM (CBE)

Kenya plans to phase out 8-4-4 and introduce CBE. Why have the
previous reforms in Kenya failed?
 We abolished the 'A level' segment, saying it did NOT serve Kenya well.
'A level' serves Britain well and enables Britain to advance
technologically. So why is it not good for Kenya? Why is it that it
develops Britain and underdevelops Kenya?
 We changed to 8-4-4. It is an education structure that serves the USA
and Canada well. Why is it that it develops the USA and Canada and
underdevelops Kenya? Education reforms in Kenya are guided by
politics rather than professionalism.

What is the Competency-Based Curriculum (CBC) and what makes it
different?

The most important characteristic of competency-based education is that
it measures learning rather than time. Students progress by
demonstrating their competence, which means they prove that they have
mastered the knowledge, skills and attitudes (called competencies)
required for a particular course, regardless of how long it takes. While
more traditional models can and often do measure competency,
they are time-based — courses last about eight years, and students
may advance only after they have put in the seat time. This is true even
if they could have completed the coursework and passed the final exam in
half the time. So, while most colleges and universities hold time
requirements constant and let learning vary, competency-based learning
allows us to hold learning constant and let time vary.

Head + Hand + Heart = Competent Professional

Competence = the combination of knowledge, skills and attitudes
required to become a competent professional. [Refer to Nelson Mandela
on mother tongue.]

Benefits of Competency-Based Learning

There are many reasons schools are moving away from seat time and
toward competency-based learning. These reasons include:

 This is the type of education that combines the above in a graduate.


 Performance based/outcome-based education with clear
performance indicators.
 Education that incorporates real life assignments and assessment
(problem solving skills, meaningful, etc).
 Education that incorporates the requirements of the labour market
and industry.
 Education that gives students the opportunity to learn at their own pace.
 Learner-centred education.
 Competency-based learning keeps each student from getting bored.
 Competency-based learning allows each student to work at his or
her own pace.
 A greater mastery of the subject, done in a less frustrating and
more invigorating way. Students move at their own pace with the
curriculum.

In summary, it is argued by the proponents of CBE that competency-based
education is a new model in education that uses learning, not time, as the
metric of student success. This student-centered, accelerated approach
redefines traditional credit-based requirements in learning and stresses
competencies derived from the skills proven to be the most relevant by
educators and employers.

With competency-based education, institutions can help students complete
credentials in less time, at lower cost, with a focus on real-world learning
that leads to greater employability. This versatile model benefits the
student, the instructor, the institution, and the economy. Everything sounds
good with CBE, but...

Where has CBC succeeded?

CBE has been successfully implemented in South Korea, Japan, Finland,
and the Netherlands. This is primarily because of the culture of the people.
In South Korea children spend up to 14 hours in school (8.00 am to
11.00 pm) per day. Since most children do not get home until midnight,
dinner is served in school. Why? To get into a good college.

South Korea Education Structure

1. Pre-schools (3-6 year old children) = 2 years (optional)
2. Primary Education = 6 years
3. Lower Sec (Middle School) = 3 years
4. Upper Sec (High School) = 3 years:
   Academic Stream (62%)
   Vocational Stream (38%)

Where has CBC failed?

CBE has had little success, if any, in Africa. South Africa and Malawi tried
it and abandoned it. It was tried for 12 years in South Africa. Why did it
fail in South Africa?

 It involves too much administration and record keeping.
 It is very expensive.
 Exodus of teachers: many teachers left the teaching profession.
 For 12 years teachers never understood how to implement CBE.
 Assessment is very subjective.

Challenges of CBC
1. Failed system
It is also important to note that the proposed Competence-Based
Curriculum was tried for 12 years in South Africa and it failed and was
abandoned. In South Africa it was called Outcomes-Based Education
(OBE). It was also tried in Malawi and abandoned.
The proposed Kenyan CBE was borrowed from Japan and South Korea.
Kenya does not have the culture of these countries, and it is wrong to
assume that we will succeed. Do we have the Asian culture that has been
responsible for the success of CBE? No.

2. Assessment of students in CBC is subjective

In Kenya 70% of assessment will come from CA (continuous assessment).
The proposed competence-based curriculum proposes the replacement of
national examinations, as the tool for certifying the achievement of an
educational level, with continuous assessment. As a country we are at a
stage where the government does not respect and trust teachers. The
invigilation and the marking of the 2016 KCSE examinations is a clear
demonstration of this mistrust. In pointing out his level of mistrust of
teachers, the CS in charge of Education said, "With the use of ICT, we
eliminated people who were changing marks" (Standard, January 12,
2017, p.8). The people being referred to are teachers. These are the same
people that the proposed CA system will rely on. Will ICT be able to check
CA awards?

Is this CA policy direction not paradoxical? Will the teachers' assessment
be a fair tool for assessment in a country with limited resources? Will this
approach eliminate competition in the job market place? What about
corruption? What about the "halo effect"?
The use of CA works well in countries where recruitment is not strictly
based on certificates but on what a candidate can do. This approach of CA
failed when it was used by KNEC in the Primary Teachers Examinations in
the early 1980s. At that time 40% of the final grade came from CA and
60% from the KNEC examination. Teacher trainees were passing CA but
failing the KNEC examinations.
3. Low public participation. This proposed system is being driven
politically. No sessional paper was produced before rolling it out;
one is expected in June 2019, three years after its implementation.
4. Too futuristic and impractical to implement in Kenya in its current
form.

5. Expensive to implement.
6. Overloaded syllabus. Between 11-12 subjects to be taken.
7. Pupils‟ progression from primary to secondary unclear.
8. Teacher-based subjective assessment in determining movement
from primary to secondary is biased, and hence pupils cannot go to
their school of choice in the absence of a standard, unifying national
examination.
9. Damages national integration because children will remain in
their neighborhoods.
10. Demands literacy of parents.

LECTURE 5
Welcome to lecture 5.

EVALUATION AND PERFORMANCE STANDARDS
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Define evaluation
 Discuss the purpose of evaluation.
 Discuss types of educational evaluation.
 Discuss standards of evaluation.

What is evaluation?
In terms of measurement, evaluation is defined as the ability to make a
judgment on the basis of given information: judging the worthiness of a
programme, project or course. It is the use of measurement in making
decisions.

The Birth of Evaluation


STORY 1: A story about the "birth" of evaluation (the story of creation and
the concept of evaluation)
Genesis 1:1: 'In the beginning God created the heavens and the earth.'
Genesis 1:31: 'God saw all that he had done, and it was very good.'

A story is told to illustrate the birth of the concept of evaluation. "In the
beginning God created the heavens and the earth" (Genesis 1:1). And God
saw everything that He had made, and said "behold, it is very good"
(Genesis 1:31). And the evening and the morning were created on the sixth
day. And on the seventh day God rested from all His work. His archangel
came then to Him asking, "God, how do you know that what you have
created is 'very good'? What are your criteria? On what data do you base
your judgement? Aren't you a little too close to the situation to make a fair
and unbiased evaluation?" God thought about these questions all day and
His rest was greatly disturbed. On the eighth day God said, "Lucifer, go to
hell". Thus was evaluation born in a blaze of glory ... a legacy under which
we continue to operate.

STORY 2: A story about what evaluation is about: the story of the lost key
and the concept of evaluation
Evaluation is about making judgements and asking questions. A story is
told of a man who once found his neighbor down on his knees under a
street lamp looking for something.
"What have you lost, my friend?"
"My key", said the man on his knees.
After minutes of searching the neighbor asked,
"Where did you drop it?"
"In that dark pasture", replied his friend.
"Then why, for heaven's sake, are you looking for it here?"
"Because there is more light here".

The foregoing story serves to illustrate what evaluation is out to achieve.
Evaluation is simply an attempt to throw some light on the dark parts of a
project or programme.

SUBJECTS FOR EVALUATION: WHAT DO WE EVALUATE?


There are many areas that can be evaluated including:
1. Students
For students' evaluation, we cover such variables as:
 Achievement
 Aptitude
 Intelligence
 Personality
 Attitudes
2. Curriculum
For curriculum evaluation, we cover such areas as:
 Instructional materials
 Instructional strategies
 Textbooks
 Audio-visual materials
 Organisation

3. School
For school evaluation, we focus on how the education programme
functions, with respect to:
 School activities
 School resources
4. Programme
For programme evaluation, we look at the following:
 Needs assessment
 Initial objectives of the programme
 Activities
 Impact
5. Personnel
For personnel evaluation (staff evaluation) we look at:
 Performance
 Productivity

Evaluation Design
Assignment:

 Evaluate the teaching of national cohesion at the university

 Evaluate school feeding programme in ASAL areas

To do these exercises you need a design.

A design is a road map, an architect's plan for doing something. How do
you go about doing it? Under evaluation design, you need to discuss:

 Purpose of the evaluation-why carry out evaluation?

 Evaluation objects (areas and questions)

 Evaluation model or methodology

Evaluation Objects: Questions you ask when carrying out an evaluation of
a project

What major areas, also known as evaluation objects, do you focus on when
you are carrying out an end-of-project evaluation exercise? What questions
do you ask when you are carrying out an evaluation? An evaluator is
seeking answers to the following questions:

 What were the intentions/objectives of the programme/project?
Examine the objectives.

 What have been achieved (the results)? Examine the results.

 What is still there (impact)? Examine the impact (what is still there
on the ground).

 What is the future of the project (sustainability)?

The Evaluation Model/Methodology

The design of the study is to a large extent set by the frames of the
assignment. The data has to be collected.

Methodology refers to how you collect data for evaluating a
programme/project. Evaluation leans on the collection of both qualitative
and quantitative data. The instruments you can use to collect evaluation
data include:

 Review of documents, reports and policy documents. These provide
questions to be asked in either the interview or questionnaire.

 Interviews with stakeholders/actual participants in the project/key
players.

 Observations. If it was a classroom-based project, e.g. a teaching
programme, carry out classroom observations.

 A questionnaire. This is used to tap some quantitative data regarding
the programme/project.

Purposes of Evaluation

Why carry out an evaluation? What is the purpose of evaluation?
Evaluations are initiated for many reasons, including:

i. To assess programme improvement

ii. To assess accountability

iii. To assess an educational process [Tracking Students' Progress]

iv. To assess achievement of interventions and for knowledge generation

v. For hidden agendas


1. Programme Improvement

This is an evaluation that is intended to provide information for guiding
programme improvements. The purpose of this evaluation is to help shape
the programme to perform better.

2. Accountability [e.g. Evaluating a School Feeding Programme]

This is an evaluation conducted to determine whether the project
expectations have been met. The purpose of this evaluation is to render a
summary judgment on the programme's performance.

Example

The investment in a school feeding programme is justified by the
presumption that the programme will attract children to school or benefit
children in those areas. Programme managers are thus expected to use
resources effectively and efficiently and actually produce the intended
benefits.

First examine the objectives of the feeding programme – why was the
feeding programme introduced?
 Objective: To assist the children from the poor background
to attend school.
 Question/results: Has higher school attendance been
achieved? Are there now more children from poor background
in school than before the project was introduced?
 If yes, then the project has met the objectives.
 If no, then the project has failed.
Is there still higher school attendance since the project ended?
(Impact).

How is the high attendance that has been achieved going to be maintained
in the future? (Sustainability.)

Evaluating a course, e.g. geography: first state the objectives of why
geography should be taught in high school. Then ask the learners
geographical facts to evaluate the achievement/failure of geography.

3. Evaluation as a Tool for Tracking Students' Progress

A primary purpose of teaching is to produce a measurable change in
behavior. Evaluation as an educational process is an assessment of how
well learners are doing and what effect the project is having on them.

Student      Test 1   Test 2   Test 3   Average
Student A    6        6        6        6
Student B    4        6        8        6

Here you can see that both students have an average grade of 6. But
what does it really say? If you just show your student this grade, it means
nothing. If you dig deeper and take a look at the process, you can see
that one student actually did better than the other. Student B shows
progress and improvement in the learning material. Student A has the
same grade as Student B, but he‘s stuck. He doesn‘t get a complete grip
on the learning material, and only masters some parts of it.

4. Knowledge Generation through interventions

 Sometimes evaluations are undertaken to describe the nature and
effects of an intervention as a contribution to knowledge.

 To determine the extent to which the goals of interventions have
been achieved.

Intervention

[Diagram: Smoking addiction -> demonstration/intervention programme or
a new approach to a social problem (e.g. therapy for drug addiction, such
as giving the addict coffee every hour) -> stop smoking.]

An evaluation can be carried out to find out if the intervention can be
implemented more widely.

5. Hidden Agendas

Sometimes the true purpose of the evaluation has little to do with actually
obtaining information about the programme‘s performance.

i. Programme administrators or boards may launch an evaluation
because they believe it will be good public relations and might
impress funders or political decision makers.

ii. Occasionally, an evaluation is commissioned to provide a rationale
for a decision that has already been made behind the scenes to
terminate a programme, fire an administrator, etc.
iii. The evaluation may be commissioned as a delaying tactic to
appease critics and defer difficult decisions, e.g. the Salaries and
Remuneration Commission job evaluation as a delaying tactic on
teachers' salaries.

EVALUATION PHASES AND PROJECT CYCLE

Evaluation is a continuous process. The evaluation process is cyclic, with
feedback from one cycle guiding the next. We do not just evaluate
outcomes: every stage of the process is subject to evaluation, beginning
with the objectives.

Example 1: Evaluation of student achievement

A teacher uses performance data not only to evaluate student progress
but also to evaluate his/her own instruction. That is, the process of
evaluating students provides feedback to the teacher.
The components of the evaluation model are:

Step 1 Objective

Step 2 Pretest

Step 3 Instruction

Step 4 Measurement

Step 5 Evaluation

The figure given below is an evaluation model appropriate for use in
primary and secondary educational settings.

Step 1: Objective
Preparation of the objectives is the first step in the evaluation process,
because objectives determine what we will seek to achieve. The
objectives give direction to instruction and define what behaviors we want
to change.

Step 2: Pretest
With some type of pretest, we can answer three questions:
1. How much had already been learned?
2. What are the individual‘s current status and capabilities?
3. What type of activity should be prescribed to help achieve the
objectives?

Step 3: Instruction
Sound instructional methods are needed to achieve the agreed-on
objectives. Different instructional procedures may be needed to meet
students‘ individual needs.

Step 4: Measurement
This involves the selection or development of a test to gauge the
achievement of the objectives. It is crucial that the test be designed to
measure the behavior specified in the objectives. The objectives can be
cognitive, affective, or psychomotor. The key element is to select or
develop a test that measures the objectives. Often, teachers will need to
develop their own tests, because standardized tests may not be
consistent with their instructional objectives. This is a common method
used to provide evidence of test validity in educational settings.

Step 5: Evaluation

Once the instructional phase has been completed and achievement has
been measured, test results are judged (i.e. evaluated) to find whether
the desired changes achieved the stated objective.

What happens if students do not achieve the desired objectives? The figure
given above shows a feedback loop from evaluation back to each
component of the model. Failure to achieve the stated objectives may be
due to any segment of the model. First, it may be discovered that the
objectives are not appropriate and may need to be altered. The
instruction may not have been suitable for the group, or the selected test
may not have been appropriate. The educational evaluation model is
dynamic. The evaluation model provides information needed to alter any
aspect of the educational process.
Example 2: Project Evaluation Phases

Other situations where evaluation is applied include projects. The main
stages that have been identified in the project cycle are:

1. Project Identification. This involves:
 Baseline survey** [situation analysis]
 Information / data gathering
 Prioritization of the challenges
2. Project Planning. This involves:
 Identification of resources
 Setting project objectives expressed in terms of quantifiable,
measurable outcomes.
 Identification of risks and contingencies (E.g. political instability,
donor withdrawal)
3. Project implementation. This involves:
 Role division
 Team work
 Monitoring activities (periodic review of progress)

 Budget control
4. Project Evaluation. This involves:
 Effect / impact assessment
 Identifying side effects
 Accountability

Donors like the World Bank and the African Development Bank use the
project cycle given above in evaluating the projects they fund.

** A baseline survey is the survey carried out before a project is initiated,
thus providing a basis for comparative judgment on the situation after the
intervention.

Project Risks
Any project has risks that need to be pointed out. In any project we have:
Manifest effects = intended effects, e.g. encouraging school enrolment.
Latent effects = unintended effects, and these include project risks,
e.g. warriors also became interested in going to school.
In Kenya project risks may include:
i. Political instability.
ii. Lack of funds to complete the project.
iii. Change of leadership.

The concepts of manifest and latent effects are sociological concepts first
formulated by Robert Merton.

TYPES OF EDUCATIONAL EVALUATION


There are two types of educational evaluation, namely:
 Formative evaluation
 Summative evaluation

[Diagram: formative evaluation takes place during the programme (as part
of the teaching interaction); summative evaluation comes at the end of the
programme.]

Formative Evaluation
This is the judgment of achievement during the formative stages of
learning. Feedback is one of the most powerful variables in learning.

Formative evaluation is used to provide feedback to learners throughout
the instructional process.
The main purpose of formative evaluation is to determine the degree of
mastery of a given learning task and to pinpoint the part of the task not
mastered. The strength of formative evaluation is that it is used to
provide feedback throughout the instructional unit.

Summative Evaluation
This is the judgment of achievement at the end of an instructional unit,
and typically involves the administration of tests at the conclusion of an
instructional unit or training period. Summative means “totaling up” to
indicate the level a learner has reached in a subject.

Similarities and Differences between Formative and Summative Evaluation

Purpose - Formative: feedback to student and teacher on student progress
throughout an instructional unit. Summative: certification or grading at the
end of a unit, semester or course.

Time - Formative: during instruction. Summative: at the end of a unit,
semester, or course.

Emphasis in evaluation - Formative: explicitly defined behaviors.
Summative: broader categories of behaviors or combinations of several
specific behaviors.

Performance standard - Formative: criterion-referenced. Summative:
norm-referenced or criterion-referenced.

PERFORMANCE STANDARDS
Performance standards are the criteria to which the results of
measurement are compared in order to interpret them. A test score in
and of itself means nothing. There are two ways with which we can
interpret the results of a test. The two most widely used types of
performance standards are:
 Norm – referenced standards
 Criterion – referenced standards
These are the two bases for the comparison of performance or
interpretation of performance.

Norm-Referenced Standards

A norm-referenced standard or test is used to judge an individual's
performance in relation to the performances of other members of a
well-defined group - for example, 14 year olds. KCPE is an example of a
norm-referenced standard or test. That is, norm-referenced tests yield
information regarding the student's performance in comparison to a norm
or average of performance by similar students.

Norm-referenced standards are developed by testing a large number of
people from a defined group. Descriptive statistics are then used to develop
the standards. A common norming method is to use percentile ranks (see
the sketch below). This is where a comparison in performance is made
between individuals. The comparison is governed by norms, e.g. they have
all followed the same curriculum and are of the same age.
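
A minimal sketch of a percentile-rank computation is given below (Python;
the scores and the function name are illustrative assumptions, not part of
the notes):

def percentile_rank(norm_group_scores, raw_score):
    # Percentage of the norm group scoring at or below the raw score.
    at_or_below = sum(1 for s in norm_group_scores if s <= raw_score)
    return 100 * at_or_below / len(norm_group_scores)

norm_group = [34, 41, 45, 50, 52, 55, 58, 60, 63, 70]   # hypothetical norm-group scores
print(percentile_rank(norm_group, 58))   # 70.0 -> this pupil scored at or above 70% of the group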
In summary norm-referenced standard involves:
i. Comparing individual performance to the performance of the group.

ii. Grading on a normal curve. Norm-referenced evaluation test items
are chosen to produce variability among test scores so that a
normal curve can be derived.
iii. Plotting all scores for a group against the number of students who
achieve each score and any individual‘s standing is interpreted
according to its variation from the group norm.
iv. Used for determining an individual‘s relative position in a group.

Criterion-Referenced Standards

A criterion-referenced standard or test is used to determine if someone
has attained a specific level [e.g. breaking the world record in the 100
metres].
CRS is also referred to as objective-referenced or competence-based
evaluation.
A criterion-referenced standard is a predetermined standard of
performance that shows the individual has achieved a desired level of
performance. It is unlike a norm-referenced standard in that performance
of the individual is not compared with that of other individuals, but rather
just against the standard. That is:
 Criterion referenced tests provide information about a student‘s
level of proficiency in or mastery of some skills or set of skills.
 It tells us whether a student needs more or less work on skills, but
says nothing about the student‘s performance relative to other
students.
 Grading on predetermined standards
Unlike a norm-referenced standard, which uses a continuous variable, a
criterion – referenced standard is a dichotomy. Terms such as pass-fail,
master-non-mastery or positive negative are used to describe the
dichotomy. A positive test means the presence of the disease and a
negative test means the absence of the disease.
A nursing examination is a criterion-referenced test. A driving test is also
a criterion-referenced test. The criterion-referenced test is a mastery
test, designed to establish how many pupils have achieved a certain


standard or whether an individual pupil has performed a given task.
Criterion – referenced test reliability examines the consistency of
classification. For example, what percentage of people was consistently
classified as passing or failing a test that has been administered two
times?
Criterion – referenced test validity refers to the accuracy of the
classification. That is, are the participants who are classified as passing
or failing by the test classified correctly when compared to their true
state?

The histograms shown below illustrate the different characteristics of the
score distribution in norm- and criterion-referenced assessment. The
norm-referenced assessment has distributed the candidates over the
range of marks or grades available, while the criterion-referenced
assessment has provided a measure of the level of mastery attained by
the group.

In summary, a criterion-referenced standard involves:

i. Comparing an individual's performance with absolute standards - a
specification of competence in a given field.
ii. Independent of the group performance. Performance of the
individual is not compared with that of other individuals or the
group, but rather just against the standard.
iii. Interpretation of an individual‘s score by comparing his/her
performance to empirically derived present standards of
competence.
iv. Derivation of a normal curve in CRS is unnecessary because
discrimination between students is not important

KNUT criticized the wrong use of criterion-referenced standards to grade
the 2016 KCSE.

(i) Norm-Referenced

[Histogram: percentage of pupils (0-100%) plotted against marks
(10%-100%); the scores are spread across the whole range of marks.]

(ii) Criterion-Referenced

[Histogram: percentage of pupils (0-100%) plotted against marks
(10%-100%); the scores are bunched towards the upper (mastery) end.]

The criterion referenced assessor would be reasonably pleased with the
distribution of the scores because his intention was to show whether
pupils had learned well enough to attain the objectives. The norm
referenced assessor prefers to set his test to ensure a wider spread of
scores over a broader band of objectives, such as is necessary for grading
results in public examinations. He will be able to tell whether one student
is either more or less expert than another, but cannot in the end say what
aspect of the subject matter has been mastered or expertise acquired. In
other words, he will not be able to say whether this or that objective has
been attained.

LECTURE 6
Welcome to lecture 6.

TESTING CODE OF ETHICS


TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Describe ethical issues applicable to testing.

Tests can be useful tools, but they can also be dangerous if misused. It is
our professional obligation to ensure that we use tests as accurately and
as fairly as possible.
As in research, testing requires that the test administrator and test user
adhere to the following code of ethics and standards.
1. Consent from parents / guardians.
2. Consent from school and local administration. These are called gate
keepers.
3. Be conducted in a fair and ethical manner which include:
 Security of testing materials before, during and after testing to
ensure fairness. KNEC hires police to guard exams materials.
 Security of scoring.
 Confidentiality.
 Testing to cover materials taught.
 Training staff on testing and scoring.
 Using tests that are developmentally appropriate.
 Interpreting results within acceptable norms. There must be a
rationale for decision based on test scores.


 Consideration of political impact of the publication of the test
results.
4. Establishment of conditions for test administration that enable all
examinees to do their test.
5. Competence to administer each test used for decision-making
purposes.
6. Anonymity of the identities of the test participants.
7. Administration of culture free or culture fair test.
8. Administration of non-harmful tests.
Compliance with Secrecy Act of 1968. This Act applies to those
handling public examinations at the stage of development,
production/printing, administration and scoring.
Section 39 of KNEC Act 2012, also prohibits an officer of KNEC from
disclosure of information related to the examinations.
9. Declaration of Conflict of Interest
What happens if you are a KNEC official and you have a child or
a spouse who is a candidate in such an examination? You are
required by Section 36 of the KNEC Act 2012 to disclose or declare
your interest to the Council as soon as practicable, and you cease to
perform such duties.

LECTURE 7
Welcome to lecture 7.

PSYCHOMETRIC CHARACTERISTICS OF A
GOOD TEST

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 List and describe the characteristics of a good test.
 List the factors that affect validity of a testing
instrument.
 List the factors that affect reliability of a testing
instrument.
 Explain the different ways of estimating reliability.

WHAT MAKES A TEST GOOD?

Achievement tests, as measuring instruments, have unique characteristics
by which their usefulness is judged. One principle, however, comes first
in any discussion of what makes a test good. Even the very best test is
good only when used for its specific purpose with the kinds of students for
whom it was intended. A good test for standard 6 is a bad test for
standard 3.
There are two broad considerations of a ―good test‖, namely:
(i) Practical Considerations
From practical standpoint, a test can be said to be good if it has the
following psychometric characteristics:

 Relevance to the target students.


 The clarity of the instructions for administrating the test;
 The clarity of the guidelines for interpreting the results;
 Offers economy in the time it takes to administer, score, and
interpret it. Imagine a test that takes 8 hours.

(ii) Technical Considerations

From a technical standpoint, a test can be said to be good if it has the
following qualities:
 Validity
 Reliability
 Objectivity
 Difficulty and Discrimination
 Comprehensiveness
 Efficiency
 Fairness
 Norms

1. VALIDITY
A test is said to be valid when it measures what is intended to measure.
That is, a good test measures what it is intended to measure.
What are you looking for when you are establishing the validity of an
instrument? You are looking for:
i. Trustworthiness of the instrument. Is the instrument trustworthy?
ii. Credibility of the instrument. Is the instrument credible?
A valid classroom test measures what has been taught (or should have
been taught). There are several aspects of validity:
These are:
 Content validity
 Construct validity
 Face validity
 Concurrent validity
 Predictive validity
Two of these validities are of particular importance with respect to
teacher-made tests: content and construct validity.

Content Validity
This is the most important validity for practicing teachers. It measures the
extent to which a test adequately covers the syllabus to be tested.
Content validity refers to the extent to which a test ―covers‖ the content.
 If the test does not cover the content that has been taught, then it
is not valid.
 An essay test intended to measure knowledge is likely to lack
content validity because of the severe limitation on the number of
topics which can be included in one such test. In other words, the
sample of possible learning is small. Hence essay test lacks content
validity. It lacks content balance.
 A valid test provides for measurement of a good sampling of
content and is balanced with respect to coverage of the various
parts of the subject.

FACTORS THAT AFFECT CONTENT VALIDITY

To have good content validity of a test, consideration must be given to:
(i) Length of test
A short test could not adequately cover a year‘s work. A syllabus needs
to be sampled and a representative selection made of the most important
topics to be tested.

(ii) Topic Coverage

Test questions are prepared in such a way that they reflect the way the
topic was treated during the course. How many hours were spent teaching
the topic - 8, 4, or 2? More test items should be drawn from topics where
more hours were spent.
(iii) Test blueprint or specifications

Better content validity can be achieved by the use of a specification grid.
The teacher has to decide on the weighting (% of the total marks) to be
attached to each ability and the most suitable number of questions;
namely:
 Recall questions. How many?
 Knowledge questions. How many?
 Application question. How many?
 Analysis questions: How many?
 Evaluation questions: How many.
We will discuss more in test construction lecture.
(iv) Test Level
Another aspect of validity that is of concern to teachers has to do with the
level of the test. Example: a test on Kenya's colonial history. A single test
of Kenya's colonial history is not equally valid for learners/students in:
[What strategies did freedom fighters use to destabilize the colonial
regime?]
 Form 3
 Standard 8
 University
Why?
 Course objectives are different
 Course coverage is different.
 Learners‘ abilities are different.
Thus validity is specific to purpose, to subject matter, to objectives; and
to learners / students; it is not a general quality.
(v). Teacher‟s Bias
A teacher is human and this means the test may be open to human
errors. The teacher may influence the outcome of the content of the test
because of his/her expectations regarding the results.

In primary schools, where teachers are compared at the end of the term
on the performance of their classes, teachers may set easier questions for
their classes in order that their pupils perform better.
Construct Validity
This is another aspect of validity that is important in teacher-made tests.
It refers to the kinds of learning specified or implied in the course/learning
objectives. That is, it is based on the learning objectives.
For example, if the course learning objective specified that at the end of
the course the learner must:
i. Identify four characteristics of living things, test tasks are required
to measure identification of these characteristics. Each kind of
learning objective must be tested to provide a valid measurement of
achievement.
ii. Identify the methods freedom fighters used to destabilize the colonial
regime.
A test which measures only knowledge (e.g. ability to recall important
historical events) must lack validity if the objectives specify other learning
objectives.

Face Validity
 The test should look as if it is testing what is intended to test. This,
however, is the starting point.
 Describes how well a measurement instrument appears to measure
what is intended to measure as judged by its appearance, what it
was designed to measure. For example, a test of mathematical
ability would have face validity if it contained math problems.
 A test is said to have face validity if it appears to be measuring
what it claims to measure.
For example:
 If we are trying to select pilots from highly trained personnel,
face-valid tests of rapid reaction time will ensure full cooperation,
because subjects believe them (the tests) to be valid indicators of
flying skills.
 If, however, a test required them to make animal noises or add
up numbers while distracted by jokes, many subjects would
refuse even if the tests were valid. They will think this is not an
appropriate test for them (pilots).
 If you set a test for Standard 3 or Form 4 and ask another
teacher of Standard 3 or Form 4 to look at whether it is
appropriate for the level, the expert opinion provided is taken as
face validity. On the face of it, the test looks appropriate for
Standard 3 or Form 4.
 Face validity is a weak form of validity in that an instrument may
lack validity.

Concurrent Validity
 This is where test results are compared with another measure of the
same abilities taken at the same time or about the same time. For
example, comparing mock results and actual KCSE results. These
examinations are taken at about the same time - one in July (Mock)
and the other in November (KCSE). In the 1970s, mock results were
used to select students to join Form 5 in January, before the KCE
results were released, on the belief that mock was a good measure
of the final examination, i.e. had good concurrent validity.

[Illustration: two measures, O1 and O2, taken at about the same time
should agree - both high, or both low.]

 A test is said to possess concurrent validity if it can be shown to
correlate highly with another test of the same variable which was
administered at the same time or about the same time.

Predictive Validity
 This is where test results are compared later with another criterion,
such as success in a particular job or in higher education, e.g. KCSE
Grade D being a good predictor of doing well in the Police Service.
 Predictive validity is good support for the efficiency of a test.
For example,
(i) Good KCPE results predicting good KCSE results or
(ii) Good KCSE results predicting good GPA at the university.

2. RELIABILITY
When you say that a friend is reliable, what do you mean?
Reliability refers to the degree to which a test or an assessment tool
produces consistent results. A reliable test gives consistent or dependable
results. That is, a good test, a good measuring tool or instrument is
reliable. That is, with repeated testing, each student will maintain about
the same relative rank in his/her group (or achieve about the same test
scores) each time he/she takes the test.
X1, X2, X3, X4, X5 = Y1=Y2=Y3=Y4=Y5

It also refers to a situation where, if the same test is marked by different
examiners, or even when the same examiner marks the test at different
times, the test will give similar results.
E1, E2, E3, E4, E5 = same scores.

If the same pupils take a vocabulary test twice within a short period of
time, their scores on the two occasions should be similar or the same.
FACTORS THAT AFFECT RELIABILITY

Factors that lower the reliability of a test include:

(i) Sampling error
Since the items of a test constitute a sample of the students' knowledge,
they are subject to sampling error. What is a sampling error? It is the
deviation of a sample from its population. If a test of 50 items is given to
a class, then followed by a second set of 50 different items, there will
undoubtedly be some differences in the two scores obtained by some
students. Thus a single test score, based upon a sample of test items, is
not a perfectly dependable (or reliable) measure of achievement.

Population (Total coverage)

Sample (Subset of pop.)

Suppose 15 topics were covered in the course (N = 15):

Sample I: mean of results of the first test of 50 items = 78
Sample II: mean of results of the second test of 50 items = 75
Sample III: mean of results of the third test of 50 items = 81
In all three samples you get different means. All these averages estimate
the population mean, μ. The sample means are different only because
different test items were drawn from the population (the topics covered).
This is called sampling error.

The bigger the n, the closer it is to N. The larger the number of test items
developed or drawn from the course coverage, the better the test. Tests
with few items are low in reliability because they fail to cover the course
adequately.
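
The effect of sampling error can be illustrated with a small simulation. The
sketch below is illustrative only (Python with numpy is an assumption; the
item pool and the pupil are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical pool of 300 course items; True = the pupil would answer correctly.
item_pool = rng.random(300) < 0.75      # the pupil truly masters about 75% of the course

for draw in range(3):
    test_items = rng.choice(item_pool, size=50, replace=False)   # one 50-item test
    print(f"Test {draw + 1}: score = {test_items.mean():.0%}")

# Each 50-item test gives a slightly different score even though the pupil's
# mastery is fixed; a longer test samples the course more fully and varies less.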

(ii) Chance factors


Chance factors such as guessing contribute to lack of reliability in the case
of true false items and (to a lesser extent) of multiple – choice items.
There is more guessing in true false test items. The more a student
guesses, the less likely he is to obtain the same score if the test is
repeated; hence the reliability of a test is low if many answers are
guesses.

(iii) Test Difficulty

If a very easy test is given to a class, the scores will tend to bunch at the
high end of the scale; differing levels of achievement cannot be
distinguished. Scores on a very difficult test will tend to bunch at the low
end and similarly obscure the differing levels of achievement. If the test
is too easy or too difficult, the scores will not be spread out; hence test
difficulty is related to reliability. In short, items which are poorly written,
and tests where too many items are very easy or very hard, lower the
reliability of a test.

[Figure: three score distributions labelled (a), (b) and (c).]
A positively skewed distribution (c) reflects a very difficult test while a
negatively skewed distribution (a) reflects an easy test.
(iv) Length of a Test
The length of a test affects reliability. A very short test (of five items, for
example) cannot spread the scores sufficiently to give consistent results.
Five items are too few to provide a reliable measure. In general, the
longer the test, the more reliable.
(v) Erratic or inconsistent Marking/Scoring
If the markers are erratic in scoring, the award of scores will be unreliable.
Inconsistency in scoring leads to low reliability of results. This is why KNEC
trains examiners on marking, so that if the same test or script is marked by
different examiners, or even when the same examiner marks the same test
at different times, the scores will be similar. See the article of September
13, 2014 on "Train every teacher on setting, marking exams".


(vi) Testing environment
Reliability requires a noise-free environment. If the testing environment is
noisy, it affects reliability.

(vii) Number of performers. Large numbers are required to establish
reliability.
(viii). Forgetting and Fatigue. Between first testing and second testing
he/she may forget because of interruption between the two situations or
become tired.
(ix) 'Halo effect'. Examiners allowing their knowledge of the candidates'
index numbers, ability, position in class or ethnic background to influence
their awarding of marks or their judgment. [See the article below from the
Daily Nation of Sept. 13, 2014.]
(x) 'Leniency/severity errors' are common in essay tests. This is where
examiners/raters give ratings that are consistently too high or too low.
(xi) 'Error of central tendency'. Some examiners tend to avoid extreme
categories, or giving high scores, concentrating instead on categories
around the midpoint or average of the scales. This is called the error of
central tendency. It also occurs in essay tests.

“Halo Effect” and subjectivity

ESTIMATING THE RELIABILITY OF AN INSTRUMENT/TEST
Test reliability can be estimated by the use of several procedures. The
following are the statistical procedures/methods for estimating the
reliability of an instrument/test.
1. Test-Retest Reliability

The procedure for estimating test-retest reliability utilizes the method of
administering the same test twice to a group of students and correlating
the two sets of scores. A correlation coefficient suggests the amount of
agreement between the two sets of scores. The higher the (positive)
correlation derived, the higher the reliability estimate.
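
A minimal sketch of this calculation is given below (Python with numpy is
an assumption; the scores are hypothetical):

import numpy as np

first_sitting = np.array([55, 62, 70, 48, 80, 66])    # scores on the first administration
second_sitting = np.array([57, 60, 72, 50, 78, 68])   # scores of the same pupils on the retest

r = np.corrcoef(first_sitting, second_sitting)[0, 1]  # Pearson correlation coefficient
print(f"Test-retest reliability estimate: {r:.2f}")   # close to +1 means consistent results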
2. Parallel Forms Reliability

This is also called 'alternative forms reliability' or 'equivalent forms
reliability'.

An estimate of parallel forms reliability involves administration of two


forms of the same test to the same participants. If the scores on the two
forms of the same test are identical or nearly identical, parallel forms
reliability has been demonstrated. Parallel forms of a test are developed
in such a way that no matter which form (variation) of the test a person
completes, the score should be the same.

One student can take one form of a test, and the students sitting to the
right and left could have different variations of the same test. None of
the three students would have an advantage over the others; their
respective scores would provide a fair comparison of the variable being
measured. For example, measuring mathematical ability of standard 8
pupils.


Form A: Mathematics Test          Form B: Mathematics Test

The two forms A and B are developed from the same curriculum and
subjected to the due process of the development of a good test, and the
two tests can be taken concurrently or at different times.

3. Inter-Scorer Reliability

This is also called ―inter-rater reliability‖. An estimate of inter-scorer
reliability involves two independent scorers scoring the qualitative content
of the learners‘ responses, e.g. learners‘ essays. The two scores from the
two independent examiners are statistically compared. If there is high
agreement, then the scoring is reliable. This is used by KNEC in essay
tests during co-ordination of examiners (dummy marking) and during the
marking of live scripts.

4. Inter-Observer Reliability

Where the measurement involves observation rather than a paper-and-pencil
test, an estimate of the reliability of the measurement process is
required. Inter-observer reliability estimates the degree to which two or
more observers agree in their measurement of a variable. For example,
the variable to be measured through observation might be ―aggression‖ among
pre-school boys.

Example: Two observers went to Kilimo Nursery School to observe
aggression among the pre-school boys and recorded their observations as
follows:

Time    Observer 1    Observer 2    Total Observations    Variance
10.00   /             //            3                     1
10.01   /             /             2                     -
10.02   //            /             3                     1
10.03   //            ///           5                     1
10.04   ///           //            5                     1
10.05   /             /             2                     -
10.06   //            //            4                     -
10.07   /             /             2                     -
10.08   //            ///           5                     1
10.09   //            //            4                     -
Total   17            18            35                    5

Aggression was measured by such indicators as: kicks, destroys, fights,
hurts, slaps.

Inter-observer reliability = (No. of agreed observations / Total No. of observations) x 100

Inter-observer reliability = (30 / 35) x 100 = 85.7%

The higher the % agreement between the observers or interviewers, the


greater the reliability.
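
As a minimal sketch (illustrative only, using the Kilimo Nursery School tallies above), the percent-agreement estimate can be computed as follows; the list names are assumptions for the example.

# Python sketch: inter-observer reliability as percent agreement
observer_1 = [1, 1, 2, 2, 3, 1, 2, 1, 2, 2]   # tallies per one-minute interval
observer_2 = [2, 1, 1, 3, 2, 1, 2, 1, 3, 2]

total = sum(observer_1) + sum(observer_2)                            # 35 observations in all
disagreed = sum(abs(a - b) for a, b in zip(observer_1, observer_2))  # 5 (the "variance" column)
agreed = total - disagreed                                           # 30 agreed observations

print(round(agreed / total * 100, 1))   # 85.7 (%), the same figure as above
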


5. Inter-Item Reliability

Another way of estimating reliability is to assess inter-item reliability.
Inter-item reliability is the extent to which different parts of a
questionnaire or test designed to assess the same variable attain
consistent results. Scores on different items designed to measure the
same construct should be highly correlated.

There are two approaches to estimating inter-item reliability. These are:

i. Split Half Reliability

This involves splitting the test into two halves and computing coefficient
of reliability between the two halves (odd numbered questions and even-
numbered questions).

This is one of the measures of internal consistency of a test; it is
estimated by comparing two independent halves of a single test. This
procedure gives the correlation between scores on the odd-numbered and
even-numbered items of a single test.

Split-half reliability estimates are widely used because of their simplicity.
The procedure is that you split the test into two halves and compute the
coefficient of reliability between the two halves.
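
A minimal sketch of the split-half procedure is shown below (illustrative; the 0/1 item scores are made-up). The Spearman-Brown step at the end is standard practice for scaling the half-test correlation up to full length, although it is not discussed in these notes.

# Python sketch: split-half reliability (odd-numbered vs even-numbered items)
from statistics import correlation   # Python 3.10+

item_scores = [                      # rows = students, columns = items (1 = correct, 0 = wrong)
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
]

odd_half = [sum(row[0::2]) for row in item_scores]    # scores on items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in item_scores]   # scores on items 2, 4, 6, 8

r_half = correlation(odd_half, even_half)
full_length = 2 * r_half / (1 + r_half)               # Spearman-Brown correction
print(round(r_half, 2), round(full_length, 2))
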

ii. Coefficient Alpha (α)

This is also called internal consistency or Cronbach alpha (α). Cronbach
alpha involves evaluating the internal consistency of the whole set of
items.

 An evaluation of internal consistency is most often used where we
have created a multiple-item questionnaire to measure a single
construct variable like intelligence, need for achievement or anxiety.
The individual items on a standardized test of anxiety will show a high
degree of internal consistency if they are reliably measuring the
same variable.

 Internal consistency is high if the questions measure the same variable.

6. Kuder-Richardson (K-R). This measures internal consistency. It is
widely used to estimate test reliability from one administration of a
test. The reliability coefficient is determined from a single administration
of a test through a study of score variances.
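
The sketch below (illustrative, not from the notes) computes coefficient alpha for a small set of items; for items scored 0/1 the same computation gives the Kuder-Richardson (KR-20) estimate, since KR-20 is the dichotomous special case of alpha. The score matrix is made-up.

# Python sketch: Cronbach's alpha / KR-20 from a single administration
item_scores = [              # rows = students, columns = items (0 = wrong, 1 = correct)
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 1],
]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

k = len(item_scores[0])                                               # number of items
item_vars = sum(variance([row[i] for row in item_scores]) for i in range(k))
total_var = variance([sum(row) for row in item_scores])               # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(round(alpha, 2))   # closer to 1.0 means higher internal consistency
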

DIFFERENCES BETWEEN RELIABILITY AND VALIDITY

i. Can a test/instrument be reliable, but not valid?


ii. Can a test/instrument be valid, but not reliable?

1. Reliability

Reliability is always a statement of probability.

Example 1: Inter-scorer reliability. This is like asking the question: What is
the probability that two independent examiners will agree on the answers
given by a candidate?

a) Yes, probability will be high if the two examiners are trained so that
they can mark consistently.
b) No, probability will be low if the two are not trained.

Example 2: Reliability of KZY 788. What is the probability that my old


KZY 788 will reach Nairobi from Njoro without breaking down?

i. Yes, probability is high, if the car is well serviced – servicing


increases reliability.
ii. Probability is low if the car is not well serviced.


Example 3: You know your actual weight to be 92 kg. You take your
weight three times in a day using the same machine and you find:

Morning = 85 kg

Lunch = 92 kg

Evening = 87 kg

Decision: Your scale is unreliable. The scale should read 92 kg at all times,
whenever you step on it.

Example 4: If your three consecutive weights are:

Morning = 85 kg

Lunch = 85 kg

Evening = 85 kg

Decision: The scale is reliable (consistently giving 85kg), but not


correct/valid. The scale does not have to be right/correct to be reliable;
it just has to provide consistent results. This instrument is reliable without
being valid for the purpose of giving accurate weight.

Which is more important in a test? Validity or reliability?

“A valid test is always reliable but a reliable test is not


necessarily valid”
 A good test is a valid test, and a test is considered to be valid if it in
fact measures what it purports to measure. A test of intelligence is
a valid test if truly measures intelligence.
 If the instrument/test is valid, it must be reliable. If every time
pupils take the same test, they get different results, the test is not
able to predict anything. However, if a test is reliable, that does not
mean that it is valid.

 Reliability is necessary, but not sufficient, condition for validity. A


valid instrument must have reliability, but reliability in itself does
not ensure validity: that is, reliability is said to be necessary but not
sufficient condition for validity

2. Validity

Validity is a statement of suitability: whether something meets the
requirement or is fit for the purpose. E.g. Is the test meeting the
requirements or objectives of the course?

Illustration by use of a shotgun

 The purpose of a gun is to kill.
 A shotgun is valid if it kills. However, it might not be reliable
because it does not hit the target all the time.
 If every time we use the same gun to shoot at the target, we get
different results (not hitting the bull‘s-eye), the gun is not able to
predict the hitting of the bull‘s-eye. Hence, the gun is valid with
respect to killing (it does what it is intended to do), but not reliable
in hitting the bull‘s-eye.

[Figure: archery targets A, B, C and D, seeing validity as the archery target and reliability as shots at the target]

Target A: Poor validity, but good reliability.
Target B: Shots within the area but not hitting one point.
Target D: Good validity and good reliability.

Source: Google, validity as archery target

EXAMPLES: Where the test is reliable but not valid.

1. Homework
Homework giving consistent results, but the homework itself is not
relevant to the classroom lesson (what was taught). This means the test is
reliable but not valid.

2. Measuring Intelligence
Suppose you want to measure the intelligence of smart students and you
decide to use a tape measure to measure the circumference of their heads in
centimetres, and you consistently obtain the same results/values for the
heads of these students; that is, the tape measure is reliable. But using a
tape measure is not a valid measure, because we do not use a tape
measure to measure intelligence. We use an intelligence test to measure
the intelligence of people.

Which is more important in a test? Validity or Reliability?



Validity of a test is more important than reliability


Note:
 A test may be reliable (give dependable results) without being valid
for any purpose, but the reverse is not true.
 A test may give dependable scores, well spread out, but fail to
measure the kind of achievement which should be measured.
However, an unreliable test cannot be valid, since its results are
undependable.
 Validity and reliability differ in that the former refers to the purpose
of measurement and the latter refers to the consistency or dependability
of the measurement. Both are important in the design of a
classroom test.


3. OBJECTIVITY

A third characteristic of a good test is objectivity or freedom from


subjective judgment in scoring answers.
 A student‘s score in an objective test will not be affected by the mood,
identity, judgment or personality of the person who computes
the score (see the Daily Nation article of September 13, 2014, ―No
need for names on examination papers‖, and the paper on the halo effect).
The argument is that examiners may mark candidates down on the basis
of ethnicity and class ranking / index numbers – giving more marks
to those who have index 001 and fewer marks to those with lower
index numbers. If this situation happens, then objectivity is lost.
 There is a new proposal to give candidates numbers that do not
reflect their place of origin or school.
 True-false and multiple-choice tests are completely objective; the
correctness of response is determined by comparing answer with a
scoring key.
Charles Darwin was a socio-biologist? T/F the answer is F; Isaac
Newton was a physicist? T/F. The answer is T.
 Essay examinations are the least objective, since each answer must
be read, interpreted, evaluated and (usually) scored.
 Inconsistency in scoring leads to low reliability of results.

4. DIFFICULTY AND DISCRIMINATION


 Reliability of a test depends in part upon its difficulty. Dependable
results are obtained when the test yields a wide spread of scores
and is neither too difficult nor too easy for the students being
tested.
 Validity depends, in part, upon test difficulty. If an achievement test
is made so difficult that none of the students can answer any of the
questions, the test is completely invalid! To be valid, a test must
measure what it is intended to measure – and must be neither too
difficult nor too easy.

5. COMPREHENSIVENESS
For a test to be comprehensive, it should sample major lesson objectives.
It is neither necessary nor practical to test every objective that is taught
in a course, but a sufficient number of objectives should be included to
provide a valid measure of student achievement in the complete course.

6. EFFICIENCY
A good test is efficient. Efficiency in measurement requires saving of time
for the students as well as for the teachers.
 Efficient items require the least reading and responding time
(objective items and yet providing wide sampling of content).
 The use of essay questions to measure knowledge is inefficient; too
few specifics can be tested in a given time.
 Saving time on scoring is a measure of efficiency.
 Providing clear directions on answering questions.
 Free of ambiguity (unambiguous).
Tests can be designed for efficiency
(i) By selecting the form of items which will measure what should be
measured in the shortest time, and
(ii) By constructing the test so as to save time for both the students
and the teachers.

7. FAIRNESS AND CONDITIONS OF ADMINISTRATION
A good test must be fair to all those who are taking the test. The test
must be:


1. Gender sensitive. Items should not reflect things that favor one
gender e.g. boys. Studies show that if a test is set on activities that
boys normally do they will perform better than girls.
2. Taken under the same condition by all candidates e.g. free of noise.
3. Free of locational benefits. For example, the 1983 CPE Guided English
Composition:
In the 1983 CPE guided composition, candidates were given a photo
which showed:
 A matatu that had hit a pupil at a zebra crossing.
 A policeman on site, and
 A crowd of people around.
Candidates were required to write a composition on what was happening.
This scene is unfair to rural pupils who might not have seen a policeman
or might not even think that a matatu can cause an accident.
4. Culture-fair or culture free. That is there should be no cultural
biases / influences in the test. If a multiple-choice question asks:
Children are named according to:
A. Religious practices
B. Natural incidences
C. Random sampling of names
D. Day of the week
All the above four choices are possible answers. This is both a bad
question and an unfair question. The question is culture-loaded.

8. NORMS FOR SCORING AND INTERPRETATION


A good test is a test that contains norms or normative data. Simply
stated, norms provide us with a standard with which we can compare and
interpret the results; a test can provide us with a sample of behavior, but
it is normative information that gives meaning to that behavior sample.
 Basis of comparison and interpretation.


Illustration
Suppose on a ―Mock Test‖, two students scored:
Student A: Biology-72
Student B: Physics- 70
 What could you conclude?
 The answer is NOTHING
 You need to know test‘s reliability, validity and norms. The norms
would tell how student A and B scored relative to other students like
themselves (norm group).
 We might find that both student A and B scored in the average, or
that one scored below average and one above. The point here is
that interpretation of the test scores cannot be made in a vacuum.
A good test must provide direction of scoring and interpretation.
 If you are given the class mean or class mastery level, then you can
make a better judgment of the performance of students A and B.

[Figure: normal distribution of scores from -3 to +3; low scores on the left, average in the middle, high scores on the right]

Unless it has been established that the average Mock test results for the same
age-group over the years have been:
 Biology 85
 Physics 65
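
As a minimal sketch (illustrative), the comparison of the two Mock Test scores against these long-run norms can be expressed as follows; the dictionaries simply restate the figures given above.

# Python sketch: interpreting scores against norm-group averages
norms = {"Biology": 85, "Physics": 65}    # long-run Mock averages for the age-group
scores = {"Biology": 72, "Physics": 70}   # Student A took Biology, Student B took Physics

for subject, score in scores.items():
    position = "above" if score > norms[subject] else "below"
    print(f"{subject}: {score} is {position} the norm of {norms[subject]}")
# Biology: 72 is below the norm of 85  -> Student A is below average
# Physics: 70 is above the norm of 65  -> Student B is above average
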


LECTURE 8
Welcome to lecture 8.

BLOOM‘S DOMAINS OF LEARNING AND TAXONOMY OF EDUCATIONAL/LEARNING OBJECTIVES

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Explain the term taxonomy.
 Distinguish between and discuss three types of
Bloom‘s domains of learning.
 Explain the six categories of cognitive domains.

Taxonomy means classification. Taxonomies of educational objectives


serve as sources for deriving instructional objectives and classifying
learning behaviours. The taxonomies are classified into three domains of
learning, namely:
 The Cognitive domain which deals with intellectual development
(Bloom, 1956);
 The Affective Domain which deals with the development of
feelings, values and attitudes. Issues of integrity and corruption are
within the affective domain (Krathwohl, 1964); and
 The Psychomotor domain which deals with motor / practical /
manual skills development (Simpson, 1972).
From the Ecology of Human Development perspective, the diagram
below summarizes the influences of the children‘s learning of values and
attitudes.

 Child vs. John Locke‘s concept of tabula rasa (blank slate)
 Society plants the hate virus in children
 From the diagram, where do you put Kenyan leaders‘ hate speeches?

Of the three, the Cognitive domain is often the most emphasized


especially in formal school settings but the other two are also used
whenever the subject to be learnt can be used to teach skills, attitudes,
values and feelings. Here, below, are some action verbs which
demonstrate observable behaviors under the three domains of learning.
For cognitive domain: copy, classify, apply, analyse, define, identify,
compare
For affective domain: accept, discuss, dispute, join, defend,
For psychomotor domain: adapt, adjust, assemble, build, operate

SETTING EDUCATION OBJECTIVES BASED ON COGNITIVE DOMAINS

Blooms Six Categories of Cognitive Domains

Bloom‘s taxonomy contains six categories of cognitive skills ranging from


lower-order skills that require less cognitive processing to higher-order
skills that require deeper learning and a greater degree of cognitive
processing (see Figure 1 below).

Bloom and his colleagues categorized cognitive domain into 6 categories


from simple to complex or from factual to conceptual. Whenever you are
setting instructional objectives that are intended to measure cognitive
domain you apply the following Blooms 6 categories of cognitive domain.

1. Knowledge (recalling information). This is the lowest level of


objectives e.g. memorizing mathematics facts or formula, scientific
principles. Knowledge objectives require students to reproduce facts
e.g. formula for calculating area of a triangle.
2. Comprehension (translating, interpreting). Comprehension,
objectives require that students show an understanding of
information as well as the ability to use it. Examples include:
interpreting the meaning of a diagram, graph or parable.
3. Application (using principles or abstraction to solve novel or real
life problem). Application objectives require that students apply
theoretical knowledge or information to solve a real-life problem.
(KALRO research activities fall here)
4. Analysis (breaking down complex information or ideas into
simpler parts to understand how the parts relate or are organized).
Examples of analysis objectives include: contrasting schooling in
Kenya with education in Japan or Tanzania.
5. Synthesis (creation of something that did not exist before).
Synthesis objectives involve using skills to create completely new
products. Examples include designing a science experiment to
solve a problem, or coming up with an innovation, e.g. converting
ideas into a product.
6. Evaluation (judging something against a given standard).
Evaluation objectives require the students to make value judgments
against some criterion or standard.

Knowledge, Comprehension – testing lower-level abilities
Application, Analysis – testing medium-level abilities
Synthesis, Evaluation – testing higher-level abilities

In testing try to balance test items that measure lower level abilities and
those that measure higher level abilities.


Bloom‘s Six Categories of Cognitive Domain

ACTION VERBS TO USE FOR EACH COGNITIVE DOMAIN

Knowledge: copy, define, duplicate, label, list, match, memorize, name, order, quote, recall, recognize, record, repeat, reproduce, select, tell, underline

Comprehension: classify, describe, explain, express, identify, indicate, interpret, locate, outline, relate, report, respond, restate, review, rewrite, translate

Application: apply, change, demonstrate, discover, dramatize, draw, employ, extend, illustrate, manipulate, modify, operate, perform, predict, produce, show, solve, use

Analysis: analyse, appraise, calculate, categorize, compare, contrast, criticize, diagram, differentiate, distinguish, examine, experiment, explain, illustrate, question, test

Synthesis: arrange, assemble, collect, combine, compose, create, design, devise, formulate, manage, manipulate, modify, organize, originate, plan, prepare, propose, set up

Evaluation: appraise, argue, assess, attach, choose, compare, conclude, defend, estimate, evaluate, judge, justify, predict, rate, score, select, support, value

REVISED BLOOM‘S TAXONOMY [NOT FOR DISCUSSION]

Based on findings of cognitive science following the original publication, a


later revision of the taxonomy changes the nomenclature and order of the
cognitive processes in the original version. In this later version, the levels
are remember, understand, apply, analyze, evaluate, and create.
This reorganization places the skill of synthesis rather than evaluation at
the highest level of the hierarchy. Furthermore, this revision adds a new
dimension across all six cognitive processes. It specifies the four types of


knowledge that might be addressed by a learning activity: factual


(terminology and discrete facts); conceptual (categories, theories,
principles, and models); procedural (knowledge of a technique, process,
or methodology); and metacognitive (including self-assessment ability
and knowledge of various learning skills and techniques).

ACTION VERBS TO USE FOR AFFECTIVE DOMAIN

accept, attempt, challenge, change, commend, comply, conform, defend, discuss, display, dispute, follow, form, initiate, integrate, join, judge

ACTION VERBS TO USE FOR PSYCHOMOTOR DOMAIN

Adapt Duplicate Move Select

Adjust Fix Operate Service

Assemble Generate Perform Set up

Bend Grasp Pickup Shorten

Build Handle Point to Show


Calibrate Hear Practice Slide

Close Identify Press Sort

Combine Illustrate Pull Stretch

Construct Load Push Touch

Copy Locate Remove Transport

Design Loosen Repair Write

Diagram Manipulate Replace

Disconnect Measure Rotate

Draw Modify See

SELF ASSESSMENT EXERCISE TO BE DISCUSSED IN CLASS

Underline the verb phrases which best demonstrate observable
behaviours:

Know, State, Describe, Give examples of, Understand, Really know, Fully understand, Suggest reasons why, Explain, Evaluate, Be familiar with, Become acquainted with, Pick out, Distinguish between, Have a good grasp of, Appreciate, Analyse, Carry out, Summarize, Compare, Acquire a feeling for, Learn the basics of, Realize the significance of, Believe in, Demonstrate, Show diagrammatically

SAMPLES OF AIMS AND LEARNING OBJECTIVES

Learning objectives must be stated in action verbs.

Sample 1: Course in Literature in English

Aims: To develop skills in literary appreciation through reading novels.

Learning Objectives: After reading Chinua Achebe‘s novel, Things Fall


Apart, a learner should be able to:
 Explain the title of the novel
 Compare the character of Okonkwo with that of his father
 Give his/her opinion about Okonkwo‘s killing of Ikemefuna.

Sample 2: Course in Business Management


Aims: To help the non-financial manager use financial information in
planning, controlling and making decisions.
Learning objectives: At the end of this course, a learner should be able
to:
 Construct a cash flow forecast.
 Use a profit and loss account.
 Interpret a balance sheet.
 Describe ways of classifying expenditure.


LECTURE 9
Welcome to lecture 9.

INSTRUCTIONAL/LEARNING OBJECTIVES
AND LEARNING OUTCOMES

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Explain the concept of:
 Learning objectives.
 Learning outcomes.
 Explain why action verbs and SMART principle are
necessary in stating learning objectives.
 Write your own learning objective.

Whenever you plan for class instruction, you must set instructional
learning objectives of the course.
Refer to your EPSC 228 course objectives.

What are learning objectives? Objectives are specific statements of
what is to be accomplished and how well, and are expressed in terms
of action verbs; quantifiable, measurable outcomes. Objectives address
the questions:
 What the course/lesson/programme can do or expects the learner
to do at the end of the course or training.
 What are learning outcomes? What the learners are able to do
after exposure to a period of learning or training. Contextually
demonstrated knowledge, skills and values after the lesson/course.

How do we measure the achievement of course objectives? By testing the


learners.

[Diagram: two overlapping triangles, O (Objectives) and P (Performance)]

After teaching the course, you may have the following scenarios:
i. If the two triangles do not intersect, the course objectives were not met.
ii. The course objectives and performance may intersect at 10%, meaning 10% of the objectives were achieved.
iii. The course objectives and performance may intersect at 50%, meaning that 50% of the objectives were achieved.
iv. The course objectives and performance may intersect at 100% (in total congruence), meaning that 100% of the objectives were achieved.

1. What are the instructional/Learning objectives?


 Instructional objectives describe desirable outcomes – the skills,
knowledge, abilities or attitudes – learners should possess or
demonstrate after they complete the course or training. They are
statements of intent: what the course/programme expects learners to be
able to do. In short, they are tools for describing student outcomes.
 Learning objectives focus on student performance. They must be
measurable (seen or heard), expressed in action verbs, and must be
SMART. A learning objective is a statement that specifies in
behavioural (measurable – seen or heard) terms what a learner will
be able to do (performance) as a result of exposure to
instruction, that is, after completing the course / training. The
number of learning objectives will determine the content of the
subject matter to cover and the exercises / activities to include. Too
many objectives confuse the learner, and too few would not
communicate adequately what you intend for the learner. A rule of
thumb should be to formulate as many learning objectives as would
lead a learner to attain the knowledge, abilities and skills you intend
to cover in a given unit / topic.

2. Importance of instructional objectives


 Provides Direction to Learning. Why are instructional objectives
key to effective instruction? Instructional objectives provide both
the teacher and the learner with direction for instruction.
Instructional objectives focus on learning, not the teaching activity.
They state the instructional intent in terms of learning.
 Old saying: The importance of stating objectives has been
succinctly expressed in the statement that if people do not know
where they are going or what they are doing, they will be unable to
find out whether they have arrived or what they have achieved.
 Assist in Planning Instruction: Instructional objectives are
important for planning instruction in the following way:
They tell or inform the learners what is expected of them, and hence serve
as a guide for them. Objectives give direction to learning. Objectives help a
teacher plan instruction, guide student learning and provide criteria for
evaluating student outcomes.
 Guide Teaching: Objectives involve the determination of the media, materials
and procedures that are necessary to facilitate learning and the rational
sequencing of instruction.
 Facilitate Evaluation: Assist in determining the appropriate
methods for evaluation. [Evaluating students‘ learning activities,
instructions, students and curriculum]

3. Writing Instructional Objectives


In formulation of objectives, it will be helpful to ask such questions as:
(i) What will the student be able to do as a result of instruction? First,
determine the learning outcomes for the lesson. That is, what should the
learner be able to do when the lesson is over (outcome)?
(ii). How can they demonstrate achievement with pencil-and-paper
behavior?
(iii) How to define objectives.

Instructional objectives MUST meet two major criteria:

(i) They must be stated in ACTION VERBS.


(ii) They must be SMART. SMART is acronym for:

S = Specific: Ensure that the objectives are specific. Specification of the


end results of learning – the abilities which learners will have acquired
when they have achieved the objectives e.g. will be able to add 1 + 1 or
identify the characteristics of living things.

M = Measurable: Can the objectives you have set be measured?


Objectives must be expressed in behavioural terms – indicating what the
learner will be able to do at the end of the instruction.

A = Achievable: Are the objectives achievable or realistic?

R = Relevant: Are the objectives relevant to the topic or programme‘s


purpose?

T = Time-bound: What is the time-line for achieving the set objectives?
E.g. at the end of the course, training, semester or academic year?


Examples of poorly stated objectives

Course / instructional objectives provide the basis for test construction.


That is, the content and structure of a test must be based upon the
objectives of instruction. For example: Topic of osmosis. At the end of
the lesson the learners should be able to describe the process of osmosis.

Poorly stated objectives cannot be measured or fail to provide the basis


for measurement of pupil achievement. Examples of poorly stated
objectives are:

At the end of the course, learners must be able to

1. ―Know the important facts about osmosis‖.


 How can we determine when a learner ―knows?‖
 What must a pupil be able to do in order to demonstrate what
he knows?
2. To ―understand Newton‘s Law of Motion‖
 How can it be determined when a learner ―understands‖ or the
depth and breadth of his understanding?
 What must the pupils be able to do in order to demonstrate his
understanding?
3. To ―become familiar with the services rendered by the Teachers
Service Commission‖.
 How ―familiar‖ should a student be?
 How can we measure ―familiarity‖ with ideas or information?

4. Classification of Instructional Objectives


Teachers should classify objectives because the type of objectives dictates
the selection of instructional methods, materials, media and evaluation to
be used in the lesson. Objectives may be classified according to the
primary learning outcomes that take place. These learning outcomes


typically are classified into three domains / categories as discussed earlier


in the course:
 Cognitive
 Affective
 Psychomotor

LEARNING OUTCOMES
 Learning outcomes are what students are able to do after a period
of learning or lesson.

 Contextually demonstrated knowledge, skills and values after the


lesson/course.

 Change in behavior that an individual demonstrates after


instructional period.

 Ability of a learner to demonstrate knowledge, skills, attitudes that


he /she has acquired after undergoing a learning process.


LECTURE 10
Welcome to lecture 10.

TEST PLANNING
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Explain Why? What? How? Of a test

One of the most important aspects of the work of a Test Development
examiner or developer is his/her role in planning the test he/she works on
and in the development of detailed specifications for it. The test itself is only
the tool or instrument used to obtain data for measurement and
evaluation; the measurement cannot be valid nor the evaluation
meaningful unless the instrument which provides the data is appropriate
for its purpose. The development of an appropriate test begins with well-
defined testing objectives, followed by the translation of those objectives
into a blueprint from which the test can be built – the test specifications.

The importance of developing comprehensive test specifications cannot be


overemphasized. Basically, there are two major dimensions to the test
specifications:
1. The content to be sampled.
2. The abilities or levels of thinking to be tested.


Planning the Test


In planning a test, there are three basic questions which must be
answered: Why? What? How? The answer to each of these has
implications for the test specifications.

1. Why are we testing? In other words: what is the purpose of the


test? Who is to be tested? How are the results to be used? The
answers to these questions lead to general decisions on the range
of subject matter to be sampled, the kinds of abilities to be tested,
and the statistical distribution most useful in fulfilling the purpose of
the test. The requirements for national tests for selection, for
example, are quite different from those for a classroom test as part
of the learning-teaching process.
2. What should be tested? The answer to this question leads to the
selection of the specific content to be covered and the specific
abilities to be tested, including decisions on the relative emphasis to
be placed on the different topics and abilities.

A diagnostic test in a specific subject-matter area would call for


very detailed delineation of content, and a relatively narrow range
of abilities, whereas a survey type test covering a broad subject-
matter field would more likely have its content described in terms of
major topics to be sampled over a wide range of cognitive
processes. In so far as possible, for fairness and good coverage,
the test should be planned so that there are questions in each major
topic that are at different levels of difficulty and that require
different levels of thinking.

3. How should each element in the test specifications be


tested? The answer to this question leads to decisions on the
number and kinds of questions to be used. The purpose of the test,

the requirements of the testing program, the testing time available,


the problems of test administration and of score reporting and
feedback to schools, colleges and candidates – are all factors which
influence these decisions and place constraints upon them.

Developing Content and Ability Specifications


It is impossible to give detailed guidelines for how specifications
should be developed, because the procedures will vary with
different subjects. A useful technique is to ask such questions as:
i. ―What are the important things you would expect a person
who has studied this subject to know?‖
ii. ―What different types of thought processes should the
examinee be required to demonstrate?‖
iii. ―What skills should he/she have acquired?‖
iv. ―What is the relative importance of these various elements?‖


LECTURE 11
Welcome to lecture 11

TEST SPECIFICATIONS AND CONSTRUCTION

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Define table of specifications.
 Describe the role of table of specifications in test
construction
 List and discuss the steps/procedure in test
construction.
 Explain the merits and demerits of essay and objective tests.
 Briefly describe the guidelines for the construction of essay and objective tests.

What Content-Ability Specifications Look Like (Table of


Specifications)
Probably the most typical set of specifications is a two-dimensional grid
with the subject-matter topics listed on one axis and the abilities to be
probed on the second. Typically, the axes are orthogonal to each other;
that is to say, they are independent of each other to the extent that any
ability listed can be tested in connection with any of the content
categories.

The number, nature, and specificity of the categories will depend on the
purpose of the test. Although the two-dimensional grid for content and
abilities is most common, there will be cases in which three or more
dimensions are needed. For example, in history it may be important to ensure that
certain time periods are covered by stipulated proportions of the test
questions. Unfortunately, it is not feasible to put workable three- or more-
dimensional grids on paper, so it may become necessary to have two
different grids to be used in conjunction with each other.

Once the list of topics and abilities has been decided on, the next task is
to determine the relative emphasis to be given to each topic and ability
and to enter into each cell either a percentage of the test or the actual
number of questions to be set in that cell. It may be that certain
topics by their nature are essentially limited to certain abilities. The test
plan may also lead to insights into what has previously been untested, and to
ingenious solutions or new approaches to writing items that test what
may long have been considered untestable in an objective format.

What is a test? (Discussed earlier)

A test is a device which we can use to sample the candidate‘s / student‘s
behaviour. The common kind of test that the teacher is used to is the
paper-and-pencil (paper and pen) test, i.e. that in which the pupil is
required to write or mark his answers on paper. However, tests may
take various other forms. In some cases, the pupil may indicate his
answers orally (oral tests); in others he may be required to carry out
certain activities during which he is observed and scored by an observer.
- A test must be in harmony with instructional objectives and subject
content. To be sure that these are achieved, the preparation of a
test should follow a systematic procedure.

Test Construction Procedure


1. State general instructional objectives and define each instructional
objective in terms of specific types of behavior students are
expected to demonstrate at the end of the exercise.

2. Make an outline of the content to be covered during the instruction.


3. Prepare a table of specifications or test blue print which will
describe the nature of the test sample.
4. Construct test items that measure the sample of behaviour of the
candidates specified in the test blueprint.

1. STATING BEHAVIOURAL OBJECTIVES


Behavioural objective is also called performance objectives. This is a
statement that specifies what observable performance the learner should
be engaged in when we evaluate the achievement of the course objective.
Behavioral objectives must be stated in action verbs.
Different experts recommend different approaches to writing behavioural
objectives. One recommendation that is fairly simple to follow is that a
statement of behavioural objective should consist of four parts as follows:
(i) The learner (the pupil)
(ii) An action verb (states)
(iii) A content reference (e.g. four characteristics of living things)
(iv) A performance level
Written properly, the behavioural objective would read:
―The pupil should be able to state four characteristics of living things‖.
Given this kind of statement, it is easy to write a test to elicit the
behavior.
In stating the ―action verb‖ it is useful to note that certain types of verbs
are not appropriate. These are verbs that represent actions that cannot be
readily observed or that have ambiguous meanings, e.g. understand,
appreciate, feel, intend. These verbs are not considered behavioural
because one cannot observe or measure a person ―understanding‖ or
―appreciating‖. Below are examples of action verbs that are appropriate for
stating measurable behavioural objectives.


describe make illustrate predict


define recognize construct infer
measure identify draw repeat
state classify build write
discuss read recall make

In KNEC stating of behavioral objectives is the function of the Subject


Examination Panels. These statements are incorporated in the
Regulations and Syllabuses which the Kenya National Examinations
Council publishes.

2. OUTLINING THE CONTENT

In order to ensure that a test adequately samples the subject matter of
any discipline, it is essential to make an outline of the content to be
examined. The Kenya National Examinations Council publishes for each
examination a syllabus that lists, for individual subjects, what content areas
it will test. Again, the task of selecting the subject matter is the
responsibility of the Subject Examinations Panel. It should be pointed out
that curriculum development is the responsibility of the Kenya Institute of
Education.
3. TABLE OF SPECIFICATIONS
The purpose of a table of specifications is to ensure that the test covers
all the objectives of the instruction. A table of specifications or a Test
Blue Print is a two dimensional table with the content objectives listed
along one dimension and the behavioral performance / content or
instructional objectives listed along the other. Numbers are then inserted
in the cells so created to indicate how many test items should be set on
each behavioral and content objective.

In allocating items to the different cells, there is no rule of thumb. All
that a test constructor must avoid is producing an imbalanced paper.
The weighting will be reflected in the behavioural objectives. The weighting
given to each behavioural objective is arbitrary, as the decision lies with the
individual or a group of individuals.

Once the purpose of a test has been made, the teacher or test developer
has to make two decisions, namely:

(i) Decide on the weight to be given to each topic covered in the


course. The test items must be balanced with respect to relative
importance of the topics.
(ii) A second decision relates to the kinds of learning to be tested. How
much weight should be given to knowledge, comprehension,
application, analysis, synthesis and evaluation?

Once these two decisions have been made, the ―specifications‖ for a
particular test can be made. The ―specifications‖ of a test is presented in
a table called a “Table of Specifications‖ or blue print. A table of
specifications is a two dimensional chart with the content (topics) as one
dimension and behavioural performance or kinds of achievement as the
other.

Assume that a teacher wants to develop a 50-item biology objective test.

Content/Topics   Knowledge   Comp.   Application   Analysis   Synthesis   Evaluation   Total
Topic 1          2           2       0             0          0           1            5
Topic 2          3           2       2             1          1           1            10
Topic 3          3           1       3             2          2           2            13
Topic 4          2           5       4             2          0           2            15
Topic 5          2           2       1             0          1           1            7
Total            12          12      10            5          4           7            50

Table of specifications is used for the design and development of the


objective tests.
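
The table of specifications above can also be kept as a simple data structure and checked for balance. The sketch below (illustrative; the topic names are placeholders taken from the table) verifies that the row and column totals add up to the planned 50 items.

# Python sketch: checking a test blueprint (table of specifications)
levels = ["Knowledge", "Comprehension", "Application", "Analysis", "Synthesis", "Evaluation"]

blueprint = {                        # topic -> number of items per cognitive level
    "Topic 1": [2, 2, 0, 0, 0, 1],
    "Topic 2": [3, 2, 2, 1, 1, 1],
    "Topic 3": [3, 1, 3, 2, 2, 2],
    "Topic 4": [2, 5, 4, 2, 0, 2],
    "Topic 5": [2, 2, 1, 0, 1, 1],
}

for topic, counts in blueprint.items():
    print(topic, "->", sum(counts), "items")                 # 5, 10, 13, 15, 7

column_totals = [sum(row[i] for row in blueprint.values()) for i in range(len(levels))]
print(dict(zip(levels, column_totals)))                      # 12, 12, 10, 5, 4, 7
print("Total items:", sum(column_totals))                    # 50
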

4. CONSTRUCTION OF TESTS
In the school setting the most convenient tests are paper and pencil tests
– or written tests. Such tests are commonly of two types:
(i) Essay tests and (ii) Objective type tests

ESSAY TESTS
The essential feature of an essay is that it is open-ended and each
candidate may present his own answer in his own particular style.

WRITING ESSAY TESTS


In writing essay items, the test constructor/developer needs to:
(i) Identify the topic to be tested.
(ii) Must be anchored on instructional objectives.
(iii) Be clear on what he/she would like to test. He/she should
identify components of the topic and decide on what aspect(s)
he/she would like to examine the students on.
(iv) Frame the question in simple and direct language.
(v) Require students to answer all items.
(vi) Present one task at a time.
Below is an example of an essay question that present one task at a time.

Example:
(a) Name two sources of support which helped Britain during the Mau
Mau war of 1950-1959.
(b) Why did the British lose the war?
(c) What were the effects of the war on:
(i) the British?
(ii) the African communities?
In the example above, the topic was Mau Mau War of 1950-1959. The
test constructor wished to examine the candidates on the following
aspects: (i) the allies of Britain, (ii) the reasons for the defeat of the
British and (iii) the impact of the War on both the colonizer and the
colonized people. The items are expressed in simple and direct language;
and the candidates are presented with one task at a time.

Merits of Essay Tests

Essay tests are particularly useful for testing:
(a) The ability to recall rather than simply recognize information.
(b) The ability for expression and communication.
(c) The ability to select, organize and integrate ideas in a general attack
on problems.
(d) Suitable for assessing students‘ ability to analyze, synthesize, and
evaluate.
(e) Not susceptible to correct guesses.
(f) Measure creative abilities, such as writing talent or imagination.

Demerits of Essay Tests

However, their uses are restricted by the following limitations:


(i) The scoring tends to be unreliable.
(ii) The ‗halo effect‘ is more operative in essay tests.
(iii) ―Leniency/severity errors‖ are common in essay tests. This is where
examiners/raters give ratings that are consistently too high or too low.
(iv) ―Error of central tendency‖. Some examiners tend to avoid
extreme categories or giving high scores, concentrating instead
on categories around the midpoint or average of the scales. This
is called the error of central tendency. It also occurs in essay tests.
(v) While it saves time in writing, the scoring is time consuming.
(vi) A limited sampling of achievement is obtained; an essay tests a limited
range of the content.
(vii) One issue relating to essay tests is whether and how much to
count grammar, spelling and other mechanical features. If you
do count these factors, give students separate grades in content
and in mechanics so that they will know the basis on which their
work is being evaluated. Some examples are given below. These
are real examples drawn from EPSC 311 March 2018
examination scripts:
 Litrate for literate
 Wrote learning for rote learning
 Negletion
 Coatching for coaching
 Privillaged environment
 Flock for block
 Compitend for competent
 Diviation for deviation
Mechanically, you as an examiner know what the candidate is
saying.

Scoring procedures of the essay tests can be improved by:


(i) Using a marking scheme.
(ii) Through coordination of examiners.
(iii) Sampling of marked scripts by senior examiners


OBJECTIVE TESTS
An objective test is one so constructed that, irrespective of who marks the
answers, the score for a particular candidate is always the same. The
objectivity really refers to the marking of the test. In order to achieve
such objectivity, objective tests usually have pre-coded answers. In any
particular item, there has to be one and only one correct answer.

FORMATS FOR OBJECTIVE TEST ITEMS


Three main formats are used in constructing objective test items. They
are
(i) True – False
(ii) Matching
(iii) Multiple – Choice
TRUE FALSE ITEMS
In these items the examinee must decide whether a given statement is
true or false. For example:
1. The first President of KANU was James Gichuru. T/F.
2. History is about the past. T/F
3. Rift valley is gradually sinking. T/F
4. Sea level is rising. T/F

MATCHING ITEMS
A matching item consists of two lists, phrases, pictures, other symbols
and a set of instructions explaining the basis on which the examinee is to
match an item in the first list with an item in the second list. The
elements of the list that is read first are called premises, and the
elements in the other list are called responses. It is possible to have
more premises than responses, more responses than premises, or to have
the same number of each. In the example of a matching exercise that

follows, the premises appear in the left-hand column, with the responses
at the right but in some cases the responses may be placed below the
premises.
The primary cognitive skill that matching exercises test is recall.

List I (Premises)          List II (Responses)

1. KANU ( ) 2007

2. NARC ( ) 2013

3. PNU ( ) 1960

4. JUBILEE ( ) 2002

5. KADU ( ) 1925

MULTIPLE CHOICE ITEMS

Multiple choice questions are generally considered to be the most useful of
the objective type items. A multiple choice item consists of a stem plus
two or more alternatives (options), one of which meets the requirement
demanded by the stem. The item stem may be in the form of:
(i) A question
(ii) A complete statement
(iii) An incomplete statement

STRUCTURE OF MULTIPLE – CHOICE ITEMS

A multiple-choice test item consists of two parts:

(i) A problem, called the stem.
(ii) A list of suggested solutions, called alternatives/options, one of
which meets the requirements demanded by the stem.

The stem is in the form of:
(i) A question
(ii) A complete statement
(iii) An incomplete statement

The list of alternatives contains:
(i) One and only one correct answer
(ii) Three distractors (incorrect alternatives)

Question Example of a Stem

Who among the following people chaired the Kenya Constitution Review
Commission?

1. Githu Muigai
2. James Orengo
3. Yash Pal Ghai
4. Paul Muite
5. Raila Odinga

What is the most complex level in the taxonomy of the cognitive domains?

a. Knowledge
b. Synthesis
c. Evaluation
d. Analysis
e. Comprehension

Sources of good distracters include:

(i) Common misconceptions and common errors.
(ii) A statement which itself is true, but which does not satisfy the
requirement of the problem.
(iii) A carefully worded incorrect statement.

Complete Statement Example of a Stem

In order to sell fish in a village market, a trader requires a license. The


license is obtained from:

a. Police officer in the area.
b. County officer in the area.
c. County officer or the Health Inspector in the area.
d. The leading business man in the area.

Incomplete Statement Example of a Stem

The primary effect of climate change in Kenya is:

a. Reduction of mangrove forest


b. Limited increase in livestock
c. Rising water level in Rift Valley
d. Poor crop production

The term test as used in measurement is defined as:

a. A standard procedure for assessing learners.


b. Making adjustments of learners‘ abilities.
c. Device for sampling learners‘ abilities.
d. A reliable measurement instrument.

GUIDELINES FOR CONSTRUCTING MULTIPLE CHOICES ITEMS

1. Construct each item to assess a single written objective


2. Base each item on a specific problem stated clearly in the stem


After reading the stem, the student should know exactly what the
problem is and what he or she is expected to do to solve it.
3. State the stem in positive form.
4. Keep the item short.
5. Word the alternatives clearly and concisely. This is to reduce student
confusion.
6. Keep the alternatives mutually exclusive.
7. Avoid ―all of these‖, ―none of these‖ and ―both A and B‖ answer choices.
8. Keep options lengths similar
9. Avoid cues to the correct answer
10. Use only one correct option
11. Vary the position of the correct options
12. Guard against giving clues in the correct answers.
13. Avoid any tendency to make the correct answer consistently longer
than the distracters.
14. Avoid ―give-aways‖ in the distracters, for example ―always‖, ―only‖,
―all‖, ―never‖ etc.
15. Use language that is simple, direct and free of ambiguity.
16. Do not use double negatives in an item.
MERITS OF OBJECTIVE TESTS

(i) Measure a great variety of educational objectives


(ii) Measure all cognitive domains, from simple skills (knowledge) to
higher-level skills (evaluation).
(iii) Item analysis can be applied to multiple choice items.
(iv) A student is able to answer many multiple choice items in the time
it would take to answer a single essay question. Takes shorter time
to answer than essay.
(v) Their marking is free from bias- Can be marked mechanically.
Multiple choices tests can be scored on a completely objective basis.
(vi) They enable test developers to sample a wider content area (more
representative achievement)

(vii) They enable test developer to evaluate greater variety of abilities.


(viii) Free of ―halo effects‖ – immune to ―halo effects‖.
(ix) Test using it are usually more reliable than other types
(x) The role that guessing plays in determining an examinee‘s score is
reduced when each item is provided with several alternatives, e.g. 4
to 5, and this increases the reliability of the test.

DEMERITS OF OBJECTIVE TESTS

(i) Cannot sample the ability to communicate or express ideas.
(ii) Not totally free of the guessing factor, which reduces the reliability of
multiple-choice tests.
(iii) More difficult and time consuming to write than other types of test
items.
(iv) Difficult to find distracters.
(v) Does not provide a measure of writing ability – same as (i).

Essay vs Objective Tests


Although much has been said and written about the relative merits of
essay and objective test questions, it can safely be said that neither is
fundamentally superior for all purposes. Both have their merits, their
problems, and situations in which they are preferable. Among the major
advantages of objective type questions are that a large number of
questions can be asked in a given testing time, permitting fairer and more
complete sampling of subject matter; scoring is easier and much more
reliable; the questions lend themselves to item analysis; and, through
pretesting of questions, test difficulty, validity and reliability can be
predicted, controlled and improved.

Because KNEC examinations involve large numbers of candidates, these
advantages make it almost mandatory that most of our tests be of the
objective, machine-scorable variety. The major disadvantage of objective
questions is that it is not easy to write good objective questions testing
more than knowledge and requiring candidates to demonstrate more
sophisticated mental processes. It requires a high order of ingenuity and
creativity to write multiple-choice items that test the full gamut of
abilities. This is a constant challenge for item writers, both on the staff
and on committees involved in item preparation.


LECTURE 12
Welcome to lecture 12

ITEM ANALYSIS
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Define item analysis.
 Explain and illustrate the three aspects of item
analysis:
 Item difficulty index
 Item discrimination index
 Distractor analysis
 Calculate and interpret three aspects of item analysis.

After a test has been administered and scored, even if we adhered to


qualities of a good test, it is usually desirable to evaluate the
effectiveness of the test items. This is done by studying the examinees‘
responses to each item. This procedure is called item analysis. The
purpose of item analysis is to identify deficiencies in the test instrument.
Item analysis may appear like giving medicine after death.

In principle, item analysis can be carried out on both essay and objective
tests but the techniques are much developed for the objective test items.

Items can be analyzed qualitatively in terms of their content and form,


and quantitatively in terms of statistical properties.


(a) Qualitative Analysis

For this type of analysis one requires the service of both subject content
specialists and test construction specialist. Most of the qualitative
analysis can be done before the tests are administered, for example,
judging the content validity is a qualitative type of item analysis.
Examining stems of items for ambiguity is another way of
qualitative analysis.

(b) Quantitative Analysis

This includes principally the measurement of such properties as the
difficulty level and the discrimination power or index of the items, and
determining the effectiveness of the distractors, or distractor analysis.

When we prepare items for a test, we hope that each of them will be
useful in a certain statistical way. That is, we hope that each item will turn
out to be of the appropriate level of difficulty for the group, that
proportionately more of the better students will get it right than the
poorer ones, and that the incorrect options will prove attractive to the
students who cannot arrive at the right answer through their own ability.

Item analysis uses statistical methods to identify any test items that are
not working well. If an item is too easy, fails to show a difference
between skilled and unskilled examinees, or is even scored incorrectly, an
item analysis will reveal it. That is, item analysis information can tell us if
an item was too easy or too hard, how well it discriminated between high
and low scorers on the test, and whether all of the alternatives
(distractors) functioned as intended. The three most common statistics or
areas reported in an item analysis are:

 Item Difficulty Index


 Item Discrimination Index
 Distractor Analysis


ITEM DIFFICULTY INDEX

The item difficulty index is one of the most useful, and most frequently
reported, item analysis statistics. It is a measure of the proportion of
examinees who answered the item correctly. Teachers produce a difficulty
index for a test item by calculating the proportion of students in the class who
got the item correct. The larger the proportion, the more students there are
who have learned the content measured by the item.

(An aside on course evaluation, e.g. geography: first state the objectives of why
geography should be taught in high school, then test the learners on geographical
facts in order to evaluate the achievement or failure of those objectives.)

CALCULATION OF ITEM DIFFICULTY INDEX

There are two ways of computing the item difficulty index, namely:

i. Simpler Approach for Calculating ID Index

ii. More Complex Approach for Calculating ID Index

SIMPLER APPROACH FOR CALCULATING ITEM DIFFICULTY INDEX

This approach is less accurate but good for helping teachers understand the
concept of the ID index in a simpler way. For example, imagine a classroom
of 40 Standard 6 students who took a test which included the item below.
What is the item difficulty of this test item? The asterisk indicates
that B is the correct answer.

Test Item: Who was the First President of KANU?

Option No. Choosing

A. Tom Mboya 6
*B. James Gichuru 24
C. Jomo Kenyatta 10
D. Robert Matano 0

Item Difficulty Index: the proportion of students who got the item correct.

 Count the number of students who got the correct answer: 24 students chose option B.
 Divide by the total number of students who took the test.

The Difficulty Index ranges from .00 to 1.00. For this example, Difficulty
Index = 24/40 = .60, or 60%. This means that sixty percent of the students
knew the answer.
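
The same arithmetic can be expressed as a short routine. The sketch below is an illustration only (Python, which is not part of the original notes; the function name is hypothetical), using the counts from the KANU item above.

```python
def difficulty_index(correct_count, total_examinees):
    """Proportion of examinees who answered the item correctly (0.00 to 1.00)."""
    return correct_count / total_examinees

# KANU item: 24 of the 40 pupils chose the keyed option B
print(difficulty_index(24, 40))   # 0.6 -> 60%, an item of average difficulty
```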

Interpretation of Item Difficulty Index

% Range          Difficulty Level
20% and below    Very difficult
21 – 40%         Difficult
41 – 60%         Average
61 – 80%         Easy
81% and above    Very easy
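
If a teacher wanted to attach these verbal labels automatically, a small helper could be sketched as follows (assumed Python; the cut-off points are those in the table above).

```python
def difficulty_label(p):
    """Map a difficulty index (0.0 - 1.0) to the verbal labels in the table above."""
    percent = p * 100
    if percent <= 20:
        return "Very difficult"
    elif percent <= 40:
        return "Difficult"
    elif percent <= 60:
        return "Average"
    elif percent <= 80:
        return "Easy"
    else:
        return "Very easy"

print(difficulty_label(0.60))   # Average
print(difficulty_label(0.40))   # Difficult
```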

MORE COMPLEX APPROACH FOR CALCULATING ID INDEX

In computing item difficulty index of a test item using this approach you
need to do the following:

 First, select one-third of the examinees with the highest scores in


the paper and call this the upper group and select the same
number with the lowest score and call this group lower group.


 Second, for each item, count the number of examinees in the


upper group who selected each alternative. Make the same count
for the lower group.
 Third, estimate item difficulty by determining the percentage of
examinees that get the item right.

Assume 30 examinees took the History paper and their responses to Question 1
are as follows:

Alternatives (Options)

Groups    n     A    B*   C    D    E
Upper     10    0    6    3    1    0
Lower     10    3    2    2    3    0

* = Correct answer.

Total number in the upper and lower groups = 10 + 10 = 20

Total selecting the correct answer = 6 + 2 = 8

Index of Difficulty = (RU + RL) / (n1 + n2)

Where,
RU = the number of examinees in the upper-scoring group responding correctly
RL = the number of examinees in the lower-scoring group responding correctly
n1 + n2 = the total number of examinees in the upper- and lower-scoring groups combined.

ID = (students with correct answers / total students (n1 + n2)) x 100 = 8/20 x 100 = 40%, i.e. 0.40


Since difficulty refers to the percentage getting the item right, the
smaller the percentage figure the more difficult the item. Hence, the Index
of Difficulty can range between 0% and 100%, with a higher value
indicating that a greater proportion of examinees responded to the item
correctly and that it was thus an easier item.

Interpretation of Item Difficulty. The same table given above applies.
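
As a hedged sketch (assumed Python, illustrative names only), the upper-lower version of the index for the History example above, where 6 of the 10 upper scorers and 2 of the 10 lower scorers answered correctly, could be computed as follows.

```python
def difficulty_index_upper_lower(r_upper, r_lower, n_upper, n_lower):
    """Index of Difficulty = (RU + RL) / (n1 + n2), using the upper and lower groups."""
    return (r_upper + r_lower) / (n_upper + n_lower)

# History Question 1: RU = 6, RL = 2, with 10 examinees in each group
print(difficulty_index_upper_lower(6, 2, 10, 10))   # 0.4 -> 40%, a difficult item
```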

ITEM DISCRIMINATION INDEX

The item discrimination index is a measure of how well an item is able to


distinguish/discriminate between examinees who are knowledgeable and
those who are not, or between masters and non-masters.

It is the degree to which students with high overall examination scores
also get a particular item correct. For an item that is highly
discriminating, in general the examinees who responded to the item
correctly also did well on the test, while in general the examinees who
responded to the item incorrectly tended to do poorly on the
overall test.

Question: Did students who scored high in the history examination also
get Question 1 correct?

CALCULATING ITEM DISCRIMINATION INDEX (DI)

There are actually several ways to compute item discrimination. Some of
these formulae use equal numbers of upper and lower scorers; others use
unequal numbers:

i. A simpler approach for the ordinary teacher with limited knowledge of
statistics.

ii. The point-biserial correlation. This is the most common approach
but is complex for an ordinary classroom teacher. This statistic
looks at the relationship between an examinee's performance on
the given item (correct and incorrect) and the examinee‘s score
on the overall test.

SIMPLER CALCULATION APPROACH FOR DI

Create two equal groups of students: upper scorers and lower scorers.

High scores group = made up of the high scorers (the upper half of the class) on the whole test.
Low scores group = made up of the low scorers (the bottom half of the class) on the whole test.
Then:
a. Calculate a difficulty index for the test item for each group.
b. Subtract the difficulty index of the low scorers from the difficulty index of the high scorers.
The Discrimination Index ranges from -1.0 to 1.0.

Test Item: Imagine, in the KANU test item example, that 16 out of 20
students in the high group (n1) and 8 out of 20 students in the low group
(n2) got the item correct. Hence:

High Scores Group: 16/20 = 0.80
Low Scores Group: 8/20 = 0.40

Discrimination Index = .80 - .40 = .40. This indicates the test item was
good at discriminating between learners.

The same calculation for Question 1 of the History paper (6 of the 10
upper scorers and 2 of the 10 lower scorers got the item right) is:

Discrimination Index = U/n1 - L/n2 = 6/10 - 2/10 = 0.60 - 0.20 = 0.40

Where,
U = the number in the high scoring group who got the item correct, and n1 is the size of that group.
L = the number in the low scoring group who got the item correct, and n2 is the size of that group.
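
A minimal sketch of this simpler discrimination calculation (assumed Python; not part of the original notes), applied to the two worked examples above and to a negatively discriminating item:

```python
def discrimination_index(r_upper, r_lower, n_upper, n_lower):
    """Difficulty of the item in the upper group minus difficulty in the lower group."""
    return r_upper / n_upper - r_lower / n_lower

print(round(discrimination_index(16, 8, 20, 20), 2))   # 0.4  (KANU item: very good)
print(round(discrimination_index(6, 2, 10, 10), 2))    # 0.4  (History item: very good)
print(round(discrimination_index(6, 18, 20, 20), 2))   # -0.6 (negatively discriminating)
```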


Interpretation of Discrimination Index

Correlation Range    Description
.40 and above        Very good item
.30 - .39            Good item
.20 - .29            Fair item
.09 - .19            Poor item

A strong and positive correlation suggests that students who get any one
question correct also have a relatively high score on the overall
examination.

A negative discrimination index may indicate that the item is measuring


something other than what the rest of the test is measuring. More often,
it is a sign that the item has been mis-keyed.

Examples Negative Discrimination Index

1. High scorers=6/20=0.30

Low scorers=18/20=0.90

Discrimination Index =0.30-0.90= - 0.60

2. High scorers=0/20=0.00

Low scorers=20/20=1.00

Discrimination Index =0.00-1.00= -1.00


Alternative Approach

Take the number of examinees responding correctly to the item in the upper
scoring group, subtract the number responding correctly in the lower
scoring group, and divide the difference by half the total number of
examinees in the two groups (upper + lower).

Discrimination Index = (Ru - Rl) / [(nu + nl) ÷ 2] = (6 - 2) / (20 ÷ 2) = 4/10 = 0.40

Ru =Number of those in high scoring group that got the item correct.

Rl = Number of those in low scoring group that got the item correct.

nu =number of those in the high scoring group.

nl =number of those in the low scoring group.

Why divide (nu + nl) by 2? Because the number of students on each side of
the dividing line is half of the class.

Interpretation of Discrimination Index

The possible range of the discrimination index is -1.0 to 1.0. When an
item is discriminating negatively, overall the most knowledgeable
examinees are getting the item wrong and the least knowledgeable
examinees are getting the item right. A negative discrimination index may
indicate that the item is measuring something other than what the rest of
the test is measuring. More often, it is a sign that the item has been mis-
keyed.

If the discrimination index is negative, it also means that for some reason
students who scored low on the test were more likely to get the answer
correct. This is a strange situation which suggests poor validity for the item.


COMPLEX APPROACH FOR DI CALCULATION: POINT-BISERIAL CORRELATION

[NOT DISCUSSED HERE BECAUSE IT REQUIRES A KNOWLEDGE OF STATISTICS BEYOND THIS COURSE]

DISTRACTOR ANALYSIS / ANALYSIS OF RESPONSE OPTIONS

One important element in the quality of a multiple-choice item is the quality
of the item's distractors. However, neither the item difficulty nor the item
discrimination index considers the performance of the incorrect response
options, or distractors. A distractor analysis addresses the performance of
these incorrect response options.

Just as the key, or correct response option, must be definitely correct, the
distractors must be clearly incorrect (or clearly not the "best" option). In
addition to being clearly incorrect, the distractors must also be plausible.
That is, the distractors should seem likely or reasonable to an examinee

who is not sufficiently knowledgeable in the content area. If a distractor
appears so unlikely that almost no examinee will select it, it is not
contributing to the performance of the item. In fact, the presence of one
or more implausible distractors in a multiple-choice item can make the item
artificially far easier than it ought to be.

In addition to examining the performance of an entire test item, teachers


are often interested in examining the performance of individual distractors
(incorrect answer options) on multiple-choice items. By calculating the
proportion of students who chose each answer option, teachers can
identify which distractors are "working" and appear attractive to students
who do not know the correct answer, and which distractors are simply
taking up space and not being chosen by many students.

Example 1. The KANU Example

The analysis of response options shows that those who missed the item
were about equally likely to choose answer A and answer C. No students
chose answer D, so option D does not act as a distractor; students are
really choosing between only three options. This makes guessing correctly
more likely, which hurts the validity of the item.

A = 6/40 = .15
B = 24/40 = .60
C = 10/40 = .25
D = 0/40 = .00

A good distractor will attract more examinees from the lower group than
the upper group. In this example D was a very poor distractor: it was
obviously wrong to both good and poor students. Distractors A and C, on
the other hand, are functioning effectively.


ABILITY GROUPS       OPTIONS
                     A     B     C     D
High Scorers         1     16    3     0
Low Scorers          5     8     7     0
TOTAL                6     24    10    0

Interpretation

The analysis of response options shows that those who missed the item
were about equally likely to choose answer A and answer C. No students
chose answer D. Answer option D does not act as a distractor. Students
are not choosing between four answer options on this item, they are
really choosing between only three options, as they are not even
considering answer D. This makes guessing correctly more likely, which
hurts the validity of the item.

Example 2.

In a simple approach to distractor analysis, the proportion of examinees


in the upper and lower groups who selected each of the incorrect
response options is examined. A good distractor will attract more
examinees from the lower group than the upper group. In the
example given below, distractors A and C are functioning effectively.

Alternatives (Options)

Groups    n     A    B*   C    D    E
Upper     10    0    6    3    1    0
Lower     10    3    2    2    3    0

* = Correct answer.

The proportion of examinees who select each of the distractors can be


informative. For example, it can reveal an item that is mis-keyed.

Whenever the proportion of examinees who selected a distractor is


greater than the proportion of examinees who selected the key, the item
should be re-examined to determine if it has been mis-keyed or double
keyed. A distractor analysis can also reveal an implausible distractor.
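
As a rough sketch (assumed Python, counts taken from the KANU example above, key = B), the per-option proportions and a simple check of whether each distractor draws more lower-group than upper-group examinees could be computed like this:

```python
# Number of examinees choosing each option, by ability group (KANU item)
counts = {
    "High": {"A": 1, "B": 16, "C": 3, "D": 0},
    "Low":  {"A": 5, "B": 8,  "C": 7, "D": 0},
}
key = "B"
total_examinees = 40

for option in ["A", "B", "C", "D"]:
    high = counts["High"][option]
    low = counts["Low"][option]
    proportion = (high + low) / total_examinees
    if option == key:
        note = "key"
    elif low > high:
        note = "working distractor"
    else:
        note = "not functioning - re-examine"
    print(option, proportion, note)
# A 0.15 working distractor
# B 0.6 key
# C 0.25 working distractor
# D 0.0 not functioning - re-examine
```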

[SEE STANDARD NEWSPAPER REPORT OF OCTOBER 30, 2017 - ATTACHED]

LIMITATIONS OF ITEM ANALYSIS

 It is not commonly used in the analysis of essay items.
 It is only used when the test involves a large population of students.
 It requires the preparation of a large number of test items.
 It is not good for small groups of students.


LECTURE 13
Welcome to lecture 13

DEFICIENCIES IN TEACHER-MADE TESTS

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Explain the mistakes teachers make in developing
classroom tests.

By and large, teacher-made achievement tests, also referred to as locally
developed tests, are quite poor. The deficiencies include:
1. Ambiguous questions. These are questions which can be
interpreted in two or more ways.
Example of a Test Item that is Ambiguous
"Flying planes is dangerous." Is the statement saying that planes which
are flying are dangerous, or that the act of piloting a plane is
dangerous?
2. Excessive wording. Too often teachers think that the more
wording there is in a question, the clearer it will be to the students.
This is not always so. In fact, the more precise and clear cut the
wording, the greater the probability that the students will not be
confused.
Example of a Test Item that is Excessively Worded
Define the term "Osmosis". That is, what do you understand by the
term "Osmosis"? In other words, what does "Osmosis" mean?
3. Lack of appropriate emphasis. More often than not, teacher-made
tests do not cover the objectives stressed and taught by the
teacher, and do not reflect proportionally the teacher's judgment as
to the importance of those objectives. They are often heavily loaded
with items that only test recall.
4. Use of inappropriate item format. Some teachers use different
item formats (such as true-false or essays) because they feel that
change or diversity is desirable. This is not a sound basis for setting
questions.

How do you design a reliable classroom test?
1. Chance factors must be reduced to a minimum. One way is to
eliminate true-false and other two-choice types.
2. Write clear instructions so that students will be measured on their
performance rather than on their ability to "figure out what the teacher
wants".
3. Ensure consistency in scoring by using a marking key prepared in advance.
4. The test must be moderated.


LECTURE 14

Welcome to lecture 14

TEST ADMINISTRATION, SCORING AND


INTERPRETATION OF TEST RESULTS

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Explain the different procedures KNEC has put in place to provide
credible examination.

This section has been covered under KNEC.


From initiation to scoring, a test goes through three key stages, namely:
 Development
 Administration
 Processing= mechanical scoring, manual scoring and training of
examiners.
We have discussed test development under test construction or test
planning.

Test Administration
As already discussed, tests determine the destiny of individuals and
hence the conditions under which they are administered must be fair
and uniform. In this respect, fair administration of tests takes into
consideration:
 Rehearsal in the case of national examinations.
 Provision of uniform instructions on the conduct of examinations.

 Provision of the same test time to all candidates.


 Provision of testing environment that is free of noise.
 Provision of adequate lighting in the classroom and laboratories
where experiments are carried out.
 Provision of security of examination materials to avoid theft and
cheating.
 Provision of good supervision and invigilation to avoid cheating by
candidates. Cheating gives unfair advantage to those who cheat.
For reliability purposes, there is need for consistency in test
administration and scoring.

Test Scoring
Scoring is one pillar of fairness in an examination; it can be a source of
unfairness if not well managed. There are two types of scoring systems in
use in Kenya.
 Manual scoring. Used mainly in schools and for essay types of
questions in KNEC examinations. Subjectivity/bias can be high
under manual scoring.
 Electronic scoring by use of an Optical Mark Reader/Scanner.
This is used by KNEC for scoring objective test items in KCPE and
KCSE. Objectivity is high in electronic marking/scoring. Objectivity
refers to consistency in test interpretation and scoring.
The conditions that promote fair scoring of a test include:
 Moderation of the marking scheme.
 Training of markers in the case of essay tests.
 Coordination of markers and putting them in smaller teams.
 Retirement of erratic and over-generous markers.

Interpretation of Test Results

Fair interpretation of test results takes into consideration:
 Type of test: whether a norm-referenced test or a criterion-referenced
test.

 Test difficulty.
Because the assumption under testing is normality, scores are interpreted
in relation to the normal curve. However, there is an estimated
(hypothetical) distribution and an observed (actual) distribution of test
scores. These are illustrated below using physics and geography test
scores.

(i) Estimated (Hypothetical) Distribution

[Figure: a scale of scores running from a minimum of 200 to a maximum of 800.]

(ii) Physics Test Scores

[Figure: the observed distribution of candidates who took the physics test, shown against the estimated (hypothetical) distribution of the physics test for all candidates in the KNEC standardization group, on a scale of scores from 200 to 800.]

(iii) Geography Test

[Figure: the observed distribution of candidates who took the Geography test, shown against the estimated (hypothetical) distribution on the Geography test for all candidates in the KNEC standardization group, on a scale of scores from 200 to 800.]

In terms of norm-referenced testing, raw scores are interpreted in terms of
a defined group (the standardization group).


LECTURE 15
Welcome to lecture 15.

STATISTICAL ANALYSIS OF TEST SCORES

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Describe the terms:
 Population
 Statistic
 Describe the application and interpretation of measures of central
tendency to test scores.
 Describe the application and interpretation of measures of
variability to test scores.

Tests are quantifiable measures. Statistics help in decision-making with


respect to the interpretation of results. Regardless of scale or level of
measurement inherent in a particular test, the data from that test must
be placed in a manageable and interpretable form. One way this can be
accomplished is by describing the test results in terms of statistics.

Statistical Concepts
(i) What is a sample? The smaller group of people who actually
participate in the test is known as a sample. This is a sub-set of the
population and is represented by lower case n.

[Figure: the sample shown as a subset contained within the population.]

(ii) What is a population? The entire group of people/pupils
who take the test is known as the population. This is represented in
statistics by capital/upper case N.
(iii) What is a parameter? A parameter is a numerical (number)
characteristic of an entire population. Example: the mean reading
readiness score for all Standard One pupils in Kenya.
(iv) What is a statistic? This is a numerical (number)
characteristic of a sample. Example: if we draw a sample of
Nakuru County Standard One pupils, for example from Njoro
sub-county, and determine the mean reading score for this sample, this
mean would be a statistic. A statistic is used as an estimate of the
parameter.

Descriptive Statistics
Once a large set of scores has been collected, certain descriptive values
can be calculated. These are values that summarize or condense the
set of scores, giving it meaning. Descriptive values are used by teachers
to evaluate the individual performance of pupils and to describe the
group's performance or compare its performance with that of another
group.

Once you collect data from a large sample, you can do the following
things:
 Organizing and graphing test scores.
 Applying descriptive statistics.


Organizing and Graphing Test Scores

Prepare a frequency distribution table of the test scores. You can present
the test scores individually or in grouped format, in the form of a frequency
table or a histogram.

(a) For individual scores from highest to lowest (ungrouped


data).
Scores f (frequency)
96 1
92 1
90 1
88 2
86 1
84 1
75 3
73 2
70 1
60 1
58 2
56 1
_______
17
________

(b) Grouped Scores

Class Interval    f (frequency)
90 – 100          3
80 – 89           4
70 – 79           6
60 – 69           1
50 – 59           3
                  ________
                  17
                  ________

 In a grouped frequency distribution, test-score intervals are called
"class intervals".
 Decide on the "width" of the class interval.
 Class intervals must all be of the same width. In the table above the class
interval width is 10 scores. (A short sketch of how such a grouped table can be produced is given below.)
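
The sketch referred to above (assumed Python, not part of the original notes) builds the grouped table from the 17 scores listed earlier:

```python
scores = [96, 92, 90, 88, 88, 86, 84, 75, 75, 75, 73, 73, 70, 60, 58, 58, 56]

# Class intervals of width 10, matching the grouped table above
intervals = [(90, 100), (80, 89), (70, 79), (60, 69), (50, 59)]
for low, high in intervals:
    f = sum(1 for s in scores if low <= s <= high)
    print(f"{low} - {high}: {f}")
# 90 - 100: 3, 80 - 89: 4, 70 - 79: 6, 60 - 69: 1, 50 - 59: 3   (total 17)
```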

Applying Descriptive Statistics


The descriptive statistical tools we use to give meaning to test scores are
measures of central tendency and measures of variability.

(i) Measures of Central Tendency

One type of descriptive value is the measure of central tendency, which
indicates the point around which scores tend to be concentrated. These
measures describe the centre or location of a distribution. They do not
provide any information regarding the spread or scatter of the scores.
There are three measures of central tendency:
 the mode
 the median
 the mean


Mode
The mode is the score most frequently received. It is used with nominal
data. In the ungrouped scores given above the mode is 75. For the grouped
data the modal interval is 70 – 79.
A frequency distribution can be uni-modal (one mode), bi-modal (two
modes), tri-modal (three modes) or poly-modal (many modes).
Example: 2, 2, 2, 3, 4, 6, 6, 6, 7, 8 is bi-modal; the modes are 2 and 6.

Median
The median is the middle score; half the scores fall above the median and
half below. It cannot be calculated unless the scores are listed in order
from best to worst or in either ascending or descending order. Hence the
procedure for getting the median of a distribution is as follows:
 First arrange the scores in ascending or descending order.
 Determine the position or location of the median.
 Calculate the median of the scores.

Find the median for 66, 65, 61, 59, 53.
Position = (5 + 1)/2 = 3, i.e. the 3rd score, so the median = 61.

If the scores are 66, 65, 61, 59, 53, 50:
Position = (6 + 1)/2 = 3.5. That is, the median lies between the 3rd and 4th scores (61 and 59).
Median = (61 + 59)/2 = 120/2 = 60
Mean
The mean (symbolized X̄, read "X bar") is the most commonly used measure of central
tendency. It is affected by extreme scores. It is the sum of the scores
divided by the number of scores.

X̄ = ∑X / n

Where X̄ is the mean, ∑X is the sum of the scores, and n is the
number of scores. The symbol ∑ means "the sum of"; hence ∑X means the sum of
all the scores. In Greek, ∑ is called sigma. X represents the individual scores and
n is the number of students or number of scores.

Mean (X̄) of 66, 65, 61, 59, 53 = 304/5 = 60.8
The mean is appropriate for interval or ratio data.
The disadvantage of the mean is that it is influenced by outliers.
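
For the scores used above, the three measures can be obtained directly from Python's standard statistics module (a minimal sketch; Python is assumed and is not part of the original notes):

```python
import statistics

scores = [66, 65, 61, 59, 53]
print(statistics.mean(scores))     # 60.8
print(statistics.median(scores))   # 61

# With an even number of scores the median is the average of the middle pair
print(statistics.median([66, 65, 61, 59, 53, 50]))   # 60.0

# The bi-modal example: both modes are returned (Python 3.8+)
print(statistics.multimode([2, 2, 2, 3, 4, 6, 6, 6, 7, 8]))   # [2, 6]
```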

(ii) Measures of Variability

A second type of descriptive value is the measure of variability, which
describes a set of scores in terms of their spread, scatter or
heterogeneity. For example, consider these sets of scores for two groups.

Group 1    Group 2
9          5
5          6
1          4

For both groups the mean and the median are 5. If you simply report that the
mean and median for both groups are identical without showing the
variability of the scores, another person could conclude that the two groups
have equal or similar ability. This is not true: Group 2 is more
homogeneous in performance than Group 1. A measure of variability is the
descriptive term that indicates this difference in the spread, scatter or
heterogeneity of a set of scores. There are two such measures of
variability: the range and the standard deviation.


Range
The range is the easiest measure of variability to obtain and the one that
is used when the measure of central tendency is the mode or median.
The range is the difference between the highest and the lowest scores.
For example:
For Group 1: Range = 9 – 1 = 8
For Group 2: Range = 6 – 4 = 2
The range is neither a precise nor a stable measure, because it depends
on only two scores- the highest and the lowest.
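
A quick sketch (assumed Python) of the range for the two groups above:

```python
group1 = [9, 5, 1]
group2 = [5, 6, 4]
print(max(group1) - min(group1))   # 8
print(max(group2) - min(group2))   # 2
```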

Standard Deviation
The standard deviation (symbolized S.D) is the measure of variability
used with the mean. It indicates the amount that all the scores differ or
deviate from the mean – the more the scores deviate from the mean, the
higher the standard deviation. The sum of the deviations of the scores
from the mean is always 0. There are two types of formulas that are
used to compute S.D.
 Deviation formula.
 Raw score formula.
The deviation formula illustrates what the S.D. is, but it is more difficult to
use by hand if the mean has a fraction. The raw score formula is easier
to use if you have only a simple calculator.
Let us use the scores: 7, 2, 7, 6, 5, 6, 2.

(i). Deviation Formula

S.D. = √[ ∑(X – X̄)² / (n – 1) ]

Where S.D. is the standard deviation, X represents the scores, X̄ is the mean, and
n is the number of scores.

Some books, calculators, and computer programs will use the term n
rather than n-1 in the denominator of the standard deviation formula.
When the sample is large you can use n because a larger sample
approaches the population size.
Why n-1?
i. Use of n-1 gives a good estimate of the population variance or S.D.
That is, it gives an unbiased estimate of the population variance.
ii. We use n-1 when the sample size is small in order to get an unbiased
estimate of the population variance.

In this illustration let us use n-1.


Step 1

X̄ = ∑X / n = 35/7 = 5

Steps 2 – 3

X     X̄     (X – X̄)     (X – X̄)²
7     5      2           4
2     5     -3           9
7     5      2           4
6     5      1           1
5     5      0           0
6     5      1           1
2     5     -3           9
∑X = 35      ∑ = 0       ∑ = 28

Step 4

S.D. = √[28 / (7 – 1)] = √4.67 = 2.16
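
The same steps can be sketched in Python (an illustration only, using n - 1 in the denominator as above):

```python
scores = [7, 2, 7, 6, 5, 6, 2]
n = len(scores)

mean = sum(scores) / n                                  # Step 1: 5.0
squared_deviations = [(x - mean) ** 2 for x in scores]  # Steps 2-3
print(sum(squared_deviations))                          # 28.0

sd = (sum(squared_deviations) / (n - 1)) ** 0.5         # Step 4
print(round(sd, 2))                                     # 2.16
```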


(ii). The Raw Score formula

The deviation formula is seldom used to calculate the S.D. by hand,


because it is cumbersome when the mean has a fraction. Instead the
following raw formula (also called computational formula) is used to
calculate S.D:

S.D. = √[ (∑X² – (∑X)²/n) / (n – 1) ]

Where ∑X² is the sum of the squared scores, ∑X is the sum of the scores,
and n is the number of scores.
X     X²
7     49
2     4
7     49
6     36
5     25
6     36
2     4
∑X = 35    ∑X² = 203

The computation of the S.D. is as follows:

S.D. = √[ (203 – (35)²/7) / (7 – 1) ] = √[ (203 – 1225/7) / 6 ]

     = √[ (203 – 175) / 6 ] = √(28/6) = √4.67 = 2.16
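
The raw score formula gives the same result as the deviation formula, as this short check illustrates (a sketch in assumed Python, not part of the original notes):

```python
scores = [7, 2, 7, 6, 5, 6, 2]
n = len(scores)
sum_x = sum(scores)                   # 35
sum_x2 = sum(x * x for x in scores)   # 203

sd = ((sum_x2 - sum_x ** 2 / n) / (n - 1)) ** 0.5
print(round(sd, 2))                   # 2.16, matching the deviation formula
```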


How do you interpret the scores in relation to the mean and S.D.?
In reporting your pupils' scores, you need to report both the mean and
the S.D.
 A test norm allows meaningful interpretation of test scores.
 A person's raw test score is meaningless unless it is evaluated in terms
of the standardized group norms. For example, if a student
receives a raw score of 78 out of 100 in history, does that mean
that the student is doing well?
The score of 78 can be interpreted only when the norms are consulted. If
the mean of the test norm is 80 and the standard deviation is 10, the
score of 78 can be evaluated as "typical" performance, indicating that the
student possesses an average knowledge of history.

SELF-ASSESSMENT EXERCISE
Use the raw score formula to compute the mean and the S.D. for the test
scores of the following two groups of students:

Group 1: 9, 5, 1
Group 2: 5, 6, 4
What does the S.D. tell you about these two groups?
For Group 1 you should get an S.D. of 4.
For Group 2 you should get an S.D. of 1.


Interpretation

Though both groups have a mean score of 5, pictorially/graphically the
spread of the scores will look like the figure described below.

[Figure: frequency plotted against score value (0 – 9) for the two groups; Group 2's distribution is narrow and concentrated around the mean of 5, while Group 1's distribution is much more spread out.]

For both groups, the test scores have the same mean but different
variability or spread of scores. Students in Group 1 have a larger S.D.
(S.D. = 4), indicating that they are more heterogeneous in ability. Students
in Group 2 have a smaller S.D. (S.D. = 1), indicating that they are more
homogeneous in ability.

**********************************************************

