
candidates were disadvantaged. Moderation of marking would have taken care of this error through compensation. The Ministry has removed this key process. The Cabinet Secretary (CS), unfortunately, called the moderation process "massaging" of results.

4. Lack of Inter-examiners' Reliability Process

i. Standard Procedure:
During marking, examiners are put in a pool of seven with a team leader. For every 10 scripts marked, the team leader has to review (re-mark) at least two scripts, picked randomly, to verify that they have been marked well. The marking error allowed is –2 or +2, and if an examiner goes beyond this margin, he/she is forced to re-mark the entire batch of scripts. If he/she consistently makes the same mistake, he/she is retired (expelled) from the marking exercise.

ii. Abnormal Procedure:

In the 2017 KCSE, it was reported by the examiners that the inter-examiners' process was not done. Skipping this process enabled many examiners to go for volume of scripts, since payment is based on the number of scripts marked.

COMPETENCE-BASED EDUCATION/CURRICULUM (CBE)

Kenya plans to phase out 8-4-4 and introduce CBE. Why have the previous reforms in Kenya failed?
 We abolished the 'A level' segment, saying it did NOT serve Kenya well. 'A level' serves Britain well and enables Britain to advance technologically. So why is it not good for Kenya? Why is it that it develops Britain and underdevelops Kenya?
 We changed to 8-4-4, an education structure that serves the USA and Canada well. Why is it that it develops the USA and Canada and underdevelops Kenya? Education reforms in Kenya are guided by politics rather than professionalism.

What is Competence-Based Curriculum (CBC) and what makes it different?

The most important characteristic of competency-based education is that


it measures learning rather than time. Students progress by demonstrating their competence, which means they prove that they have
mastered the knowledge, skills and attitudes (called competencies)
required for a particular course, regardless of how long it takes. While
more traditional models can and often do measure competency,
they are time-based — courses last about eight years, and students
may advance only after they have put in the seat time. This is true even
if they could have completed the coursework and passed the final exam in
half the time. So, while most colleges and universities hold time
requirements constant and let learning vary, competency-based learning
allows us to hold learning constant and let time vary.


Head + Hand + Heart = Competent Professional

Competence = a combination of the knowledge, skills and attitudes required to become a competent professional. [Refer to Nelson Mandela on mother tongue.]

Benefits of Competency-Based Learning

There are many reasons schools are moving away from seat time and
toward competency-based learning. These reasons include:

 This is the type of education that combines the above (head, hand and heart) in a graduate.
 Performance-based/outcome-based education with clear performance indicators.
 Education that incorporates real-life assignments and assessment (problem-solving skills, meaningful tasks, etc.).
 Education that incorporates the requirements of the labour market and industry.
 Education that allows students to learn at their own pace.
 Learner-centred education.
 Competency-based learning keeps each student from getting bored.
 Competency-based learning allows each student to work at his or her own pace.
 A greater mastery of the subject, achieved in a less frustrating and more invigorating way. Students move at their own pace through the curriculum.

In summary, it is argued by the proponents of CBE that Competency-


based education (CBE) is a new model in education that uses learning,
not time, as the metric of student success. This student-centered,
accelerated approach redefines traditional credit-based requirements in
learning and stresses competencies derived from the skills proven to be
the most relevant by educators and employers.

With competency-based education, institutions can help students complete credentials in less time, at lower cost, with a focus on real-world learning that leads to greater employability. This versatile model benefits the student, the instructor, the institution, and the economy. Everything sounds good with CBE, but…

Where has CBC succeeded?

CBE has been successfully implemented in South Korea, Japan, Finland, and the Netherlands. This is primarily because of the culture of the people. In South Korea children spend up to 14 hours in school (8.00 am to 11.00 pm) per day. Since most children do not get home until midnight, dinner is served in school. Why? To get into a good college.

South Korea Education Structure

1. Pre-school (3-6 year-old children) = 2 years (optional)
2. Primary Education = 6 years
3. Lower Sec (Middle School) = 3 years
4. Upper Sec (High School) = 3 years:
   Academic Stream (62%)
   Vocational Stream (38%)

Where has CBC failed?

CBE has had little success, if any, in Africa. South Africa and Malawi tried it and abandoned it. It was tried for 12 years in South Africa. Why did it fail in South Africa?

 It involves too much administration and record keeping.
 Very expensive.
 Exodus of teachers. Many teachers left the teaching profession.
 For 12 years, teachers never understood how to implement CBE.
 Assessment is very subjective.

Challenges of CBC
1. Failed system
It is also important to note that the proposed Competence-Based Curriculum was tried for 12 years in South Africa, where it failed and was abandoned. In South Africa it was called Outcomes-Based Education (OBE). It was also tried in Malawi and abandoned.
The proposed Kenyan CBE was borrowed from Japan and South Korea. Kenya does not have the culture of these countries, and it is wrong to assume that we will succeed. Do we have the Asian culture that has been responsible for the success of CBE? No.

2. Assessment of students in CBC is subjective

In Kenya, 70% of assessment will come from CA (continuous assessment). The proposed competence-based curriculum proposes the replacement of national examinations, as the tool for certifying the achievement of an educational level, with continuous assessment. As a country we are at a stage where the government does not respect and trust teachers. The invigilation and the marking of the 2016 KCSE examinations are a clear demonstration of this mistrust. In pointing out his level of mistrust of teachers, the CS in charge of Education said, "With the use of ICT, we eliminated people who were changing marks" (Standard, January 12, 2017: p.8). The people being referred to are teachers. These are the same people that the proposed CA system will rely on. Will ICTs be able to check CA awards?

Is this CA policy direction not paradoxical? Will the teachers' assessment be a fair tool for assessment in a country with limited resources? Will this approach eliminate competition in the job market place? What about corruption? What about the "halo effect"?
The use of CA works well in countries where recruitment is not strictly based on certificates but on what a candidate can do. This approach failed when CA was used by KNEC in the Primary Teachers Examinations in the early 1980s. At that time 40% of the final grade came from CA and 60% from the KNEC examination. Teacher trainees were passing CA but failing the KNEC examinations.
3. Low public participation. This proposed system is being driven politically. No sessional paper was produced before rolling it out. This is expected in June 2019, three years after its implementation.
4. Too futuristic and impractical to implement in Kenya in its current form.
5. Expensive to implement.
6. Overloaded syllabus. Between 11 and 12 subjects to be taken.
7. Pupils' progression from primary to secondary unclear.
8. Teacher-based subjective assessment in determining movement from primary to secondary is biased, and hence pupils cannot go to their school of choice in the absence of a standard, unifying national examination.
9. Damages national integration because children will remain in their neighborhoods.
10. Demands literacy of parents.

LECTURE 7
Welcome to lecture 7.

PSYCHOMETRIC CHARACTERISTICS OF A
GOOD TEST

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 List and describe the characteristics of a good test.
 List the factors that affect validity of a testing
instrument.
 List the factors that affect reliability of a testing
instrument.
 Explain the different ways of estimating reliability.

WHAT MAKES A TEST GOOD?

Achievement tests, as measuring instruments, have unique characteristics


by which their usefulness is judged. One principle, however, comes first
in any discussion of what makes a test good. Even the very best test is
good only when used for its specific purpose with the kinds of students for
whom it was intended. A good test for standard 6 is a bad test for
standard 3.
There are two broad considerations of a "good test", namely:
(i) Practical Considerations
From a practical standpoint, a test can be said to be good if it has the following psychometric characteristics:


 Relevance to the target students.
 The clarity of the instructions for administering the test;
 The clarity of the guidelines for interpreting the results;
 Economy in the time it takes to administer, score, and interpret it. Imagine a test that takes 8 hours.

(ii) Technical Considerations

From a technical standpoint, a test can be said to be good if it has the following qualities:
 Validity
 Reliability
 Objectivity
 Difficulty and Discrimination
 Comprehensiveness
 Efficiency
 Fairness
 Norms

1. VALIDITY
A test is said to be valid when it measures what it is intended to measure.
That is, a good test measures what it is intended to measure.
What are you looking for when you are establishing the validity of an
instrument? You are looking for:
i. Trustworthiness of the instrument. Is the instrument trustworthy?
ii. Credibility of the instrument. Is the instrument credible?
A valid classroom test measures what has been taught (or should have
been taught). There are several aspects of validity:
These are:
 Content validity
 Construct validity
 Face validity

 Concurrent validity
 Predictive validity
Two of these validities are of particular importance with respect to teacher-made tests: content validity and construct validity.

Content Validity
This is the most important validity for practicing teachers. It measures the
extent to which a test adequately covers the syllabus to be tested.
Content validity refers to the extent to which a test "covers" the content.
 If the test does not cover the content that has been taught, then it
is not valid.
 An essay test intended to measure knowledge is likely to lack content validity because of the severe limitation on the number of topics which can be included in one such test. In other words, the sample of possible learning is small. Hence an essay test lacks content validity. It lacks content balance.
 A valid test provides for measurement of a good sampling of
content and is balanced with respect to coverage of the various
parts of the subject.

FACTORS THAT AFFECT CONTENT VALIDITY


To achieve good content validity in a test, consideration must be given to:
(i) Length of test
A short test cannot adequately cover a year's work. A syllabus needs to be sampled and a representative selection made of the most important topics to be tested.

(ii) Topic Coverage


Test questions are prepared in such a way that they reflect the way the topic was treated during the course. How many hours were spent teaching the topic? 8, 4, or 2? More test items are drawn from topics where more hours were spent.

(iii) Test blueprint or specifications


Better content validity can be achieved by the use of a specification grid.
The teacher has to decide on the weighting (% of the total marks) to be
attached to each ability and the most suitable number of questions;
namely:
 Recall questions. How many?
 Knowledge questions. How many?
 Application questions. How many?
 Analysis questions. How many?
 Evaluation questions. How many?
We will discuss this further in the test construction lecture.
(iv) Test Level
Another aspect of validity that concerns teachers has to do with the level of the test. Example: a test on Kenya's colonial history. A single test of Kenya's colonial history is not equally valid for learners/students in:
[What strategies did freedom fighters use to destabilize the colonial regime?]
 Form 3
 Standard 8
 University
Why?
 Course objectives are different.
 Course coverage is different.
 Learners' abilities are different.
Thus validity is specific to purpose, to subject matter, to objectives, and to learners/students; it is not a general quality.
(v). Teacher's Bias
A teacher is human and this means the test may be open to human errors. The teacher may influence the content of the test because of his/her expectations regarding the results.


In primary schools, where teachers are compared at the end of the term
on the performance of their classes, teachers may set easier questions for
their classes in order that their pupils perform better.
Construct Validity
This is another aspect of validity that is important in teacher-made tests. It refers to the kinds of learning specified or implied in the course/learning objectives. That is, it is based on learning objectives.
For example, if the course learning objective specified that at the end of
the course the learner must:
i. Identify four characteristics of living things, test tasks are required
to measure identification of these characteristics. Each kind of
learning objective must be tested to provide a valid measurement of
achievement.
ii. Identify the methods freedom fighters used to destabilize the colonial regime.
A test which measures only knowledge (e.g. the ability to recall important historical events) will lack validity if the objectives specify other kinds of learning.

Face Validity
 The test should look as if it is testing what it is intended to test. This, however, is only the starting point.
 Face validity describes how well a measurement instrument appears, judged by its appearance, to measure what it was designed to measure. For example, a test of mathematical ability would have face validity if it contained math problems.
 A test is said to have face validity if it appears to be measuring what it claims to measure.
For example:
 If we are trying to select pilots from highly trained personnel, face-valid tests of rapid reaction time will ensure full cooperation, because subjects believe them (the tests) to be valid indicators of flying skills.
 If, however, a test required them to make animal noises or add up numbers while distracted by jokes, many subjects would refuse even if the tests were valid. They would think this is not an appropriate test for them (pilots).
 If you set a test for STD 3 or FORM 3 and give it to another teacher of STD 3 or FORM 3 to look at whether it is appropriate for the level, the expert opinion provided is taken as face validity. On the face of it, it looks like it is appropriate for STD 3 or FORM 3.
 Face validity is a weak form of validity, in that an instrument with face validity may still lack real validity.

Concurrent Validity
 This is where test results are compared with another measure of the same abilities taken at the same time or at about the same time. For example, comparing mock results and actual KCSE results. These examinations are taken at about the same time, one in July (Mock) and the other in November (KCSE). In the 1970s, mock results were used to select students to join Form 5 in January, before the KCE results were released, on the belief that mock was a good measure of the final examination, i.e. that it had good concurrent validity.

[Illustration: when concurrent validity is good, a candidate's two observations (O1 and O2) are both high or both low.]


[Figure: three score distributions, labelled (a), (b) and (c).]
A positively skewed distribution (c) reflects a very difficult test while a
negatively skewed distribution (a) reflects an easy test.
(iv) Length of a Test
The length of a test affects reliability. A very short test (of five items, for example) cannot spread the scores sufficiently to give consistent results. Five items are too few to provide a reliable measure. In general, the longer the test, the more reliable it is.
(v) Erratic or Inconsistent Marking/Scoring
If the markers are erratic in scoring, the award of scores will be unreliable. Inconsistency in scoring leads to low reliability of results. This is why KNEC trains examiners on marking, so that if the same test or script is marked by different examiners, or even when the same examiner marks the same test at different times, the scores will be similar. See the article of September 13, 2014 on "Train every teacher on setting, marking exams".


(vi). Testing Environment

Reliability requires a noise-free environment. If the testing environment is noisy, it affects reliability.

(vii). Number of Performers. Large numbers are required to establish reliability.

(viii). Forgetting and Fatigue. Between the first testing and the second testing a candidate may forget, because of interruptions between the two situations, or may become tired.
(ix). "Halo Effect". Examiners allowing their knowledge of the candidates' index numbers, ability, position in class or ethnic background to influence their awarding of marks or their judgment. [See the Daily Nation article of Sept. 13, 2014 below.]
(x). "Leniency/Severity Errors". These are common in essay tests. This is where examiners/raters give ratings that are consistently too high or too low.
(xi). "Error of Central Tendency". Some examiners tend to avoid extreme categories or high scores, concentrating instead on categories around the midpoint or average of the scale. This is called the error of central tendency. It also occurs in essay tests.


["Halo Effect" and subjectivity: newspaper clipping (Daily Nation, Sept. 13, 2014) referred to above.]


ESTIMATING THE RELIABILITY OF AN


INSTRUMENT/TEST
Test reliability can be estimated by use of several procedures. The
following are the statistical procedures/methods of estimating the
reliability of an instrument/test.
1. Test-Retest Reliability

The procedure for estimating test-retest reliability involves administering the same test twice to a group of students and correlating the two sets of scores. A correlation coefficient indicates the amount of agreement between the two sets of scores. The higher the (positive) correlation derived, the higher the reliability estimate.
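As an illustration only (not part of the original notes), the sketch below estimates test-retest reliability as the Pearson correlation between two administrations of the same test; the pupil scores are invented for the example.

```python
# Minimal sketch: test-retest reliability as the Pearson correlation between
# scores from two administrations of the same test (invented data).
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

first_sitting = [55, 62, 70, 48, 80, 66, 59, 73]    # same pupils,
second_sitting = [58, 60, 72, 50, 78, 68, 61, 75]   # same test, later date

print(f"Test-retest reliability estimate = {pearson(first_sitting, second_sitting):.2f}")
```

The same correlation approach also serves for the parallel-forms and inter-scorer estimates below, where the two score lists come from Form A and Form B, or from two independent examiners.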
2. Parallel Forms Reliability

This is also called "alternative forms reliability" or "equivalent forms reliability".

An estimate of parallel forms reliability involves administration of two


forms of the same test to the same participants. If the scores on the two
forms of the same test are identical or nearly identical, parallel forms
reliability has been demonstrated. Parallel forms of a test are developed
in such a way that no matter which form (variation) of the test a person
completes, the score should be the same.

One student can take one form of a test, and the students sitting to the
right and left could have different variations of the same test. None of
the three students would have an advantage over the others; their
respective scores would provide a fair comparison of the variable being
measured. For example, measuring mathematical ability of standard 8
pupils.


Form A: Mathematics Test        Form B: Mathematics Test

The two forms, A and B, are developed from the same curriculum and subjected to the due process of the development of a good test, and the two tests can be taken concurrently or at different times.

3. Inter-Scorer Reliability

This is also called “inter-rater reliability”. An estimate of inter-scorer


reliability involves two independent scorers scoring the qualitative content of the learners' responses, e.g. learners' essays. The two scores from the two independent examiners are statistically compared. If there is high agreement, then the scoring is reliable. This is used by KNEC in essay tests during co-ordination of examiners (dummy marking) and during the marking of live scripts.

4. Inter-Observer Reliability

Where the measurement involves observation rather than a paper-and-pencil test, an estimate of the reliability of the measurement process is required. Inter-observer reliability estimates the degree to which two or more observers agree in their measurement of a variable. For example, a variable to be measured through observation is "aggression" among pre-school boys.

Example: Two observers went to Kilimo Nursery School to observe aggression among the pre-school boys and recorded their observations as follows:

Time    Observer 1   Observer 2   Total Observations   Variance
10.00   /            //           3                    1
10.01   /            /            2                    -
10.02   //           /            3                    1
10.03   //           ///          5                    1
10.04   ///          //           5                    1
10.05   /            /            2                    -
10.06   //           //           4                    -
10.07   /            /            2                    -
10.08   //           ///          5                    1
10.09   //           //           4                    -
Total   17           18           35                   5

Aggression was measured by such indicators as: kicks, destroys, fights, hurts, slaps.

Inter-observer reliability = (No. of agreed observations ÷ Total no. of observations) × 100

Inter-observer reliability = (30 ÷ 35) × 100 = 85.7%

The higher the % agreement between the observers or interviewers, the


greater the reliability.
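A minimal sketch (not part of the notes) that reproduces the calculation above: "agreed observations" are taken as the total tallies minus the per-interval differences between the two observers, which gives 30 out of 35, or 85.7%.

```python
# Inter-observer reliability as percentage agreement, using the tallies
# from the Kilimo Nursery School example above (one entry per minute).
observer1 = [1, 1, 2, 2, 3, 1, 2, 1, 2, 2]
observer2 = [2, 1, 1, 3, 2, 1, 2, 1, 3, 2]

total = sum(observer1) + sum(observer2)                             # 35
disagreed = sum(abs(a - b) for a, b in zip(observer1, observer2))   # 5
agreed = total - disagreed                                          # 30

print(f"Inter-observer reliability = {agreed / total * 100:.1f}%")  # 85.7%
```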


5. Inter-Item Reliability

Another way of estimating the reliability is to assess inter-item reliability.


Inter-item reliability is the extent to which different parts of a
questionnaire, or test designed to assess the same variable attain
consistent results. Scores on different items designed to measure the
same construct should be highly correlated.

There are two approaches to estimating inter-item reliability. These are:

i. Split Half Reliability

This involves splitting the test into two halves and computing coefficient
of reliability between the two halves (odd numbered questions and even-
numbered questions).

This is one of the measures of internal consistency of a test; it is estimated by comparing two independent halves of a single test. This procedure gives the correlation between scores on the odd-numbered and the even-numbered items of a single test.

Split half reliability estimates are widely used, because of their simplicity.
The procedure is that you split the test into two halves and compute
coefficient of reliability between the two halves.
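The sketch below is an illustration only (not from the notes): each pupil's scores on the odd-numbered and even-numbered items are totalled and correlated. The Spearman-Brown step at the end is the adjustment commonly applied to estimate the reliability of the full-length test from the half-test correlation; the item scores are invented.

```python
# Split-half reliability: correlate odd-item and even-item half scores.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) *
                  sqrt(sum((b - my) ** 2 for b in y)))

# Each row is one pupil's item scores (1 = correct, 0 = wrong) on a 10-item test.
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
]

odd_half = [sum(pupil[0::2]) for pupil in scores]    # items 1, 3, 5, ...
even_half = [sum(pupil[1::2]) for pupil in scores]   # items 2, 4, 6, ...

r_half = pearson(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown correction for full length
print(f"Half-test correlation = {r_half:.2f}; full-test estimate = {r_full:.2f}")
```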

ii. Coefficient Alpha (α)

This is also called internal consistency or Cronbach's alpha (α). Cronbach's alpha involves evaluating the internal consistency of the whole set of items.

 An evaluation of internal consistency is most often used where we have created a multiple-item questionnaire to measure a single construct variable, like intelligence, need for achievement or anxiety. The individual items on a standardized test of anxiety will show a high degree of internal consistency if they are reliably measuring the same variable.

 Internal consistency is high if the questions measure the same variable.
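As a sketch only (not given in the notes), Cronbach's alpha can be computed as alpha = k/(k-1) * (1 - (sum of item variances)/(variance of total scores)); the five respondents and four questionnaire items below are invented.

```python
# Cronbach's alpha for a multiple-item questionnaire measuring one construct.
def variance(values):            # population variance
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Rows = respondents, columns = items of an anxiety questionnaire (invented data).
data = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]

k = len(data[0])                                                   # number of items
item_vars = [variance([row[i] for row in data]) for i in range(k)]
total_var = variance([sum(row) for row in data])

alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```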

6. Kuder-Richardson (K-R). This measures internal consistency. It is widely used to estimate test reliability from one administration of a test. The correlation is determined from a single administration of a test through a study of score variances.
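The notes do not say which Kuder-Richardson formula is intended; the sketch below (an assumption, not from the notes) uses KR-20, which applies to items scored right/wrong: KR-20 = k/(k-1) * (1 - (sum of p*q)/(variance of total scores)), where p is the proportion passing an item and q = 1 - p. The data are invented.

```python
# KR-20 internal consistency from one administration of a right/wrong-scored test.
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Rows = examinees, columns = items (1 = correct, 0 = wrong); invented data.
scores = [
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
]

k = len(scores[0])
pq = []
for i in range(k):
    p = sum(row[i] for row in scores) / len(scores)   # proportion correct on item i
    pq.append(p * (1 - p))

total_var = variance([sum(row) for row in scores])
kr20 = k / (k - 1) * (1 - sum(pq) / total_var)
print(f"KR-20 = {kr20:.2f}")
```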

DIFFERENCES BETWEEN RELIABILITY AND VALIDITY

i. Can a test/instrument be reliable, but not valid?


ii. Can a test/instrument be valid, but not reliable?

1. Reliability

Reliability is always a statement of probability.

Example 1: Inter-scorer reliability. Like asking a question: What is the


probability that the two independent examiners will agree on answers
given by a candidate?

a) Yes, the probability will be high if the two examiners are trained so that they can mark consistently.
b) No, the probability will be low if the two are not trained.

Example 2: Reliability of KZY 788. What is the probability that my old


KZY 788 will reach Nairobi from Njoro without breaking down?

i. Yes, probability is high, if the car is well serviced – servicing


increases reliability.
ii. Probability is low if the car is not well serviced.


Example 3: You know your actual weight to be 92 kg. You take your
weight three times in a day using the same machine and you find:

Morning = 85 kg

Lunch = 92 kg

Evening = 87 kg

Decision: Your scale is unreliable. The scale should read 92 kg at all times, whenever you step on it.

Example 4: If your three consecutive weights are:

Morning = 85 kg

Lunch = 85 kg

Evening = 85 kg

Decision: The scale is reliable (consistently giving 85kg), but not


correct/valid. The scale does not have to be right/correct to be reliable;
it just has to provide consistent results. This instrument is reliable without
being valid for the purpose of giving accurate weight.

Which is more important in a test? Validity or reliability?

"A valid test is always reliable, but a reliable test is not necessarily valid."
 A good test is a valid test, and a test is considered to be valid if it in fact measures what it purports to measure. A test of intelligence is a valid test if it truly measures intelligence.
 If the instrument/test is valid, it must be reliable. If every time
pupils take the same test, they get different results, the test is not
able to predict anything. However, if a test is reliable, that does not
mean that it is valid.

 Reliability is a necessary, but not sufficient, condition for validity. A valid instrument must have reliability, but reliability in itself does not ensure validity; that is, reliability is said to be a necessary but not sufficient condition for validity.

2. Validity

Validity is a statement of suitability: whether something meets the requirement or is fit for the purpose. E.g. is the test meeting the requirements or objectives of the course?

Illustration by use of a shotgun


 The purpose of a gun is to kill.
 A shotgun is valid if it kills. However, it might not be reliable because it does not hit the target all the time.
 If every time we use the same gun to shoot at the target we get different results (not hitting the bull's-eye), the gun is not able to predict the hitting of the bull's-eye. Hence, the gun is valid with respect to killing (it does what it is intended to do), but not reliable in hitting the bull's-eye.

Seeing validity as an archery target and reliability as shots at the target:


Target A: Poor validity, but good reliability.
Target B: Shots within the target area, but not hitting one point.
Target D: Good validity and good reliability.
Source: Google, "validity as archery target".

EXAMPLES: Where the test is reliable but not valid.


1. Homework
Homework gives consistent results, but the homework itself is not relevant to the classroom lesson (what was taught). This means the test is reliable but not valid.

2. Measuring Intelligence
Suppose you want to measure the intelligence of smart students and you decide to use a tape measure to measure the circumference of their heads in centimeters, and you consistently obtain the same values for the heads of these students; that is, the tape measure is reliable. But using a tape measure is not a valid measure of intelligence, because we do not use a tape measure to measure intelligence. We use an intelligence test to measure the intelligence of people.

Which is more important in a test? Validity or Reliability?


LECTURE 11
Welcome to lecture 11

TEST SPECIFICATIONS AND CONSTRUCTION

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Define table of specifications.
 Describe the role of table of specifications in test
construction
 List and discuss the steps/procedure in test
construction.
 Explain the merits and demerits of essay and objective tests.
 Briefly describe the guidelines for the construction of essay and objective tests.

What Content-Ability Specifications Look Like (Table of


Specifications)
Probably the most typical set of specifications is a two-dimensional grid with the subject-matter topics listed on one axis and the abilities to be probed on the other. Typically, the axes are orthogonal to each other; that is to say, they are independent of each other, to the extent that any ability listed can be tested in connection with any of the content categories.

The number, nature, and specificity of the categories will depend on the purpose of the test. Although the two-dimensional grid for content and abilities is most common, there will be cases in which three or more are


needed. For example, in history it may be important to ensure that certain time periods are covered by stipulated proportions of the test questions. Unfortunately, it is not feasible to put workable three- or more-dimensional grids on paper, so it may become necessary to have two different grids to be used in conjunction with each other.

Once the list of topics and abilities has been decided on, the next task is to determine the relative emphasis to be given to each topic and ability, and to enter into each cell either a percentage of the test or the actual number of questions to be tested in that cell. It may be that certain topics by their nature are essentially limited to certain abilities. The test plan may also lead to insights into what has previously gone untested, and to ingenious solutions or new approaches to writing items that test what may long have been considered untestable in an objective format.

What is a test? (Discussed earlier)

A test is a device which we can use to sample the candidate's/student's behaviour. The common kind of test that the teacher is used to is the paper-and-pencil (paper-and-pen) test, i.e. one in which the pupil is required to write or mark his answers on paper. However, tests may take various other forms. In some cases the pupil may indicate his answers orally (oral tests); in others he may be required to carry out certain activities during which he is observed and scored by an observer.
- A test must be in harmony with instructional objectives and subject
content. To be sure that these are achieved, the preparation of a
test should follow a systematic procedure.

Test Construction Procedure


1. State general instructional objectives and define each instructional
objective in terms of specific types of behavior students are
expected to demonstrate at the end of the exercise.

2. Make an outline of the content to be covered during the instruction.


3. Prepare a table of specifications or test blueprint which will describe the nature of the test sample.
4. Construct test items that measure the sample of the candidates' behaviour specified in the test blueprint.

1. STATING BEHAVIOURAL OBJECTIVES


A behavioural objective is also called a performance objective. This is a statement that specifies what observable performance the learner should be engaged in when we evaluate the achievement of the course objective. Behavioural objectives must be stated in action verbs.
Different experts recommend different approaches to writing behavioural objectives. One recommendation that is fairly simple to follow is that a statement of a behavioural objective should consist of four parts, as follows:
(i) The learner (the pupil)
(ii) An action verb (states)
(iii) A content reference (e.g. four characteristics of living things)
(iv) A performance level
Written properly, the behavioural objective would read:
"The pupil should be able to state four characteristics of living things."
Given this kind of statement, it is easy to write a test to elicit the behaviour.
In stating the "action verb" it is useful to note that certain types of verbs are not appropriate. These are verbs that represent actions that cannot be readily observed or that have ambiguous meanings, e.g. understand, appreciate, feel, intend. These verbs are not considered behavioural because one cannot observe or measure a person "understanding" or "appreciating". Below are examples of action verbs that are appropriate for stating measurable behavioural objectives.


describe make illustrate predict


define recognize construct infer
measure identify draw repeat
state classify build write
discuss read recall make

In KNEC, the stating of behavioural objectives is the function of the Subject Examination Panels. These statements are incorporated in the Regulations and Syllabuses which the Kenya National Examinations Council publishes.

2. OUTLINING THE CONTENT


In order to ensure that a test adequately samples the subject matter of any discipline, it is essential to make an outline of the content to be examined. The Kenya National Examinations Council publishes for each examination a syllabus that lists, for individual subjects, what content areas it will test. Again, the task of selecting the subject matter is the responsibility of the Subject Examinations Panel. It should be pointed out that curriculum development is the responsibility of the Kenya Institute of Education.
3. TABLE OF SPECIFICATIONS
The purpose of a table of specifications is to ensure that the test covers
all the objectives of the instruction. A table of specifications or a Test
Blue Print is a two dimensional table with the content objectives listed
along one dimension and the behavioral performance / content or
instructional objectives listed along the other. Numbers are then inserted
in the cells so created to indicate how many test items should be set on
each behavioral and content objective.

In allocating items to the different cells, there is no rule of thumb. All that a test constructor must avoid is producing an imbalanced paper.


The weighting will be reflected in the behavioural objectives. The weighting is arbitrary, as the decision lies with the individual or a group of individuals.

Once the purpose of a test has been determined, the teacher or test developer has to make two decisions, namely:

(i) Decide on the weight to be given to each topic covered in the


course. The test items must be balanced with respect to relative
importance of the topics.
(ii) A second decision relates to the kinds of learning to be tested. How
much weight should be given to knowledge, comprehension,
application, analysis, synthesis and evaluation?

Once these two decisions have been made, the "specifications" for a particular test can be drawn up. The "specifications" of a test are presented in a table called a "Table of Specifications" or blueprint. A table of specifications is a two-dimensional chart with the content (topics) as one dimension and behavioural performance, or kinds of achievement, as the other.

Assume that a teacher wants to develop 50-item biology objective test.

Content/Topics   Knowledge   Comp.   Application   Analysis   Synthesis   Evaluation   Total
Topic 1          2           2       0             0          0           1            5
Topic 2          3           2       2             1          1           1            10
Topic 3          3           1       3             2          2           2            13
Topic 4          2           5       4             2          0           2            15
Topic 5          2           2       1             0          1           1            7
Total            12          12      10            5          4           7            50

A table of specifications is used for the design and development of objective tests.
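As an illustrative sketch only (not part of the notes), a blueprint like the one above can be kept as a simple mapping and checked so that the cell allocations add up to the intended test length; the topic labels and counts mirror the 50-item biology table above.

```python
# Hypothetical sketch: represent the 50-item biology blueprint above and check
# that the cell allocations add up to the intended test length.
blueprint = {
    "Topic 1": {"Knowledge": 2, "Comp.": 2, "Application": 0, "Analysis": 0, "Synthesis": 0, "Evaluation": 1},
    "Topic 2": {"Knowledge": 3, "Comp.": 2, "Application": 2, "Analysis": 1, "Synthesis": 1, "Evaluation": 1},
    "Topic 3": {"Knowledge": 3, "Comp.": 1, "Application": 3, "Analysis": 2, "Synthesis": 2, "Evaluation": 2},
    "Topic 4": {"Knowledge": 2, "Comp.": 5, "Application": 4, "Analysis": 2, "Synthesis": 0, "Evaluation": 2},
    "Topic 5": {"Knowledge": 2, "Comp.": 2, "Application": 1, "Analysis": 0, "Synthesis": 1, "Evaluation": 1},
}
intended_length = 50

topic_totals = {topic: sum(cells.values()) for topic, cells in blueprint.items()}
grand_total = sum(topic_totals.values())

for topic, n in topic_totals.items():
    print(f"{topic}: {n} items")
print(f"Grand total = {grand_total} (intended {intended_length})")
assert grand_total == intended_length, "Blueprint does not match the intended test length"
```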

4. CONSTRUCTION OF TESTS
In the school setting the most convenient tests are paper and pencil tests
– or written tests. Such tests are commonly of two types:
(i) Essay tests and (ii) Objective type tests

ESSAY TESTS
The essential feature of an essay test is that it is open-ended and each candidate may present his own answer in his own particular style.

WRITING ESSAY TESTS


In writing essay items, the test constructor/developer needs to:
(i) Identify the topic to be tested.
(ii) Must be anchored on instructional objectives.
(iii) Be clear on what he/she would like to test. He/she should
identify components of the topic and decide on what aspect(s)
he/she would like to examine the students on.
(iv) Frame the question in simple and direct language.
(v) Require students to answer all items.
(vi) Present one task at a time.
Below is an example of an essay question that presents one task at a time.

Example:
(a) Name two sources of support which helped Britain during the Mau Mau war of 1950-1959.
(b) Why did the British lose the war?

(c) What were the effects of the war on:
(i) the British?
(ii) the African communities?
In the example above, the topic was Mau Mau War of 1950-1959. The
test constructor wished to examine the candidates on the following
aspects: (i) the allies of Britain, (ii) the reasons for the defeat of the
British and (iii) the impact of the War on both the colonizer and the
colonized people. The items are expressed in simple and direct language;
and the candidates are presented with one task at a time.

Merits of Essay Tests


Essay tests are particularly useful for testing:
(a) The ability to recall rather than simply recognize information.
(b) The ability for expression and communication.
(c) The ability to select, organize and integrate ideas in a general attack on problems.
(d) The ability to analyze, synthesize, and evaluate.
(e) They are not susceptible to correct guesses.
(f) They measure creative abilities, such as writing talent or imagination.

Demerits of Essay Tests

However, their uses are restricted by the following limitations:

(i) The scoring tends to be unreliable.
(ii) The 'halo effect' is more operative in essay tests.
(iii) "Leniency/severity errors" are common in essay tests. This is where examiners/raters give ratings that are consistently too high or too low.


(iv) "Error of central tendency". Some examiners tend to avoid extreme categories or high scores, concentrating instead on categories around the midpoint or average of the scales. This is called the error of central tendency. It also occurs in essay tests.
(v) While the essay test saves time in setting, the scoring is time consuming.
(vi) A limited sampling of achievement is obtained; the test covers a limited range of the content.
(vii) One issue relating to essay tests is whether, and how much, to count grammar, spelling and other mechanical features. If you do count these factors, give students separate grades in content and in mechanics so that they will know the basis on which their work is being evaluated. Examples are given below. These are real examples drawn from EPSC 311 March 2018 examination scripts:
 Litrate for literate
 Wrote learning for rote learning
 Negletion
 Coatching for coaching
 Privillaged environment
 Flock for block
 Compitend for competent
 Diviation for deviation
Mechanically, you as an examiner know what the candidate is saying.

Scoring procedures for essay tests can be improved by:

(i) Using a marking scheme.
(ii) Coordination of examiners.
(iii) Sampling of marked scripts by senior examiners.


OBJECTIVE TESTS
An objective test is one so constructed that, irrespective of who marks the answers, the score for a particular candidate is always the same. The objectivity really refers to the marking of the test.
such objectivity, objective tests usually have pre-coded answers. In any
particular item, there has to be one and only one correct answer.

FORMATS FOR OBJECTIVE TEST ITEMS


Three main formats are used in constructing objective test items. They
are
(i) True – False
(ii) Matching
(iii) Multiple – Choice
TRUE FALSE ITEMS
In these items the examinee must decide whether a given statement is
true or false. For example:
1. The first President of KANU was James Gichuru. T/F.
2. History is about the past. T/F
3. Rift valley is gradually sinking. T/F
4. Sea level is rising. T/F

MATCHING ITEMS
A matching item consists of two lists (of words, phrases, pictures or other symbols) and a set of instructions explaining the basis on which the examinee is to match an item in the first list with an item in the second list. The elements of the list that is read first are called premises, and the elements in the other list are called responses. It is possible to have more premises than responses, more responses than premises, or the same number of each. In the example of a matching exercise that follows, the premises appear in the left-hand column, with the responses at the right, but in some cases the responses may be placed below the premises.
The primary cognitive skill that matching exercises test is recall.

List I (Premises)       List II (Responses)

1. KANU                 ( ) 2007
2. NARC                 ( ) 2013
3. PNU                  ( ) 1960
4. JUBILEE              ( ) 2002
5. KADU                 ( ) 1925

MULTIPLE CHOICE ITEMS


Multiple-choice questions are generally considered to be the most useful of the objective-type items. A multiple-choice item consists of a stem plus two or more alternatives (options), one of which meets the requirement demanded by the stem. The item stem may be in the form of:
(i) A question
(ii) A complete statement
(iii) An incomplete statement

STRUCTURE OF MULTIPLE – CHOICE ITEMS

A multiple-choice test item consists of two parts:

(i) A problem, called the stem.


(ii) A list of suggested solutions, called alternatives/options, one of which meets the requirements demanded by the stem.

The stem is in the form of:

(i) A question
(ii) A complete statement
(iii) An incomplete statement

List of alternatives contains:

(i) One and only one correct answer


(ii) Three distractors (Incorrect alternatives)

Question Example of a Stem

Who among the following people chaired the Kenya Constitution Review
Commission?

1. Githu Muigai
2. James Orengo
3. Yash Pal Ghai
4. Paul Muite
5. Raila Odinga

Which is the most complex level in the taxonomy of the cognitive domain?

a. Knowledge
b. Synthesis
c. Evaluation
d. Analysis
e. Comprehension

Sources of good distractors include:

(i) Common misconceptions and common errors.
(ii) A statement which is itself true, but which does not satisfy the requirements of the problem.
(iii) A carefully worded incorrect statement.

Complete Statement Example of a Stem

In order to sell fish in a village market, a trader requires a license. The


license is obtained from:

a. Police officer in the area.


b. County officer in the area.
c. County officer or the Health Inspector in the area.
d. The leading businessman in the area.

Incomplete Statement Example of a Stem

The primary effect of climate change in Kenya is:

a. Reduction of mangrove forest


b. Limited increase in livestock
c. Rising water level in Rift Valley
d. Poor crop production

The term test as used in measurement is defined as:

a. A standard procedure for assessing learners.


b. Making adjustments of learners‘ abilities.
c. Device for sampling learners‘ abilities.
d. A reliable measurement instrument.

GUIDELINES FOR CONSTRUCTING MULTIPLE-CHOICE ITEMS

1. Construct each item to assess a single written objective


2. Base each item on a specific problem stated clearly in the stem


After reading the stem, the student should know exactly what the
problem is and what he or she is expected to do to solve it.
3. State the stem in positive form.
4. Keep the item short.
5. Word the alternatives clearly and concisely. This is to reduce student
confusion.
6. Keep the alternatives mutually exclusive.
7. Avoid "all of these", "none of these" and "both A and B" answer choices.
8. Keep option lengths similar.
9. Avoid cues to the correct answer.
10. Use only one correct option.
11. Vary the position of the correct option.
12. Guard against giving clues in the correct answers.
13. Avoid any tendency to make the correct answer consistently longer than the distractors.
14. Avoid "give-aways" in the distractors, for example "always", "only", "all", "never", etc.
15. Use language that is simple, direct and free of ambiguity.
16. Do not use double negatives in an item.
MERITS OF OBJECTIVE TESTS

(i) Measure a great variety of educational objectives.
(ii) Measure all cognitive domains, from simple skills (knowledge) to higher-level skills (evaluation).
(iii) Item analysis can be applied to multiple-choice items.
(iv) A student is able to answer many multiple-choice items in the time it would take to answer a single essay question; they take a shorter time to answer than essays.
(v) Their marking is free from bias and can be done mechanically; multiple-choice tests can be scored on a completely objective basis.
(vi) They enable test developers to sample a wider content area (more representative achievement).

(vii) They enable test developers to evaluate a greater variety of abilities.
(viii) Free of "halo effects" (immune to "halo effects").
(ix) Tests using them are usually more reliable than other types.
(x) The role that guessing plays in determining an examinee's score is reduced when each item is provided with several alternatives, e.g. 4 to 5, and this increases the reliability of the test.

DEMERITS OF OBJECTIVE TESTS

(i) Cannot sample the ability to communicate or express ideas.
(ii) Not totally free of the guessing factor, which reduces the reliability of multiple-choice tests.
(iii) More difficult and time consuming to write than other types of test items.
(iv) Difficult to find distractors.
(v) Do not provide a measure of writing ability (same as (i)).

Essay vs Objective Tests

Although much has been said and written about the relative merits of essay and objective test questions, it can safely be said that neither is fundamentally superior for all purposes. Both have their merits, their problems, and situations in which they are preferable. Among the major advantages of objective-type questions are that a large number of questions can be asked in a given testing time, permitting fairer and more complete sampling of the subject matter; scoring is easier and much more reliable; the questions lend themselves to item analysis; and, through pretesting of questions, test difficulty, validity and reliability can be predicted, controlled and improved.

Because KNEC examinations involve large numbers of candidates, these advantages make it almost mandatory that most of our tests be of the objective, machine-scorable variety. The major disadvantage of objective questions is that it is not easy to write good objective questions testing more than knowledge and requiring candidates to demonstrate more

sophisticated mental processes. It requires a high order of ingenuity and


creativity to write multiple-choice items that test the full gamut of
abilities. This is a constant challenge for item writers, both on the staff
and on committees involved in item preparation.


LECTURE 12
Welcome to lecture 12

ITEM ANALYSIS
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Define item analysis.
 Explain and illustrate the three aspects of item
analysis:
 Item difficulty index
 Item discrimination index
 Distractor analysis
 Calculate and interpret three aspects of item analysis.

After a test has been administered and scored, even if we adhered to the qualities of a good test, it is usually desirable to evaluate the effectiveness of the test items. This is done by studying the examinees' responses to each item. This procedure is called item analysis. The purpose of item analysis is to identify deficiencies in the test instrument. Item analysis may appear like giving medicine after death.

In principle, item analysis can be carried out on both essay and objective tests, but the techniques are much more developed for objective test items.

Items can be analyzed qualitatively in terms of their content and form,


and quantitatively in terms of statistical properties.


(a) Qualitative Analysis

For this type of analysis one requires the services of both subject-content specialists and test-construction specialists. Most of the qualitative analysis can be done before the tests are administered; for example, judging the content validity is a qualitative type of item analysis. Examining the stems of items for ambiguity is another form of qualitative analysis.

(b) Quantitative Analysis

This includes principally the measurement of such properties as the difficulty level and the discrimination power (or index) of the items, and determining the effectiveness of the distractors (distractor analysis).

When we prepare items for a test, we hope that each of them will be useful in a certain statistical way. That is, we hope that each item will turn out to be of the appropriate level of difficulty for the group, that proportionately more of the better students than the poorer ones will get it right, and that the incorrect options will prove attractive to the students who cannot arrive at the right answer through their own ability.

Item analysis uses statistical methods to identify any test items that are not working well. If an item is too easy, fails to show a difference between skilled and unskilled examinees, or is even scored incorrectly, an item analysis will reveal it. That is, item analysis information can tell us if an item was too easy or too hard, how well it discriminated between high and low scorers on the test, and whether all of the alternatives (distractors) functioned as intended. The three most common statistics or areas reported in an item analysis are:

 Item Difficulty Index


 Item Discrimination Index
 Distractor Analysis


ITEM DIFFICULTY INDEX

The item difficulty index is one of the most useful and most frequently
reported, item analysis statistics. It is a measure of the proportion of
examinees who answered the item correctly. Teachers produce a difficulty
index for a test item by calculating the proportion of students in class who
got an item correct. The larger the proportion, the more students who
have learned the content measured by the item.

Evaluating a course e.g. geography. First state the objectives of why


geography should be taught in high school. Ask the learners geographical
facts to evaluate the achievement / failure of geography.

CALCULATION OF ITEM DIFFICULTY INDEX

There are two ways of computing the item difficulty index, namely:

i. Simpler Approach for Calculating ID Index

ii. More Complex Approach for Calculating ID Index

SIMPLER APPROACH FOR CALCULATING ITEM DIFFICULY INDEX

This approach is less accurate but good for teachers in understanding the
concept of ID index in a simpler way. For example, imagine a classroom
of 40 Standard 6 students who took a test which included the item below.
What is the item difficulty of this test item? The asterisk indicates
that B is the correct answer.

Test Item: Who was the First President of KANU?

Option                 No. Choosing
A. Tom Mboya            6
*B. James Gichuru      24
C. Jomo Kenyatta       10
D. Robert Matano        0

Item Difficulty Index: the proportion of students who got an item correct.

 Count the number of students who got the correct answer: 24 students chose the correct answer.
 Divide by the total number of students who took the test.

The Difficulty Index ranges from .00 to 1.00. For this example, Difficulty Index = 24/40 = .60, or 60%. This means that sixty percent of the students knew the answer.

Interpretation of Item Difficulty Index

% Range          Difficulty Level
20% and below    Very difficult
21 - 40%         Difficult
41 - 60%         Average
61 - 80%         Easy
81% and above    Very easy
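A minimal sketch (not part of the notes) reproducing the KANU item above: 24 of the 40 pupils chose the keyed answer, giving a difficulty index of .60, which the table above classifies as average.

```python
# Item difficulty index = proportion of examinees answering the item correctly,
# using the KANU item above (key = B).
responses = {"A": 6, "B": 24, "C": 10, "D": 0}
key = "B"

n_examinees = sum(responses.values())      # 40
difficulty = responses[key] / n_examinees  # 24 / 40 = 0.60

# Classification bands from the interpretation table above.
if difficulty <= 0.20:
    label = "Very difficult"
elif difficulty <= 0.40:
    label = "Difficult"
elif difficulty <= 0.60:
    label = "Average"
elif difficulty <= 0.80:
    label = "Easy"
else:
    label = "Very easy"

print(f"Difficulty index = {difficulty:.2f} ({label})")
```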

MORE COMPLEX APPROACH FOR CALCULATING ID INDEX

In computing the item difficulty index of a test item using this approach, you need to do the following:

 First, select one-third of the examinees with the highest scores in the paper and call this the upper group, then select the same number with the lowest scores and call this the lower group.


 Second, for each item, count the number of examinees in the


upper group who selected each alternative. Make the same count
for the lower group.
 Third, estimate item difficulty by determining the percentage of
examinees that get the item right.

Assume 30 examinees took a History paper and the responses to Question 1 are as follows:

                     Alternatives (Options)
Group    n     A     B*    C     D     E
Upper    10    0     6     3     1     0
Lower    10    3     2     2     3     0

* = Correct answer.

Total number in the upper and lower groups = 10 + 10 = 20
Total selecting the correct answer = 6 + 2 = 8

Index of Difficulty (ID) = (R_U + R_L) / (n1 + n2) × 100

Where,
R_U = the number of examinees in the upper-scoring group responding correctly,
R_L = the number of examinees in the lower-scoring group responding correctly,
n1 + n2 = the total number of examinees in the upper- and lower-scoring groups respectively.

ID = (students with correct answers ÷ total students (n1 + n2)) × 100 = 8/20 × 100 = 40% = 0.40


Since difficulty refers to the percentage getting the item right, the smaller the percentage figure, the more difficult the item. The Index of Difficulty can range between 0% and 100%, with a higher value indicating that a greater proportion of examinees responded to the item correctly, and thus that it was an easier item.

Interpretation of Item Difficulty: the same table given above applies.
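As a short sketch (not in the notes), the upper/lower-group calculation above can be reproduced directly from the response counts for History Question 1: (6 + 2) / (10 + 10) × 100 = 40%.

```python
# Item difficulty from upper- and lower-scoring groups (History Question 1 above).
upper = {"A": 0, "B": 6, "C": 3, "D": 1, "E": 0}   # n1 = 10 examinees
lower = {"A": 3, "B": 2, "C": 2, "D": 3, "E": 0}   # n2 = 10 examinees
key = "B"

R_U, R_L = upper[key], lower[key]                  # correct answers in each group
n1, n2 = sum(upper.values()), sum(lower.values())

difficulty = (R_U + R_L) / (n1 + n2) * 100
print(f"Index of difficulty = {difficulty:.0f}%")  # 40%: a difficult item
```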

ITEM DISCRIMINATION INDEX

The item discrimination index is a measure of how well an item is able to


distinguish/discriminate between examinees who are knowledgeable and
those who are not, or between masters and non-masters.

It is the degree to which students with high overall examination scores


also get a particular item correct. For an item that is highly discriminating, in general the examinees who responded to the item correctly also did well on the overall test, while the examinees who responded to the item incorrectly tended to do poorly on the overall test.

Question: Did students who scored high in the history examination also get Question 1 correct?

CALCULATING ITEM DISCRIMINATION INDEX (DI)

There are actually several ways to compute item discrimination. Some of these formulae use equal numbers of upper scorers and lower scorers; others use unequal numbers:

i. A simpler approach for the ordinary teacher with limited knowledge of statistics.

ii. The point-biserial correlation. This is the most common approach, but complex for an ordinary classroom teacher. This statistic looks at the relationship between an examinee's performance on the given item (correct or incorrect) and the examinee's score on the overall test.

SIMPLER CALCULATION APPROACH FOR DI

Create two equal groups of students of upper and lower scorers:
High scores group = made up of the high scorers (or the upper half of the class) in the whole test.
Low scores group = made up of the low scorers (or the bottom half of the class) in the whole test.
For each group:
a. Calculate a difficulty index for the test item.
b. Subtract the difficulty index for the low scorers from the difficulty index for the high scorers.
The Discrimination Index ranges from -1.0 to 1.0.

Test Item: Imagine, in the KANU test item example, that 16 out of 20 students in the high group (n1) and 8 out of 20 students in the low group (n2) got the item correct. Hence:
High Scores Group: 16/20 = 0.80
Low Scores Group: 8/20 = 0.40

Discrimination Index = .80 - .40 = .40. This indicates the test item was good at discriminating between learners.

For the History Question 1 example above (6 of 10 correct in the upper group, 2 of 10 in the lower group):

Discrimination Index = R_U/n_u - R_L/n_l = 6/10 - 2/10 = 0.60 - 0.20 = 0.40

Where,
n_u = the number of those in the high-scoring group,
n_l = the number of those in the low-scoring group,
R_U and R_L = the number in each group who got the item correct.
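A short sketch, not from the notes, reproducing both worked examples above: the KANU item (16/20 versus 8/20) and History Question 1 (6/10 versus 2/10), each giving a discrimination index of 0.40.

```python
# Discrimination index = proportion correct in the upper group
#                        minus proportion correct in the lower group.
def discrimination(correct_upper, n_upper, correct_lower, n_lower):
    return correct_upper / n_upper - correct_lower / n_lower

print(f"KANU item:    D = {discrimination(16, 20, 8, 20):.2f}")   # 0.80 - 0.40 = 0.40
print(f"History item: D = {discrimination(6, 10, 2, 10):.2f}")    # 0.60 - 0.20 = 0.40
```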


Interpretation of Discrimination Index

Correlation Range    Description
.40 and above        Very good item
.30 - .39            Good item
.20 - .29            Fair item
.09 - .19            Poor item

A strong and positive correlation suggests that students who get any one question correct also have a relatively high score on the overall examination.

A negative discrimination index may indicate that the item is measuring


something other than what the rest of the test is measuring. More often,
it is a sign that the item has been mis-keyed.

Examples: Negative Discrimination Index

1. High scorers=6/20=0.30

Low scorers=18/20=0.90

Discrimination Index =0.30-0.90= - 0.60

2. High scorers=0/20=0.00

Low scorers=20/20=1.00

Discrimination Index =0.00-1.00= -1.00


Alternative Approach

Subtract the number of examinees responding correctly to the item in the lower-scoring group from the number responding correctly in the upper-scoring group, and divide by half the total number of examinees (upper + lower).

Discrimination Index = (R_U - R_L) / ((n_u + n_l) ÷ 2) = (6 - 2) / (20 ÷ 2) = 4/10 = 0.40

Ru =Number of those in high scoring group that got the item correct.

Rl = Number of those in low scoring group that got the item correct.

nu =number of those in the high scoring group.

nl =number of those in the low scoring group.

Why divide by (nu + nl) ÷ 2? Because the number of students on each side of
the dividing line is half of the class.
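A corresponding sketch of this count-based version is given below; with equal group sizes it gives the same result as the proportion-difference formula above (again, the function name is illustrative).

def discrimination_index_counts(upper_correct, lower_correct, n_upper, n_lower):
    # (Ru - Rl) divided by half of the total number of examinees
    return (upper_correct - lower_correct) / ((n_upper + n_lower) / 2)

# 6 of 10 upper scorers and 2 of 10 lower scorers got the item correct.
print(discrimination_index_counts(6, 2, 10, 10))  # 0.4, the same as 6/10 - 2/10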

Interpretation of Discrimination Index

The possible range of the discrimination index is -1.0 to 1.0. When an


item is discriminating negatively, overall the most knowledgeable
examinees are getting the item wrong and the least knowledgeable
examinees are getting the item right. A negative discrimination index may
indicate that the item is measuring something other than what the rest of
the test is measuring. More often, it is a sign that the item has been mis-
keyed.

If the discrimination index is negative, it also means that for some reason
students who scored low on the test were more likely to get the answer

correct. This is a strange situation which suggests poor validity for an


item.


COMPLEX APPROACH FOR DI CALCULATION- POINT-BISERIAL


CORRELATION

[NOT FOR DISCUSSION BECAUSE IT REQUIRES KNOWLEDGE OF STATISTICS]

DISTRACTOR ANALYSIS / ANALYSIS OF RESPONSE OPTIONS

One important element in the quality of a multiple-choice item is the quality


of the item‘s distractors. However, neither the item difficulty nor the item
discrimination index considers the performance of the incorrect response
options or distractors. A distractor analysis addresses the performance of
these incorrect response options.

Just as the key, or correct response option, must be definitely correct, the
distractors must be clearly incorrect (or clearly not the ―best‖ option). In
addition to being clearly incorrect, the distractors must also be plausible.
That is, the distractors should seem likely or reasonable to an examinee
who is not sufficiently knowledgeable in the content area. If a distractor


appears so unlikely that almost no examinee will select it, it is not
contributing to the performance of the item. In fact, the presence of one
or more implausible distractors in a multiple-choice item can make it
artificially far easier than it ought to be.

In addition to examining the performance of an entire test item, teachers


are often interested in examining the performance of individual distractors
(incorrect answer options) on multiple-choice items. By calculating the
proportion of students who chose each answer option, teachers can
identify which distractors are "working" and appear attractive to students
who do not know the correct answer, and which distractors are simply
taking up space and not being chosen by many students.

Example 1. The KANU Example


A =6/40 =.15
B =24/40 =.60
C =10/40 =.25
D =0/40 =.00
A good distractor will attract more examinees from the lower group than
the upper group. In this example D was a very poor distractor. It was
obvious to both good and poor students. In this example distractors A and
C are functioning effectively.


ABILITY GROUPS        OPTIONS
                 A     B     C     D
High Scorers     1    16     3     0
Low Scorers      5     8     7     0
TOTAL            6    24    10     0

Interpretation

The analysis of response options shows that those who missed the item
were about equally likely to choose answer A and answer C. No students
chose answer D. Answer option D does not act as a distractor. Students
are not choosing between four answer options on this item, they are
really choosing between only three options, as they are not even
considering answer D. This makes guessing correctly more likely, which
hurts the validity of an item.
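A short sketch of such a distractor analysis for the KANU item is given below (option B is the key); the data layout and names are illustrative.

# Counts of examinees choosing each option, split by ability group (key = B).
responses = {
    "High Scorers": {"A": 1, "B": 16, "C": 3, "D": 0},
    "Low Scorers":  {"A": 5, "B": 8,  "C": 7, "D": 0},
}

total = sum(sum(group.values()) for group in responses.values())  # 40 examinees

for option in "ABCD":
    chosen = sum(group[option] for group in responses.values())
    print(f"Option {option}: {chosen}/{total} = {chosen / total:.2f}")
# Option D is chosen by nobody, so it is not functioning as a distractor.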

Example 2.

In a simple approach to distractor analysis, the proportion of examinees


in the upper and lower groups who selected each of the incorrect
response options is examined. A good distractor will attract more
examinees from the lower group than the upper group. In the
example given below, distractors A and D are functioning effectively, since each attracts more lower-group than upper-group examinees.

Alternatives (Options)

Groups (n)     A     B*    C     D     E
Upper (10)     0     6     3     1     0
Lower (10)     3     2     2     3     0

* = Correct answer.

The proportion of examinees who select each of the distractors can be


informative. For example, it can reveal an item that is mis-keyed.

Whenever the proportion of examinees who selected a distractor is


greater than the proportion of examinees who selected the key, the item
should be re-examined to determine if it has been mis-keyed or double
keyed. A distractor analysis can also reveal an implausible distractor.

[SEE THE STANDARD NEWSPAPER REPORT OF OCTOBER 30, 2017 – ATTACHED]

LIMITATIONS OF ITEM ANALYSIS

 It is not commonly used in the analysis of essay items.

 It is only used when the test involves a large population of students.

 It requires the preparation of a large number of test items.

 It is not suitable for small groups of students.


LECTURE 13
Welcome to lecture 13

DEFICIENCIES IN TEACHER-MADE TESTS

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Explain the mistakes teachers make in developing
classroom tests.

By and large, teacher-made achievement tests, also referred to as locally
developed tests, are quite poor. The deficiencies include:
1. Ambiguous question. These are questions which can be
interpreted in two or more ways.
Example of a Test Item that is Ambiguous
"Flying planes is dangerous." Is the statement saying that planes which are
flying are dangerous, or that the act of piloting a plane is dangerous?
2. Excessive wording. Too often teachers think that the more
wording there is in a question, the clearer it will be to the students.
This is not always so. In fact, the more precise and clear cut the
wording, the greater the probability that the students will not be
confused.
Example of Test Item that is Excessively Worded
Define the term ―Osmosis‖. That is, what you understand by the
term ―Osmosis‖? In other words, what does ―Osmosis‖ mean?
3. Lack of appropriate emphasis. More often than not, teacher –
made tests do not cover the objectives stressed and taught by the

teacher, and do not reflect proportionally the teacher‘s judgment as


to the importance of those objectives. They are often heavily loaded with
items that only test recall.
4. Use of inappropriate item format. Some teachers use different
item formats (such as true-false or essays) because they feel that
change or diversity is desirable. This is not the basis for setting
questions.
How do you design a reliable classroom test?
1. Chance factors must be reduced to a minimum. One way is to
eliminate true false and other two-choice types.
2. Write clear instructions so that students will be measured on their
performance rather than on ability to ―figure out what the teacher
wants‖.
3. Ensure consistency in scoring by using a key, prepared in advance.
4. Test must be moderated.


LECTURE 14

Welcome to lecture 14

TEST ADMINISTRATION, SCORING AND


INTERPRETATION OF TEST RESULTS

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Explain the different procedures KNEC has put in place to provide
credible examination.

This section has been covered under KNEC.


From initiation to scoring, a test goes through three key stages, namely:
 Development
 Administration
 Processing= mechanical scoring, manual scoring and training of
examiners.
We have discussed test development under test construction or test
planning.

Test Administration
As already discussed, tests determine the destiny of the individuals and
hence the conditions under which they are administered must be fair
and uniform. In this respect, fair administration of tests takes into
considerations:
 Rehearsal in case of national examinations.
 Provision of uniform instructions on the conduct of examinations.
 Provision of the same test time to all candidates.


 Provision of testing environment that is free of noise.
 Provision of adequate lighting in the classroom and laboratories
where experiments are carried out.
 Provision of security of examination materials to avoid theft and
cheating.
 Provision of good supervision and invigilation to avoid cheating by
candidates. Cheating gives unfair advantage to those who cheat.
For reliability purposes, there is need for consistency in test
administration and scoring.

Test Scoring
Scoring is one pillar of fairness in an examination and a source of unfairness
if not well managed. There are two types of scoring systems in use in
Kenya.
 Manual scoring. Used mainly in schools and in essay types of
questions in KNEC examinations. Subjectivity/bias can be high
under manual scoring.
 Electronic scoring by use of Optical Mark Reader/ Scanner.
This is used by KNEC for scoring objective test items in KCPE and
KCSE. Objectivity is high in electronic marking/scoring. Objectivity
refers to consistency in test interpretation and scoring.
The conditions that promote fair scoring of a test include:
 Moderation of the marking scheme.
 Training of markers in the case of essay tests.
 Coordination of markers and putting in smaller teams.
 Retirement of erratic and generous markers.

Interpretation of Test Results


Fair interpretation of test results takes into considerations:
 Type of test – whether a norm-referenced test or a criterion-referenced
test.
 Test difficulty.
Because the assumption under testing is normality, scores are interpreted
in relation to the normal curve. However, there is an estimated
(hypothetical) distribution and an observed (actual) distribution of test
scores. These are illustrated below using physics and geography test
scores.

(i) Estimated (Hypothetical) Distribution

[Figure: a hypothetical distribution of test scores, with a minimum score of 200 and a maximum score of 800, plotted on a scale of scores from 200 to 800.]

(ii) Physics Test Scores

[Figure: the estimated (hypothetical) distribution of the physics test for all candidates in the KNEC standardization group, shown together with the observed distribution of the candidates who took the physics test, on a scale of scores from 200 to 800.]


(iii) Geography Test Scores

[Figure: the estimated (hypothetical) distribution on the Geography test for all candidates in the KNEC standardization group, shown together with the observed distribution of the candidates who took the Geography test, on a scale of scores from 200 to 800.]

In terms of norm-referenced testing, raw scores are interpreted in terms of


a defined group (the standardization group).


LECTURE 15
Welcome to lecture 15.

STATISTICAL ANALYSIS OF TEST SCORES

TOPIC OBJECTIVES
After this lecture, the learner should be able to:
 Describe the terms:
 Population
 Statistic
 Describe the application and interpretation of measures of central
tendency to test scores.
 Describe the application and interpretation of measures of
variability to test scores.

Tests are quantifiable measures. Statistics help in decision-making with


respect to the interpretation of results. Regardless of scale or level of
measurement inherent in a particular test, the data from that test must
be placed in a manageable and interpretable form. One way this can be
accomplished is by describing the test results in terms of statistics.

Statistical Concepts
(i) What is a sample? The smaller group of people who actually
participate in the test is known as a sample. This is a sub-set of the
population and is represented by lower case n.

[Figure: the sample shown as a sub-set within the population.]
(ii) What is a population? The entire group of people/pupils


who take the test is known as a population. This is represented in
statistics by capital/upper case N.
(iii) What is a parameter? A parameter is a numerical (number)
characteristic of an entire population. Example
1: The mean reading readiness score for all standard one pupils in
Kenya.
(iv) What is a statistic? This is a numerical (number)
characteristic of a sample. Example: If we draw a sample of Nakuru
County Standard One pupils, for example from Njoro sub-county, and
determine the mean reading score for this sample, this mean would be a
statistic. A statistic is used as an estimate of the parameter.

Descriptive Statistics
Once a large set of scores has been collected, certain descriptive values
can be calculated. These are the values that summarize or condense the
set of scores, giving it meaning. Descriptive values are used by the
teachers to evaluate individual performance of pupils and to describe the
group‘s performance or compare its performance with that of another
group.

Once you collect data from a large sample, you can do the following
things:
 Organizing and graphing test scores.
 Applying descriptive statistics.


Organizing and Graphing Test Scores

Prepare a frequency distribution table of the test scores. You can present
the test scores individually or in grouped format, in the form of a frequency
table or a histogram.

(a) For individual scores from highest to lowest (ungrouped


data).
Scores f (frequency)
96 1
92 1
90 1
88 2
86 1
84 1
75 3
73 2
70 1
60 1
58 2
56 1
_______
17
________

(b) Grouped Scores

Class Interval      f (frequency)

90 – 100            3
80 – 89             4
70 – 79             6
60 – 69             1
50 – 59             3
________
17
________

 In a grouped frequency distribution, test-score intervals are called
"class intervals".
 Decide on the "width" of the class interval.
 Class intervals must be of the same width. In the table above, the class
interval width is 10 scores (see the sketch below).
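As an illustration, the grouped frequency distribution above can be produced with a few lines of Python; here the intervals are written as 50 – 59 up to 90 – 99 so that every interval has the same width of 10 (the variable names are illustrative).

# Build a grouped frequency distribution (interval width of 10) for the 17 scores.
from collections import Counter

scores = [96, 92, 90, 88, 88, 86, 84, 75, 75, 75, 73, 73, 70, 60, 58, 58, 56]
width = 10

counts = Counter((score // width) * width for score in scores)  # lower bound of each interval

for lower in sorted(counts, reverse=True):
    print(f"{lower} - {lower + width - 1}: {counts[lower]}")
# 90 - 99: 3, 80 - 89: 4, 70 - 79: 6, 60 - 69: 1, 50 - 59: 3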

Applying Descriptive Statistics


The descriptive statistical tools we use to give meaning to test scores are
measures of central tendency and measures of variability.

(i) Measures of Central Tendency


One type of descriptive value is the measure of central tendency, which
indicates the point around which scores tend to be concentrated. These
measures describe the centre or location of a distribution. They do not
provide any information regarding the spread or scatter of the scores.
There are three measures of central tendency:
 the mode
 the median
 the mean


Mode
The mode is the score most frequently received. It is used with nominal
data. In the ungrouped scores given above the mode is 75. For grouped
data the modal interval is 70-79.
A frequency distribution can be uni-modal (one mode), bi-modal (two
modes), tri-modal (three modes) or poly-modal (many modes).
Example: 2, 2, 2, 3, 4, 6, 6, 6, 7, 8 is bi-modal, with modes 2 and 6.

Median
The median is the middle score; half the scores fall above the median and
half below. It cannot be calculated unless the scores are listed in order,
either ascending or descending. Hence the procedure for getting the median
of a distribution is as follows:
 First arrange the scores in ascending or descending order.
 Determine the position or location of approximate median

 Calculate the median of the scores.


Find the median for 66, 65, 61, 59, 53.
Position = (5 + 1) ÷ 2 = 3, so the median is the 3rd score = 61.
If the scores are 66, 65, 61, 59, 53, 50:
Position = (6 + 1) ÷ 2 = 7 ÷ 2 = 3.5. That is, the median lies between 61 and 59.
Median = (61 + 59) ÷ 2 = 120 ÷ 2 = 60.
Mean
The mean (symbolized X̄, read "X bar") is the most commonly used measure of central
tendency. It is affected by extreme scores. It is the sum of the scores
divided by the number of scores:

X̄ = ∑X ÷ n

Where X̄ (X bar) is the mean, ∑X is the sum of the scores, and n is the
number of scores. The symbol ∑ (the Greek letter sigma) means "sum of";
hence ∑X means the sum of all the scores. X represents the individual scores
and n is the number of students or number of scores.
Mean (X̄) of 66, 65, 61, 59, 53 = 304 ÷ 5 = 60.8
The mean is appropriate for interval or ratio data.
The disadvantage of the mean is that it is influenced by outliers.
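The three measures can be obtained directly with Python's statistics module, using the scores from the examples above (a brief sketch; multimode requires Python 3.8 or later).

# Mode, median and mean for the example scores in this section.
import statistics

scores = [66, 65, 61, 59, 53]
print(statistics.median(scores))   # 61
print(statistics.mean(scores))     # 60.8

modes_example = [2, 2, 2, 3, 4, 6, 6, 6, 7, 8]
print(statistics.multimode(modes_example))  # [2, 6] -> a bi-modal distribution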

(ii) Measure of Variability


A second type of descriptive value is the measure of variability, which
describes the set of scores in terms of their spread, scatter or
heterogeneity. For example, consider these sets of scores for two groups.

Group 1 Group 2
9 5
5 6
1 4
For both groups the mean and the median are 5. If you simply report that the
mean and median for both groups are identical without showing the
variability of scores, another person could conclude that the two groups
have equal or similar ability. This is not true. Group 2 is more
homogeneous in performance than Group 1. A measure of variability is the
descriptive term that indicates this difference in the spread, scatter or
heterogeneity, of a set of scores. There are two such measures of
variability: the range and the standard deviation.


Range
The range is the easiest measure of variability to obtain and the one that
is used when the measure of central tendency is the mode or median.
The range is the difference between the highest and the lowest scores.
For example:
For Group 1: Range = 9 – 1 = 8
For Group 2: Range = 6 – 4 = 2
The range is neither a precise nor a stable measure, because it depends
on only two scores- the highest and the lowest.
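For illustration, the two range calculations can be checked with a couple of lines of Python:

# Range = highest score minus lowest score.
group_1 = [9, 5, 1]
group_2 = [5, 6, 4]

print(max(group_1) - min(group_1))  # 8
print(max(group_2) - min(group_2))  # 2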

Standard Deviation
The standard deviation (symbolized S.D) is the measure of variability
used with the mean. It indicates the amount that all the scores differ or
deviate from the mean – the more the scores deviate from the mean, the
higher the standard deviation. The sum of the deviations of the scores
from the mean is always 0. There are two types of formulas that are
used to compute S.D.
 Deviation formula.
 Raw score formula.
The deviation formula illustrates what the S.D. is, but is more difficult to
use by hand if the mean has a fraction. The raw score formula is easier
to use if you have only a simple calculator.
Let us use the scores: 7, 2, 7, 6, 5, 6, 2.

(i). Deviation Formula

S.D. = √[ ∑(X − X̄)² ÷ (n − 1) ]

Where S.D. is the standard deviation, X represents the scores, X̄ is the mean,
and n is the number of scores.

Some books, calculators, and computer programs will use the term n
rather than n-1 in the denominator of the standard deviation formula.
When the sample is large you can use n because a larger sample
approaches the population size.
Why n-1?
i. Use of n-1 gives a good estimate of the population variance or S.D.
That is, it gives an unbiased estimate of the population variance.
ii. We use n-1 when the sample size is small in order to get an unbiased
estimate of the population variance.

In this illustration let us use n-1.


Step 1
X̄ = ∑X ÷ n = 35 ÷ 7 = 5

Steps 2 – 3

X      X̄      (X – X̄)     (X – X̄)²
7      5        2            4
2      5       -3            9
7      5        2            4
6      5        1            1
5      5        0            0
6      5        1            1
2      5       -3            9
∑X=35          ∑=0          ∑=28
Step 4
S.D. = √[28 ÷ (7 – 1)] = √(28 ÷ 6) = √4.67 ≈ 2.16
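A short sketch of the same deviation-formula calculation in Python, using the scores from the worked example (names are illustrative):

# Sample standard deviation via the deviation formula: sqrt(sum((X - mean)^2) / (n - 1)).
import math

scores = [7, 2, 7, 6, 5, 6, 2]
mean = sum(scores) / len(scores)                               # 35 / 7 = 5
sum_squared_deviations = sum((x - mean) ** 2 for x in scores)  # 28
sd = math.sqrt(sum_squared_deviations / (len(scores) - 1))     # sqrt(28 / 6)

print(round(sd, 2))  # 2.16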


(ii). The Raw Score formula

The deviation formula is seldom used to calculate the S.D. by hand,


because it is cumbersome when the mean has a fraction. Instead the
following raw score formula (also called the computational formula) is used to
calculate S.D:

S.D. = √{ [∑X² – (∑X)²/n] ÷ (n – 1) }

Where ∑X² is the sum of the squared scores, ∑X is the sum of the scores,
and n is the number of scores.

X        X²
7        49
2         4
7        49
6        36
5        25
6        36
2         4
∑X = 35   ∑X² = 203

The computation of the S.D. is as follows:

S.D. = √{ [203 – (35)²/7] ÷ (7 – 1) } = √[ (203 – 1225/7) ÷ 6 ] = √[ (203 – 175) ÷ 6 ] = √(28 ÷ 6) = √4.67 ≈ 2.2
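The raw score formula can be checked the same way; it agrees with the deviation formula above and with the library's own n − 1 implementation (a sketch, with illustrative names):

# Sample standard deviation via the raw score (computational) formula.
import math
import statistics

scores = [7, 2, 7, 6, 5, 6, 2]
n = len(scores)
sum_x = sum(scores)                          # 35
sum_x_squared = sum(x * x for x in scores)   # 203

sd = math.sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))

print(round(sd, 2))                           # 2.16
print(round(statistics.stdev(scores), 2))     # 2.16, statistics.stdev also uses n - 1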


How do you interpret the scores in relation to the mean and S.D?
In reporting your pupils‘ scores, you need to report both the mean and
the S.D.
 A test norm allows meaningful interpretation of test scores.
 A person‘s raw test score is meaningless unless evaluated in terms
of the standardized group norms. For example, if a student
receives a raw score of 78 out of 100 in history, does that mean
that the student is doing well?
The score of 78 can be interpreted only when the norms are consulted. If
the mean of the test norm is 80 and the standard deviation is 10, the
score of 78 can be evaluated as "typical" performance, indicating that the
student possesses an average knowledge of history.
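As a rough illustration of this kind of norm-based judgment, the sketch below flags a raw score as "typical" when it lies within one standard deviation of the norm mean; the one-S.D. cut-off and the labels are illustrative assumptions, not an official rule.

def interpret(raw_score, norm_mean, norm_sd):
    # Compare the raw score with the norm group's mean and standard deviation.
    deviation = raw_score - norm_mean
    if abs(deviation) <= norm_sd:
        return "typical performance (within one S.D. of the norm mean)"
    return "above average" if deviation > 0 else "below average"

print(interpret(78, norm_mean=80, norm_sd=10))  # typical performance ...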

SELF-ASSESSMENT EXERCISE
Use raw score formula to compute the mean and the SD for the test
scores of the following two groups of students:

Group 1: 9, 5, 1
Group 2: 5, 6, 4
What does the S.D. tell you about these two groups?
For Group 1 you should get an S.D. of 4.
For Group 2 you should get an S.D. of 1.
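The exercise answers can be checked quickly with Python's statistics module, which uses the same n − 1 formula:

import statistics

print(statistics.stdev([9, 5, 1]))  # 4.0  (Group 1)
print(statistics.stdev([5, 6, 4]))  # 1.0  (Group 2)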


Interpretation

Though both groups have a mean score of 5, pictorially/graphically the
spread of the scores will look like the figure given below.

[Figure: frequency plotted against score values from 0 to 9 for the two groups; Group 2's scores cluster narrowly around 5, while Group 1's scores are spread widely.]

For both groups, the test scores have the same mean, but different
variability or spread of scores. Students in Group 1 have a larger S.D
(SD=4) indicating that they are more heterogeneous in ability. Students
in Group 2 have a smaller S.D (S.D = 1) indicating that they are more
homogeneous in ability.

**********************************************************
