EPSC 228/311/0228: Measurement and Evaluation (Revised March 2021)
TABLE OF CONTENTS
LECTURE 1: INTRODUCTION
LECTURE 2: MEASUREMENT CONCEPTS
LECTURE 3: THE USES OF TESTS
LECTURE 4: EXAMINATION SYSTEM IN KENYA
LECTURE 5: EVALUATION AND PERFORMANCE STANDARDS
LECTURE 6: TESTING CODE OF ETHICS
LECTURE 7: PSYCHOMETRIC CHARACTERISTICS OF A GOOD TEST
LECTURE 8: BLOOM’S DOMAINS OF LEARNING
LECTURE 9: LEARNING OBJECTIVES AND LEARNING OUTCOMES
LECTURE 10: TEST PLANNING
LECTURE 11: TEST SPECIFICATIONS AND TEST CONSTRUCTION
LECTURE 12: ITEM ANALYSIS
LECTURE 13: DEFICIENCIES IN TEACHER-MADE TESTS
LECTURE 14: TEST ADMINISTRATION, SCORING AND INTERPRETATION
LECTURE 15: STATISTICAL ANALYSIS OF TEST SCORES
LECTURE 1: INTRODUCTION
Welcome to lecture 1
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Discuss how testing developed in China, Europe
and America, and its relationship to the African context.
Explain the concept of education with special
reference to the United Nations, the World Trade
Organization and the Kenyan constitution.
African Context
Did pre-modern African society have tests? Yes. Wrestling, jumping,
dancing, tooth extraction, and circumcision were among the tests
used by women for the choice and selection of suitors, for the show
of bravery, and for readiness for the society's defence.
In China, Europe and the USA, the increased use of testing was
attributed to three major areas of development: the civil service,
schools, and the study of individual differences.
Chinese Context
Civil-service testing began in China about 3000 years ago when an
emperor decided to assess the competency of his officials. It has
been reported that by the year 2200 B.C. the emperor of China
examined his officials every third year and, after three examinations,
either promoted them or dismissed them from the civil service.
Later, the Chinese government positions were filled by persons who
scored well on examinations that covered topics such as music,
horsemanship, civil law, and writing.
Such examinations were eliminated in 1905 and were replaced by
formal educational requirements.
European Context
Students in European schools were given oral examinations until
after the 12th century.
In the 16th century the Jesuits (a Catholic order whose members vow
poverty and obedience) started using tests for the evaluation
and placement of their students. By this time, the Jesuits were running
schools across Europe. The Jesuits' standardized curriculum and
teaching methods became the basis of many education systems today.
In the 19th century the study of individual differences, led by Sir
Francis Galton (1822-1911), began in Great Britain. He was the first
experimental psychologist to look into psychological differences
(nature and nurture) in sensory and motor skills between people,
and to apply statistical methods (correlation and regression) to the
analysis and quantification of individual differences.
By 1905, when China was phasing out the use of examinations in the
public service, civil-service examinations were being developed in
Britain and the United States as a way of selecting applicants for
government jobs.
In 1905 Alfred Binet (1857-1911) developed the first individual
test of intelligence as part of his work on the study of individual
differences (IQ = M.A./C.A. x 100). He was asked by the French
government to devise a means of identifying schoolchildren in need
of special instruction.
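The ratio-IQ formula given above (mental age over chronological age, multiplied by 100) can be sketched in a few lines. The ages used below are invented for illustration.

```python
# Ratio IQ as given in the text: IQ = M.A. / C.A. x 100.
# The mental and chronological ages below are invented examples.
def ratio_iq(mental_age: float, chronological_age: float) -> int:
    return round(mental_age / chronological_age * 100)

print(ratio_iq(10, 8))   # a child performing above age level -> 125
print(ratio_iq(6, 8))    # a child performing below age level -> 75
```

A child whose mental age equals their chronological age scores exactly 100 on this formula.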
WHAT IS EDUCATION?
Though education is seen as both an investment and a social service to
the community and the nation, it means different things to different
people or organizations.
Iceberg model of education: formal education is the visible tip;
non-formal education, informal education and the hidden curriculum
lie below the surface.
Children's roles were assigned according to age, from simple to
complex duties.
a) Seeing (Observation)
b) Imitation
2. Learning Through Oral Literature
So long as learning was taking place, testing of the same, though not in
writing or by use of paper and pen, was also in place. Punishment and
correction were indicators of failure to acquire the expected learning.
4. Our Constitution
Our constitution conceptualizes education as a fundamental human right.
Section 43 (f) states that every person has the right to education. Section
53 states that every child has the right to free and compulsory basic
education.
Classification of Education
1. By Levels
The three levels of education in Kenya are:
Primary level
Secondary level
University level
2. By Methods
Formal – School settings
Non-formal – By attachment
Informal – Family settings
3. By Programs, which include:
Special education
Science and technology education
Arts / humanities education
Vocational education
LECTURE 2: MEASUREMENT CONCEPTS
Welcome to lecture 2.
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Explain the concept of measurement.
List the attributes of scale of measurement.
Discuss the relationship between the measurement
scales.
Differentiate between qualitative and quantitative data.
In this course there are certain terms that are frequently used. These
include:
What is a test?
A test is a device that we can use to sample a candidate's or student's
behavior. In this case, behavior means performance.
Population
Sample
TYPES OF TESTS
Which is your favorite quotation in the Bible? The Bible is a document
with many tests. Our earthly tests include, but are not limited to:
Aptitude Tests
Aptitude tests are measures of potential. They are used to predict how
well someone is likely to perform in future. Tests of general aptitude are
variously referred to as scholastic aptitude tests, intelligence tests, and
tests of general mental ability. Aptitude tests are also available to
predict a person's likely level of performance after following some
specific future instruction or training.
Aptitude tests are standardized.
Example: When Kenya Science Teachers College was established as the
only institution then training secondary school science teachers, it used
an aptitude test to select the trainees.
Personality Tests
Personality tests are used for the diagnosis of behavior problems. They
are designed to
measure characteristics of individuals along a number of dimensions
including:
Attitudes towards self and others
Team building
Most of these tests are self-report measures where an individual is asked
to respond to a series of questions or statements and are available online.
The opposite of a standardized test is a non-standardized test. Such
tests are usually developed locally for a specific purpose. The tests used
by teachers in the classroom are examples of locally developed tests.
Such tests may:
Assessment
Coursework vs Examination
Evaluation
Examination
What is measurement?
Benefits of measurement
A great advantage of measurement is that one may quantify attributes
and apply the powerful tools of statistics and mathematics to study and
describe a number of relationships:
i. The relationship between mental ability and achievement. For example,
mental ability and achievement can each be measured by assigning
numbers, and an index of the relationship between the two variables
(a correlation coefficient) can be calculated. This cannot be done through
observation or description alone.
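The correlation coefficient just described can be computed with a short sketch; the ability and achievement scores below are invented for illustration.

```python
import math

# A minimal sketch of the correlation coefficient described above
# (Pearson's r) between two sets of scores. All score values here are
# invented for illustration.
def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

ability = [55, 60, 65, 70, 80]       # hypothetical mental-ability scores
achievement = [50, 58, 66, 72, 79]   # hypothetical achievement scores
print(round(pearson_r(ability, achievement), 2))  # a strong positive relation
```

A value near +1 indicates that pupils high in one variable tend to be high in the other; a value near 0 indicates no linear relationship.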
Studies show that children who are carried on the back as infants are
slower in language and emotional development. They have no opportunity
to observe the emotional expressions of their mothers, unlike children
who are carried in front, facing their mothers' faces.
[Table fragment: 12 – 10 kg; 10 – 9 kg; 9 – 8 kg]
Variable diagram: Independent variable (e.g. school type) →
Intervening variable (e.g. home chores) → Dependent variable
(school performance).
Nominal Scale
Which marriage option do you prefer? (Tick one choice).
1. Church wedding 27%
2. Civil wedding 4%
3. Traditional wedding 37%
4. Come-we-stay 33%
Source: Daily Nation, Saturday, July 8, 2017, p.3
Nominal scales are the simplest form of measurement.
A nominal scale entails the assignment of numbers or labels to
objects or classes of objects. In other words, numbers are used to
substitute for names.
Examples of Nominal Scales
- We classify and label people, objects and places according to:
- Sex: male vs female and can be assigned numbers 1 = male, 2 =
female.
- Place of residence: Nakuru or Nairobi– assign Nakuru =1, Nairobi
=2
- Jobs: teacher, lawyer, driver, nurse; 1 = teacher, 2= lawyer, 3 =
driver, 4 = nurse.
- Size: 1 = big, 2 = medium, 3=small
- Colour: 1 = Red, 2 = Black, 3 = Yellow.
- Shape: 1=round; 2=circular, 3=triangular, 4=rectangular
- Numerals on sports uniforms: The player represented by 45 is not
"better" or "more than" the player represented by 32.
Use of symbols:
A=Nakuru
B=Nairobi
C= Kisumu
M= Male
F= Female
Ordinal Scale
Ordinal scales are typically measures of non-numeric concepts such as
satisfaction, happiness, discomfort, birth order, etc.
Like the nominal scale, the ordinal scale permits classification.
However, in addition to classification, rank-ordering on some
characteristic is also permissible with ordinal scales.
It ranks objects or events in order of their magnitude.
Nominal + Rank ordering = ordinal scale
No absolute zero point
Measures quality rather than quantity.
Interval Scale
The interval scale meets all the criteria for ordinal-level measurement
and one additional criterion: the exact distances between categories of
the variable are known and are equal to each other. The distance
between points on the scale is fixed.
Two points next to each other on the scale, no matter whether they
are high or low, are separated by the same distance.
i. Distance between 0°C and 10°C = 10°C
ii. Distance between 90°C and 100°C = 10°C
Compare 10°C and 100°C: this does not mean that 100°C is 10 times
hotter than something measuring 10°C, because there is no absolute
zero; the zero is arbitrary.
0°C / / / 100°C
The difference between 20°C and 30°C is the same as the difference
between 30°C and 40°C. Each interval is 10°C.
While the intervals between 20°C, 30°C and 40°C are the same,
we cannot say that 40°C is twice as hot as 20°C.
The scale allows us to interpret how warm or cold a given day was
(here in degrees Fahrenheit, another interval scale):
- Monday = 75°F
- Tuesday = 65°F
Monday was warmer than Tuesday.
An interval scale has a zero point that does not indicate the absence
of the quality. We say it has an arbitrary zero, not an absolute zero:
zero degrees Celsius is not indicative of the complete absence of
temperature. That is, 0°C does not mean that temperature does not
exist at that point (consider the Kenyan athlete whose legs were
amputated after he walked through frozen snow).
Fallacy: 80°C is twice as hot as 40°C. The statement is fallacious
because the zero point on the Celsius scale has been arbitrarily set
at the freezing point of water.
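The fallacy above can be shown with a short sketch: converting the same two temperatures to Kelvin, which does have an absolute zero, changes the ratio entirely.

```python
# Sketch of why ratio statements fail on an interval scale: the
# Celsius ratio suggests "twice as hot", but the Kelvin ratio of the
# same two temperatures is far smaller.
def celsius_to_kelvin(c: float) -> float:
    return c + 273.15

print(80 / 40)                                          # 2.0: "twice as hot"?
print(round(celsius_to_kelvin(80) / celsius_to_kelvin(40), 3))  # about 1.128
```

Because the Kelvin scale has a true zero, only its ratios are physically meaningful.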
The case study of the Kenyan athlete Cheseto, whose legs were
amputated
He set off on a familiar trail, and, initially all was well as he jogged into
the woods. He was not wearing a jacket or gloves to beat the chilling cold.
What happened next remains a mystery to date.
It was soon all over the news that Cheseto had gone missing for three
days.
"I gained consciousness at night covered deep in the snow and I did not
know where I was," Cheseto, already a household name in Alaska for his
running exploits, recalls. "I tried to get up but couldn't because my legs
were frozen... I tried to stretch but could not."
[Timeline: 3000 BC ... 0 ... AD 2019]
Ratio Scale
The ratio scale has all the properties of nominal, ordinal and interval
measurement, plus a true zero point.
Ratio = Nominal + Ordinal + Interval + Absolute zero point.
The ratio scale has a true or absolute zero point. The zero point means
that none of the attribute measured is present.
When you are measuring the number of responses in an experiment
or in an observation, you are using a ratio scale. Zero responses mean
literally that there are no responses.
Zero means "none". Zero height, zero weight and zero time mean that
no amount of these variables is present.
Examples of Ratio:
i. Measures of weight
On a measurement of weight, it is meaningful to say that an object
weighing 30 kgs is three times as heavy as one weighing 10 kgs.
ii. Measures of height
A person who is 180 cm tall is 1.2 times taller than one who is 150 cm
tall.
iii. Measures of time
Employees are expected to report to their work stations at 8.00 am.
Employee A reported at 9.00 am = 1 hour late
Employee B reported at 10.00 am = 2 hours late
Employee C reported at 11.00 am = 3 hours late
Employee D reported at 12.00 noon = 4 hours late
Employee D arrived three hours later than Employee A. You can also
count how many employees came four hours late.
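The lateness example can be sketched as follows: elapsed time past the 8.00 am reporting time is on a ratio scale, so differences and counts are meaningful. The date used is arbitrary.

```python
from datetime import datetime, timedelta

# Sketch of the lateness example: minutes past the 8.00 am reporting
# time form a ratio scale (zero lateness means none at all).
report_time = datetime(2021, 3, 1, 8, 0)
arrivals = {
    "A": datetime(2021, 3, 1, 9, 0),
    "B": datetime(2021, 3, 1, 10, 0),
    "C": datetime(2021, 3, 1, 11, 0),
    "D": datetime(2021, 3, 1, 12, 0),
}
lateness = {name: t - report_time for name, t in arrivals.items()}
print(lateness["D"] // timedelta(hours=1))                    # D: 4 hours late
print((lateness["D"] - lateness["A"]) // timedelta(hours=1))  # 3 hours later than A
```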
iv. Income
Person A earns Kshs. 60,000/=
Person B earns Kshs. 20,000/=
Person A earns three times as much as Person B.
A value of zero income means no earned income.
v. Length
- Once
- Twice
- Three times
Summary: A ratio scale is one that has a meaningful zero point as
well as meaningful differences between the numbers on the scales.
Nominal: Identity
Ordinal: Identity + Order
Interval: Identity + Order + Equal intervals
Ratio: Identity + Order + Equal intervals + Absolute zero
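The hierarchy of scales can be sketched in terms of which statistic becomes meaningful at each level. All data values below are invented for illustration.

```python
from statistics import median, mode

# Hedged sketch: the operation that becomes meaningful at each scale.
nominal = ["red", "blue", "red", "green"]   # labels: only the mode applies
ordinal = [1, 2, 2, 3, 5]                   # ranks: the median is meaningful
interval = [20, 30, 40]                     # degrees C: differences are meaningful
ratio = [10.0, 30.0]                        # kg: ratios are meaningful

print(mode(nominal))               # most frequent label
print(median(ordinal))             # middle rank
print(interval[-1] - interval[0])  # a 20-degree difference
print(ratio[1] / ratio[0])         # 30 kg is 3.0 times as heavy as 10 kg
```

Each scale inherits the operations of the scales below it, which is why the ratio scale supports all four.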
Data Types: quantitative and qualitative.
Precision increases up the hierarchy of scales: nominal → ordinal →
interval → ratio.
LECTURE 3: THE USES OF TESTS
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Discuss the various uses of tests in educational settings.
Explain the side effects of school-based testing.
1. Placement
Tests are used to determine the grade or year in which a pupil should
be enrolled, and for putting learners in either homogeneous or
heterogeneous groups for purposes of instruction.
2. Diagnosis
Tests can be used to diagnose the weaknesses or learning difficulties of
pupils. While placement usually involves the status of the individual
relative to others, diagnosis is used to isolate the specific deficiencies
that account for low or undesirable status. In pre-school to secondary
school settings, tests can identify areas where students need to make
improvements.
For purposes of inclusive education and the provision of a Least
Restrictive Environment, diagnosis with respect to hearing difficulties,
visual difficulties, epilepsy and other challenges is necessary. With
effect from January 2018, Grade 1 admission is accompanied by medical
reports.
In physical education / exercise or health settings, test results are
used to diagnose a health problem. In PE, a treadmill stress test is
used to screen for heart disease (cf. the 2016 case of businessmen
climbing Mt Longonot, one of whom died).
In case of reading and writing, a child who suffers from dyslexia
writes in mirror images:
18 for 81
81 for 18
F for 7
b for d
5 for 2
A child who suffers from dyslexia speaks one thing and writes another
thing.
3. Evaluation of Achievement/Instruction
One goal of testing is to determine whether individuals have achieved the
course objectives.
Placement, diagnosis and the evaluation of achievement together form
the basis of individualized instruction. In pre-school to secondary school
settings, this can be the achievement of instructional objectives. Testing
is therefore used to assess and improve teaching.
4. Prediction
The test results can be used to predict the pupil's level of achievement in
future activities or predict one measure from another e.g. from KCPE to
KCSE performance. Prediction often seeks information on future
achievement from a measure of present status, and it may help students
to select the activities they are most likely to master.
5. Readiness
Pre-school tests are used to measure the child's readiness for Standard 1
tasks.
6. Personal/Guidance and Counselling
The test results can be used in making decisions about the future
e.g. subject choice/ career choice.
Assist individuals to make wise decisions for themselves (personality
tests, aptitude tests, etc.)
7. Grading/Classification
The test results can be used to assign pupils to a particular achievement
classification e.g. first class, second class and pass.
8. Programme Evaluation
Test results of participants can be used as one bit of evidence to evaluate
the programme. By comparing the KCSE results of a County school
against national norms or standards, important decisions can be made.
Comparing changes in class performance between tests can provide
evidence of the effectiveness of teaching.
[Diagram: X1 (Test 1) → X2 (Test 2) → Results]
9. Motivation
Test scores can be motivating. Achievement of important standards can
encourage one to achieve higher levels of performance or to participate
regularly in physical activity. When children are given feedback or positive
remarks on their learning progress, they are motivated.
[Diagram: X1 (pretest scores) → X2 (post-test scores), M = 80]
Mitigation by KNEC
i. With effect from 2019, KNEC no longer requires candidates
to use index numbers based on mock or other school
examinations, but rather the registration numbers
given when they entered Form 1.
ii. Another way is to randomize the candidates' numbers.
In the Pygmalion study, class teachers were told that the intelligence test
that was administered to the children had identified 20% of the children
as "late bloomers". This was not true. The 20% of the children identified
to the teachers were actually selected at random.
LECTURE 4: EXAMINATION SYSTEM IN KENYA
Welcome to lecture 4.
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Discuss the current place of national examinations in
Kenya.
Discuss public criticisms of national examinations in
Kenya.
Explain non-normality in examinations.
Discuss the proposed Competency Based Education.
Since the colonial era, examinations have been part and parcel of
education in Kenya. During the colonial period secondary school
examinations for Form IV and Form VI were conducted by Cambridge
Examinations Syndicate.
In 2016, only 15% achieved between A and C+. This is a group that
demonstrated mastery of subject content and the readiness for further
learning. The implication of the 15% is that the affirmative action policy
would not be applied that year in university admission; if it were
applied, candidates with grade C would have to be admitted.
Colonial Period
1925: Nairobi European School did worse than the Indian schools in the
KCE. The results were suppressed: the white community did not want to
hear that Indian students had done better than the whites.
1940: Alliance High School and Mang'u High School did better than the
Prince of Wales School. The white community was unhappy.
Post-Colonial
After independence, Kenya, Uganda and Tanzania, under the umbrella of
the East African Community, established the East African Examinations
Council to take over from the Cambridge Examinations Syndicate.
KICD (Curriculum Development) → KNEC (Examining the Achievement of
the Curriculum)
Examination Development
Examination development is a process that starts with test planning and
ends with the moderation of draft questions. The subject experts are
guided through the
table of specifications by the subject officer. The subject experts use the
syllabi to prepare the test items. Once the test items have been prepared,
they are subjected to moderation by another team of subject experts.
Moderation is a process of ensuring that test items are unambiguous
and free of errors.
Item Banking
Discuss the concept of Item Banking and its importance and
challenges in a corrupt society like Kenya.
Item banking is used globally by examination boards in Europe.
KNEC used item banking until 2016, when the Cabinet Secretary
(CS) ignorantly directed KNEC to abolish it.
It is used at the Open University of Tanzania.
It is useful when the curriculum has stabilized.
Item banking is also useful in case of a major disaster, such as a
plane crash while bringing examinations from the UK (where they
are printed), or a mass leakage that may lead to cancellation of
the entire examination.
Examination Administration
This is a process of field administration of the examinations to ensure that
no candidates receive unfair advantage. This involves:
(i) Recruitment of examiners, supervisors and invigilators.
(ii) Briefing of supervisors and invigilators.
(iii) Distribution of question papers to all examination centres
(iv) Ensuring that transport for examination materials and personnel,
including security during the examination period, is in place.
(v) Training of examiners on marking skills and co-ordination of the
entire examination processing.
(vi) Compiling records from field reports of examination irregularities
encountered.
Examination Processing
Examination processing is an exercise that involves:
(i) Marking being the main activity. KNEC examinations
marking/scoring is in two parts, namely:
(a) Manual marking for non-objective test items (essay)
(b) Mechanical marking for objective test items using
Optical Mark Readers (OMR)/ scanners. Also referred to as
Optical Mark Recognition.
Advantages of Optical Mark Readers (OMR)
Speed
Accuracy
Cost-efficiency – no human markers needed
Post-mortem Reports
After the examination, KNEC carries out a post facto analysis of the
candidates' performance per subject. The purposes of this exercise are:
To identify the general performance trend of the candidates.
To identify the test items that attracted poor performance and that
reflect poor teaching.
To assist teachers to know the areas where candidates performed
poorly.
What will happen to the 2016 KCSE results, since the government is
hiding subject-by-subject performance?
The 2016, 2017 and 2018 KCSE examination results did not reflect a
normal distribution. Abilities or behaviors of a large population assume a
normal distribution, unless there are biases. When a distribution is
normal, it should be like the bell-shaped figure given below.
[Figure: bell-shaped normal curve over z-scores from -3 to +3; about
0.13% of cases lie beyond each of ±3 standard deviations.]
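The proportions under the bell curve can be sketched from the standard library: the share of a normal population lying within z standard deviations of the mean follows directly from the error function.

```python
import math

# Sketch of the proportions under the normal curve, built from the
# error function: erf(z / sqrt(2)) is the share of cases within
# z standard deviations of the mean.
def proportion_within(z: float) -> float:
    return math.erf(z / math.sqrt(2))

for z in (1, 2, 3):
    print(z, f"{proportion_within(z):.4f}")
# Beyond +3 SD lies (1 - 0.9973) / 2, i.e. about 0.13% of cases.
```

This reproduces the familiar 68–95–99.7 rule for ±1, ±2 and ±3 standard deviations.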
MILITARIZATION ERA
2019 KCSE (female candidates by grade)
Grade: A A- B+ B B- C+ C C- D+ D D- E
F: 269 2,172 5,145 9,803 14,961 21,425 32,084 43,083 51,813 69,809 76,198 12,936
A and A- combined = 6,423 (0.92%), of which A- = 5,796
C+ and above = 125,746 (18%)
Conclusion: skewed
Gender: Boys = 355,782 (51%); Girls = 341,440 (49%)
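The shares quoted above can be checked against the candidature totals given in the text (355,782 boys and 341,440 girls):

```python
# Sketch verifying the percentages quoted in the text from the 2019
# KCSE candidature totals.
boys, girls = 355_782, 341_440
total = boys + girls
print(total)                            # 697,222 candidates
print(round(6_423 / total * 100, 2))    # share with A or A-
print(round(125_746 / total * 100, 1))  # share with C+ and above
print(round(boys / total * 100))        # boys' share
```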
2016 KCSE
The extreme ends of the curve are not comparable: 141 (0.02%) scored
grade A against 33,399 (5.85%) who scored grade E. A and A- together
= 4,786 (0.93%), while D- and E = 183,328 (32.10%).
Causes of Non-Normality
1. Poorly set examination. This is possible in the 2016 examination,
where the setters copied test items from some commercial revision
texts. Newspapers printed the exact papers alongside the commercial
texts, so schools that used those books had an unfair advantage. To
hide this flaw, no extra examination papers were left in schools.
Because KNEC was directed to disregard the item-banking practice,
the examinations were developed in a hurry.
2. Lack of content validity. This means the examination items were
drawn from outside the content taught.
3. Difficult examination – leading to the majority of students
performing poorly. A difficult examination does not discriminate
bright students from weak students.
4. Poor marking conditions. This is a possibility. Reports indicate
that markers worked long hours (6.00 am – 1.00 am) and
experienced fatigue. Under these hostile conditions most markers
were not willing to participate in 2017.
5. Poor coordination of examiners and hence failure to adhere to the
marking scheme. Evidence gathered by KNUT shows that there was
no standardization and moderation of the marking schemes.
6. The ignoring of the award stage – where the chief examiners of
every paper look at the performance in their subjects across the
country and then propose the grading system. This is meant to
normalize the grades. In 2016, this stage was skipped in the hurry to
release the results for political mileage. It is not possible to apply the
grading system for Mathematics to English: performance varies, and
therefore it must be moderated and standardized.
Ignoring the award stage means that one grading system was
applied, leading to the erroneous conclusion that the poor
performance was attributable to the elimination of cheating.
The issues surrounding the debate and the politics of the 2016 and 2017
KCSE are:
Normal Curve
In a normal distribution:
a. 2016 KCSE
b. 2017 KCSE
Difficult questions
Poor teaching
Poor content validity
Rushed marking for political expedience.
Poor marking
Use of commercial test items that give advantage to
some schools. This was the case in 2016.
Fatigued examiners
Lack of standardization and moderation of test items
Non-adherence to standard practices/procedures
Treating the KCSE as a criterion-referenced test rather than as a
norm-referenced test.
[Figure: (i) Norm-referenced and (ii) criterion-referenced score
distributions, each plotted on axes running from 0 to 100%.]
NON-STANDARD PRACTICES/PROCEDURES
The non-standard procedures adopted may have affected the 2016 and
2017 KCSE examination results. Read the Daily Nation, Saturday,
December 23, 2017, p. 1, on teachers' pain in marking the KCSE. These
non-standard procedures include, inter alia:
There are many reasons schools are moving away from seat time and
toward competency-based learning. These reasons include:
CBE has had little success, if any, in Africa. South Africa and Malawi
tried it and abandoned it. It was tried for 12 years in South Africa. Why
did it fail in South Africa?
Challenges of CBC
1. Failed system
It is also important to note that the proposed Competence-Based
Curriculum was tried for 12 years in South Africa, where it was called
Outcomes-Based Education (OBE), and it failed and was abandoned. It
was also tried in Malawi and abandoned.
The proposed Kenyan CBE was borrowed from Japan and South Korea.
Kenya does not have the culture of these countries, and it is wrong to
assume that we will succeed. Do we have the Asian culture that has
been responsible for the success of CBE? No.
5. Expensive to implement.
6. Overloaded syllabus: between 11 and 12 subjects to be taken.
7. Pupils' progression from primary to secondary is unclear.
8. Teacher-based subjective assessment in determining movement
from primary to secondary is biased, and hence pupils cannot go to
their school of choice in the absence of a standard, unifying national
examination.
9. Damages national integration, because children will remain in
their neighborhoods.
10. Demands literacy of parents.
LECTURE 5: EVALUATION AND PERFORMANCE STANDARDS
Welcome to lecture 5.
What is evaluation?
In terms of measurement, evaluation is defined as the ability to make a
judgment on the basis of given information: judging the worthiness of a
programme, project or course. It is the use of measurement in making
decisions.
A story is told to illustrate the birth of the concept of evaluation. "In the
beginning God created the heavens and the earth" (Genesis Chapter 1).
And God saw everything that He had made, and said, "Behold, it is very
good." And the evening and the morning were the sixth day. And on the
seventh day God rested from all His work. His archangel then came to
Him asking, "God, how do you know that what you have created is 'very
good'? What are your criteria? On what data do you base your
judgement? Aren't you a little too close to the situation to make a fair
and unbiased evaluation?" God thought about these questions all day,
and His rest was greatly disturbed. On the eighth day God said, "Lucifer,
go to hell." Thus was evaluation born in a blaze of glory... a legacy
under which we continue to operate.
3. School
For school evaluation, we focus on how the education programme
functions with respect to:
School activities
School resources
4. Programme
For programme evaluation, we look at the following:
Needs assessment
Initial objectives of the programme
Activities
Impact
5. Personnel
For personnel evaluation (staff evaluation) we look at:
Performance
Productivity
Evaluation Design
Assignment:
What is still there (impact)? Examine the impact (what is still there
on the ground).
The design of the study is, to a large extent, set by the terms of the
assignment. The data then have to be collected.
Purposes of Evaluation
1. Programme Improvement
Example
First examine the objectives of the feeding programme – why was the
feeding programme introduced?
Objective: To assist children from poor backgrounds
to attend school.
Question/results: Has higher school attendance been
achieved? Are there now more children from poor backgrounds
in school than before the project was introduced?
If yes, then the project has met its objectives.
If no, then the project has failed.
Is there still higher school attendance since the project ended?
(Impact).
Grades over four assessments:
Student A: 6 6 6 6
Student B: 4 6 8 6
Here you can see that both students have an average grade of 6. But
what does it really say? If you just show your student this grade, it means
nothing. If you dig deeper and take a look at the process, you can see
that one student actually did better than the other. Student B shows
progress and improvement in the learning material. Student A has the
same grade as Student B, but he's stuck. He doesn't get a complete grip
on the learning material, and only masters some parts of it.
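The two learners above can be sketched numerically: identical averages can hide very different learning trajectories.

```python
from statistics import mean

# Sketch of the two learners: same average, different trajectories
# across the four assessments.
student_a = [6, 6, 6, 6]
student_b = [4, 6, 8, 6]

print(mean(student_a), mean(student_b))                   # both average 6
print([b - a for a, b in zip(student_b, student_b[1:])])  # B's step-to-step change
```

Looking at the step-to-step changes, Student B climbs before dipping, while Student A's changes are all zero, which is exactly the information a single average grade hides.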
4. Intervention
Evaluation of a demonstration/intervention programme, or of a new
approach to a social problem.
Example: Smoking addiction → intervention (e.g. therapy for the
addiction, such as giving coffee every hour) → stop smoking.
5. Hidden Agendas
Sometimes the true purpose of the evaluation has little to do with
actually obtaining information about the programme's performance.
Step 1 Objective
Step 2 Pretest
Step 3 Instruction
Step 4 Measurement
Step 5 Evaluation
Step 1: Objective
Preparation of the objectives is the first step in the evaluation process,
because objectives determine what we will seek to achieve. The
objectives give direction to instruction and define what behaviors we want
to change.
Step 2: Pretest
With some type of pretest, we can answer three questions:
1. How much has already been learned?
2. What are the individual's current status and capabilities?
3. What type of activity should be prescribed to help achieve the
objectives?
Step 3: Instruction
Sound instructional methods are needed to achieve the agreed-on
objectives. Different instructional procedures may be needed to meet
students' individual needs.
Step 4: Measurement
This involves the selection or development of a test to gauge the
achievement of the objectives. It is crucial that the test be designed to
measure the behavior specified in the objectives. The objectives can be
cognitive, affective, or psychomotor. The key element is to select or
develop a test that measures the objectives. Often, teachers will need to
develop their own tests, because standardized tests may not be
consistent with their instructional objectives. This is a common method
used to provide evidence of test validity in educational settings.
Step 5: Evaluation
Once the instructional phase has been completed and achievement has
been measured, the test results are judged (i.e. evaluated) to determine
whether the desired changes stated in the objectives were achieved.
What happens if students do not achieve the desired objectives? The
figure given above shows a feedback loop from evaluation back to each
component of the model. Failure to achieve the stated objectives may be
due to any segment of the model. First, it may be discovered that the
objectives are not appropriate and need to be altered. Alternatively, the
instruction may not have been suitable for the group, or the selected test
may not have been appropriate. The educational evaluation model is
dynamic: it provides the information needed to alter any aspect of the
educational process.
Example 2: Project Evaluation Phases
Budget control
4. Project Evaluation. This involves:
Effect / impact assessment
Identifying side effects
Accountability
Donors like the World Bank and African Development Bank use Project
Cycle given above in evaluating the projects they fund.
Project Risks
Any project has risks that need to be pointed out. In any project we have:
Manifest effects = intended effects, e.g. encouraging school
enrollment.
Latent effects = unintended effects, which include project
risks, e.g. warriors also became interested in going to school.
In Kenya project risks may include:
i. Political instability.
The concepts of manifest and latent effects are sociological concepts first
formulated by Robert Merton.
[Figure: formative evaluation takes place during the programme; summative evaluation takes place at the end of the programme.]
Formative Evaluation
This is the judgment of achievement during the formative stages of
learning. Feedback is one of the most powerful variables in learning.
Summative Evaluation
This is the judgment of achievement at the end of an instructional unit,
and typically involves the administration of tests at the conclusion of an
instructional unit or training period. Summative means “totaling up” to
indicate the level a learner has reached in a subject.
PERFORMANCE STANDARDS
Performance standards are the criteria to which the results of
measurement are compared in order to interpret them. A test score in
and of itself means nothing. There are two ways in which we can
interpret the results of a test. The two most widely used types of
performance standards are:
Norm – referenced standards
Criterion – referenced standards
These are the two bases for the comparison of performance or
interpretation of performance.
(i) Norm-Referenced
[Figure: norm-referenced interpretation; a score is compared against the distribution of other examinees' scores.]
(ii) Criterion-Referenced
[Figure: criterion-referenced interpretation; a score is compared against a fixed criterion or cut-off on a 0-100% scale.]
LECTURE 6
Welcome to lecture 6.
Tests can be useful tools, but they can also be dangerous if misused. It is
our professional obligation to ensure that we use tests as accurately and
as fairly as possible.
As in research, testing requires that the test administrator and
test user adhere to the following code of ethics and standards.
1. Consent from parents / guardians.
2. Consent from school and local administration. These are called gate
keepers.
3. Be conducted in a fair and ethical manner, which includes:
Security of testing materials before, during and after testing, to
ensure fairness. KNEC hires police to guard examination materials.
Security of scoring.
Confidentiality.
Testing to cover materials taught.
Training staff on testing and scoring.
Using tests that are developmentally appropriate.
Interpreting results within acceptable norms. There must be a
rationale for decisions based on test scores.
LECTURE 7
Welcome to lecture 7.
PSYCHOMETRIC CHARACTERISTICS OF A
GOOD TEST
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
List and describe the characteristics of a good test.
List the factors that affect validity of a testing
instrument.
List the factors that affect reliability of a testing
instrument.
Explain the different ways of estimating reliability.
1. VALIDITY
A test is said to be valid when it measures what it is intended to
measure.
What are you looking for when you are establishing the validity of an
instrument? You are looking for:
i. Trustworthiness of the instrument. Is the instrument trustworthy?
ii. Credibility of the instrument. Is the instrument credible?
A valid classroom test measures what has been taught (or should have
been taught). There are several aspects of validity. These are:
Content validity
Construct validity
Face validity
Concurrent validity
Predictive validity
Two of these validities are of particular importance with respect to
teacher-made tests: content and construct validity.
Content Validity
This is the most important validity for practicing teachers. It measures the
extent to which a test adequately covers the syllabus to be tested.
Content validity refers to the extent to which a test "covers" the content.
If the test does not cover the content that has been taught, then it
is not valid.
An essay test intended to measure knowledge is likely to lack
content validity because of the severe limitation on the number of
topics that can be included in one such test. In other words, the
sample of possible learning is small; hence an essay test lacks
content validity and content balance.
A valid test provides for measurement of a good sampling of
content and is balanced with respect to coverage of the various
parts of the subject.
In primary schools, where teachers are compared at the end of the term
on the performance of their classes, teachers may set easier questions for
their classes in order that their pupils perform better.
Construct Validity
This is another aspect of validity that is important in teacher – made
tests. It refers to the kinds of learning specified or implied in the course
/learning objectives. That is, it is based on learning objectives.
For example, if the course learning objective specified that at the end of
the course the learner must:
i. Identify four characteristics of living things, test tasks are required
to measure identification of these characteristics. Each kind of
learning objective must be tested to provide a valid measurement of
achievement.
ii. Identify the methods freedom fighters used to destabilize the colonial
regime.
A test which measures only knowledge (e.g. ability to recall important
historical events) must lack validity if the objectives specify other kinds
of learning.
Face Validity
The test should look as if it is testing what it is intended to test. This,
however, is only a starting point.
Face validity describes how well a measurement instrument appears
to measure what it was designed to measure, as judged by its
appearance. For example, a test of mathematical ability would have
face validity if it contained math problems.
A test is said to have face validity if it appears to be measuring
what it claims to measure.
For example:
If we are trying to select pilots from highly trained personnel,
face-valid tests of rapid reaction time will ensure full
Concurrent Validity
This is where test results are compared with another measure of the
same abilities taken at or about the same time. For example, comparing
mock results and actual KCSE results; these examinations are taken at
about the same time, one in July (Mock) and the other in November
(KCSE). In the 1970s, mock results were used to select students to join
Form 5 in January before the KCE results were released, on the belief
that mock was a good measure of the final examination, i.e. that it had
good concurrent validity.
Predictive Validity
This is where test results are compared later with another criterion,
such as success in a particular job or in higher education, e.g. KCSE
Grade D being a good predictor of doing well in the police service.
Predictive validity is good support for the efficiency of a test.
For example,
(i) Good KCPE results predicting good KCSE results or
(ii) Good KCSE results predicting good GPA at the university.
2. RELIABILITY
When you say that a friend is reliable, what do you mean?
Reliability refers to the degree to which a test or an assessment tool
produces consistent results. A reliable test gives consistent or dependable
results. That is, a good test, a good measuring tool or instrument, is
reliable: with repeated testing, each student will maintain about the
same relative rank in his/her group (or achieve about the same test
scores) each time he/she takes the test.
If the scores on the first sitting are X1 ... X5 and on the second sitting Y1 ... Y5, then X1 ≈ Y1, X2 ≈ Y2, ..., X5 ≈ Y5.
If the same pupils take a vocabulary test twice within a short period of
time, their scores on the two occasions should be similar or the same.
Just as a bigger sample n comes closer to the population N, the more
test items developed or drawn from the course coverage, the better the
test. Tests with few items are low in reliability because they fail to cover
the course content.
[Figure: three score distributions; (a) negatively skewed, (b) symmetrical, (c) positively skewed.]
A positively skewed distribution (c) reflects a very difficult test while a
negatively skewed distribution (a) reflects an easy test.
(iv) Length of a Test
The length of a test affects reliability. A very short test (of five items,
for example) cannot spread the scores sufficiently to give consistent
results. Five items are too few to provide a reliable measure. In general,
the longer the test, the more reliable it is.
(v) Erratic or inconsistent Marking/Scoring
If the markers are erratic in scoring, the award of scores will be
unreliable. Inconsistency in scoring leads to low reliability of results. This
is why KNEC trains examiners on marking, so that if the same test or
script is marked by different examiners, or even when the same examiner
marks the same test at different times, the scores will be similar. See the
article of September 13, 2014 on "Train every teacher on setting, marking
exams".
One student can take one form of a test, and the students sitting to the
right and left could have different variations of the same test. None of
the three students would have an advantage over the others; their
respective scores would provide a fair comparison of the variable being
measured. For example, measuring mathematical ability of standard 8
pupils.
Form A Form B
The two forms A and B are developed from the same curriculum and
subjected to the due process of the development of a good test; the two
tests can be taken concurrently or at different times.
3. Inter-Scorer Reliability
4. Inter-Observer Reliability
Time     Observer 1   Observer 2   Total   Difference
10.00    /            //           3       1
10.01    /            /            2       -
10.02    //           /            3       1
10.03    //           ///          5       1
10.04    ///          //           5       1
10.05    /            /            2       -
10.06    //           //           4       -
10.07    /            /            2       -
10.08    //           ///          5       1
10.09    //           //           4       -
Total    17           18           35      5
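One simple way to quantify the tally table above is an agreement index: one minus the proportion of disagreements. This particular index is an illustrative choice, not the only one; the counts come from the table.

```python
# Inter-observer reliability: two observers tally the same behaviour over
# ten one-minute intervals (counts taken from the tally table above).

observer_1 = [1, 1, 2, 2, 3, 1, 2, 1, 2, 2]  # tallies per interval
observer_2 = [2, 1, 1, 3, 2, 1, 2, 1, 3, 2]

total = sum(observer_1) + sum(observer_2)                               # 35
disagreement = sum(abs(a - b) for a, b in zip(observer_1, observer_2))  # 5

agreement = 1 - disagreement / total
print(round(agreement, 2))  # 0.86 -> the observers largely agree
```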
5. Inter-Item Reliability
This involves splitting the test into two halves (odd-numbered questions
and even-numbered questions) and computing a coefficient of reliability
between the two halves.
Split-half reliability estimates are widely used because of their simplicity.
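The split-half procedure can be sketched as follows. The 5 × 6 matrix of item scores is invented, and the final Spearman-Brown step, a standard correction not mentioned above, steps the half-test correlation up to full test length:

```python
# Split-half reliability sketch: correlate odd-item and even-item half
# scores, then apply the Spearman-Brown correction.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Each row: one student's scores (1 = correct, 0 = wrong) on 6 items.
items = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
]

odd_half = [sum(row[0::2]) for row in items]   # items 1, 3, 5
even_half = [sum(row[1::2]) for row in items]  # items 2, 4, 6

r_half = pearson_r(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction
print(round(r_full, 2))
```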
1. Reliability
a) Yes, the probability will be high if the two examiners are trained so
that they can mark consistently.
b) No, the probability will be low if the two are not trained.
Example 3: You know your actual weight to be 92 kg. You take your
weight three times in a day using the same machine and you find:
Morning = 85 kg
Lunch = 92 kg
Evening = 87 kg
Decision: Your scale is unreliable. The scale should read 92 kg every time
you step on it. A scale that read:
Morning = 85 kg
Lunch = 85 kg
Evening = 85 kg
would be reliable (consistent), though not valid, since it misses your true
weight.
2. Validity
[Figure: a target; a valid test hits what it aims to measure.]
2. Measuring Intelligence
Suppose you want to measure the intelligence of smart students and you
decide to use a tape to measure the circumference of their heads in
centimeters, and you consistently obtain the same values; that is, the
tape measure is reliable. But the tape measure is not a valid measure,
because we do not use a tape measure to measure intelligence. We use
an intelligence test to measure the intelligence of people.
3. OBJECTIVITY
5. COMPREHENSIVENESS
For a test to be comprehensive, it should sample major lesson objectives.
It is neither necessary nor practical to test every objective that is taught
in a course, but a sufficient number of objectives should be included to
provide a valid measure of student achievement in the complete course.
6. EFFICIENCY
A good test is efficient. Efficiency in measurement requires saving of time
for the students as well as for the teachers.
Efficient items require the least reading and responding time (e.g.
objective items), while providing a wide sampling of content.
The use of essay questions to measure knowledge is inefficient; too
few specifics can be tested in a given time.
Saving time on scoring is a measure of efficiency.
Providing clear directions on answering questions.
Free of ambiguity (unambiguous).
Tests can be designed for efficiency:
(i) By selecting the form of items which will measure what should be
measured in the shortest time, and
(ii) By constructing the test so as to save time for both the students
and the teachers.
1. Gender sensitive. Items should not reflect things that favor one
gender e.g. boys. Studies show that if a test is set on activities that
boys normally do they will perform better than girls.
2. Taken under the same condition by all candidates e.g. free of noise.
3. Free of locational bias. For example, in the 1983 CPE guided English
composition, candidates were given a photo which showed:
A matatu that had hit a pupil at a zebra crossing.
A policeman was on site and
A crowd of people were around.
Candidates were required to write a composition of what was happening.
This scene is unfair to rural pupils who might not have seen a policeman,
or might not even know that a matatu can cause an accident.
4. Culture-fair or culture-free. That is, there should be no cultural
biases / influences in the test. If a multiple-choice question asks:
Children are named according to:
A. Religious practices
B. Natural incidences
C. Random sampling of names
D. Day of the week
All the above four choices are possible answers. This is both a bad
question and an unfair one. The question is culture-loaded.
Illustration
Suppose on a "Mock Test", two students scored:
Student A: Biology-72
Student B: Physics- 70
What could you conclude?
The answer is NOTHING
You need to know the test's reliability, validity and norms. The norms
would tell how student A and B scored relative to other students like
themselves (norm group).
We might find that both student A and B scored in the average, or
that one scored below average and one above. The point here is
that interpretation of the test scores cannot be made in a vacuum.
A good test must provide direction of scoring and interpretation.
If you are given the class mean or class mastery level, then you can
make a better judgment of the performance of students A and B.
[Figure: the normal curve, marked from -3 to +3 standard deviations.]
Unless it has been established that average Mock test results for the same
age-group have over the years been:
Biology 85
Physics 65
LECTURE 8
Welcome to lecture 8.
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Explain the term taxonomy.
Distinguish between and discuss three types of
Bloom‘s domains of learning.
Explain the six categories of cognitive domains.
In testing, try to balance test items that measure lower-level abilities
with those that measure higher-level abilities.
accept, attempt, challenge, change, commend, comply, conform, defend,
discuss, display, dispute, follow, form, initiate, integrate, join, judge
Believe in
Demonstrate
Show diagrammatically
LECTURE 9
Welcome to lecture 9.
INSTRUCTIONAL/LEARNING OBJECTIVES
AND LEARNING OUTCOMES
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Explain the concept of:
Learning objectives.
Learning outcomes.
Explain why action verbs and SMART principle are
necessary in stating learning objectives.
Write your own learning objective.
Whenever you plan for class instruction, you must set the instructional/
learning objectives of the course.
Refer to your EPSC 228 course objectives.
[Diagram: Objective (O) and Performance (P).]
After teaching the course you may have the following scenarios:
i. If the two triangles do not intersect, the course objectives were not
met.
iv. The course objectives and performance may intersect 100% (in
total congruence) meaning that 100% of the objectives were
achieved.
LEARNING OUTCOMES
Learning outcomes are what students are able to do after a period
of learning or lesson.
LECTURE 10
Welcome to lecture 10.
TEST PLANNING
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Explain the why, what, and how of a test.
LECTURE 11
Welcome to lecture 11
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Define table of specifications.
Describe the role of table of specifications in test
construction
List and discuss the steps/procedure in test
construction.
Explain the merits and demerits of essay and
objective tests.
Briefly describe the guidelines for the construction of essay and
objective tests.
The number, nature, and specificity of the categories will depend on the
purpose of the test. Although the two-dimensional grid for content and
abilities is most common, there will be cases in which three or more are
used.
Once the list of topics and abilities has been decided on, the next task is
to determine the relative emphasis to be given to each topic and ability,
and to enter in each cell either a percentage of the test or the actual
number of questions to be tested in that cell. It may be that certain
topics by their nature are essentially limited to certain abilities. A test
plan may lead to insights into what has previously been untested, and to
ingenious solutions or new approaches to writing items that test what
may long have been considered untestable in an objective format.
Once the purpose of a test has been decided, the teacher or test
developer has to make two decisions, namely:
Once these two decisions have been made, the "specifications" for a
particular test can be drawn up. The "specifications" of a test are
presented in a table called a "Table of Specifications" or blueprint. A
table of specifications is a two-dimensional chart with the content
(topics) as one dimension and behavioural performance or kinds of
achievement as the other.
Topic      K    C    Ap   An   S    E    Total
Topic 1    2    2    0    0    0    1      5
Topic 2    3    2    2    1    1    1     10
Topic 3    3    1    3    2    2    2     13
Topic 4    2    5    4    2    0    2     15
Topic 5    2    2    1    0    1    1      7
Total     12   12   10    5    4    7     50
(K = Knowledge, C = Comprehension, Ap = Application, An = Analysis,
S = Synthesis, E = Evaluation: Bloom's six cognitive categories.)
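A blueprint like the one above can be held in a small data structure and its totals checked automatically. A sketch, where only the counts come from the table and the six ability labels are assumed to be Bloom's cognitive categories:

```python
# Table of specifications: items per topic per ability level, with
# row and column totals verified against the intended test length.

abilities = ["Knowledge", "Comprehension", "Application",
             "Analysis", "Synthesis", "Evaluation"]

blueprint = {
    "Topic 1": [2, 2, 0, 0, 0, 1],
    "Topic 2": [3, 2, 2, 1, 1, 1],
    "Topic 3": [3, 1, 3, 2, 2, 2],
    "Topic 4": [2, 5, 4, 2, 0, 2],
    "Topic 5": [2, 2, 1, 0, 1, 1],
}

for topic, row in blueprint.items():
    print(topic, row, "total:", sum(row))

column_totals = [sum(col) for col in zip(*blueprint.values())]
print("Per ability:", column_totals)      # [12, 12, 10, 5, 4, 7]
print("Whole test:", sum(column_totals))  # 50
```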
4. CONSTRUCTION OF TESTS
In the school setting the most convenient tests are paper and pencil tests
– or written tests. Such tests are commonly of two types:
(i) Essay tests and (ii) Objective type tests
ESSAY TESTS
The essential feature of an essay is that it is open-ended and each
candidate may present his own answer in his own particular style.
Example:
(a) Name two sources of support which helped Britain during the Mau
Mau war of 1950-1959.
(b) Why did the British lose the war?
OBJECTIVE TESTS
An objective test is one so constructed that, irrespective of who marks
the answers, the score for a particular candidate is always the same. The
objectivity really refers to the marking of the test. In order to achieve
such objectivity, objective tests usually have pre-coded answers. In any
particular item, there has to be one and only one correct answer.
MATCHING ITEMS
A matching item consists of two lists of words, phrases, pictures, or
other symbols, and a set of instructions explaining the basis on which the
examinee is to match an item in the first list with an item in the second
list. The elements of the list that is read first are called premises, and the
elements in the other list are called responses. It is possible to have
more premises than responses, more responses than premises, or to have
the same number of each. In the example of a matching exercise that
follows, the premises appear in the left-hand column, with the responses
at the right but in some cases the responses may be placed below the
premises.
The primary cognitive skill that matching exercises test is recall.
1. KANU ( ) 2007
2. NARC ( ) 2013
3. PNU ( ) 1960
4. JUBILEE ( ) 2002
5. KADU ( ) 1925
(iv) A question
(v) A complete statement
(vi) An incomplete statement
Who among the following people chaired the Kenya Constitution Review
Commission?
1. Githu Muigai
2. James Orengo
3. Yash Pal Ghai
4. Paul Muite
5. Raila Odinga
a. Knowledge
b. Synthesis
c. Evaluation
d. Analysis
e. Comprehension
(ii) A statement which itself is true, but which does not satisfy the
requirement of the problem.
(iii) A carefully worded incorrect statement.
After reading the stem, the student should know exactly what the
problem is and what he or she is expected to do to solve it.
3. State the stem in positive form.
4. Keep the item short.
5. Word the alternatives clearly and concisely. This is to reduce student
confusion.
6. Keep the alternatives mutually exclusive.
7. Avoid "all of these", "none of these" and "both A and B" answer choices.
8. Keep option lengths similar.
9. Avoid cues to the correct answer
10. Use only one correct option
11. Vary the position of the correct options
12. Guard against giving clues in the correct answers.
13. Avoid any tendency to make the correct answer consistently longer
than the distracters.
14. Avoid "give-aways" in the distracters, for example "always", "only",
"all", "never", etc.
15. Use language that is simple, direct and free of ambiguity.
16. Do not use double negatives in an item.
MERITS OF OBJECTIVE TESTS
LECTURE 12
Welcome to lecture 12
ITEM ANALYSIS
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Define item analysis.
Explain and illustrate the three aspects of item
analysis:
Item difficulty index
Item discrimination index
Distractor analysis
Calculate and interpret three aspects of item analysis.
In principle, item analysis can be carried out on both essay and objective
tests, but the techniques are most developed for objective test items.
For this type of analysis one requires the services of both subject content
specialists and test construction specialists. Most of the qualitative
analysis can be done before the tests are administered, for example,
judging the content validity is a qualitative type of item analysis.
Examining stems of items for ambiguity is another way of
qualitative analysis.
When we prepare items for a test, we hope that each of them will be
useful in certain statistical ways. That is, we hope that each item will
turn out to be of the appropriate level of difficulty for the group, that
proportionately more of the better students will get it right than the
poorer ones, and that the incorrect options will prove attractive to the
students who cannot arrive at the right answer through their own ability.
Item analysis uses statistical methods to identify any test items that are
not working well. If an item is too easy, fails to show a difference
between skilled and unskilled examinees, or is even scored incorrectly, an
item analysis will reveal it. That is, item analysis information can tell us if
an item was too easy or too hard, how well it discriminated between high
and low scorers on the test, and whether all of the alternatives
(distractors) functioned as intended. The three most common statistics or
areas reported in an item analysis are:
The item difficulty index is one of the most useful, and most frequently
reported, item analysis statistics. It is a measure of the proportion of
examinees who answered the item correctly. Teachers produce a difficulty
index for a test item by calculating the proportion of students in class who
got an item correct. The larger the proportion, the more students who
have learned the content measured by the item.
This approach is less accurate but helps teachers understand the concept
of the difficulty index in a simpler way. For example, imagine a classroom
of 40 Standard 6 students who took a test which included the item below.
What is the item difficulty of this test item? The asterisk indicates
that B is the correct answer.
A. Tom Mboya 6
*B. James Gichuru 24
C. Jomo Kenyatta 10
D. Robert Matano 0
Difficulty Index ranges from .00 to 1.00. For this example, Difficulty
Index=24/40=.60 or 60%. This means that sixty percent of students
knew the answer.
21 – 40% Difficult
41 – 60% Average
61 – 80% Easy
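The computation for the item above is a one-liner; a sketch using the response counts from the example:

```python
# Item difficulty index: proportion of the 40 pupils choosing the keyed
# answer (B) on the item above.

responses = {"A": 6, "B": 24, "C": 10, "D": 0}  # examinees per option
key = "B"

n_examinees = sum(responses.values())           # 40
difficulty = responses[key] / n_examinees

print(difficulty)  # 0.6 -> 60% correct, i.e. an "average" item
```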
In computing item difficulty index of a test item using this approach you
need to do the following:
Alternatives (Options)
Group    n     A    B*   C    D    E
Upper    10    0    6    3    1    0
Lower    10    3    2    2    3    0
* = Correct answer.
Index of Difficulty = (R_U + R_L) / (n1 + n2)
Where,
R_U = number of examinees in the upper-scoring group responding
correctly;
R_L = number of examinees in the lower-scoring group responding
correctly;
n1 + n2 = total number of examinees in the upper- and lower-scoring
groups.
ID = (students with correct answers / total students (n1 + n2)) x 100
   = 8/20 x 100 = 40%, i.e. 0.40
Since difficulty refers to the percentage getting the item right, the
smaller the percentage figure, the more difficult the item. Hence, the
Index of Difficulty can range between 0% and 100%, with a higher value
indicating that a greater proportion of examinees responded to the item
correctly, i.e. that it was an easier item.
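Using the upper/lower-group counts from the worked example above, the same index can be computed as:

```python
# Index of difficulty from upper and lower scoring groups (10 examinees
# each); option counts taken from the worked example above, key = B.

upper = {"A": 0, "B": 6, "C": 3, "D": 1, "E": 0}
lower = {"A": 3, "B": 2, "C": 2, "D": 3, "E": 0}
key = "B"

r_u, r_l = upper[key], lower[key]                  # correct per group
n1, n2 = sum(upper.values()), sum(lower.values())  # 10 and 10

difficulty = (r_u + r_l) / (n1 + n2)
print(difficulty)  # 0.4 -> 40% of the combined groups answered correctly
```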
the given item (correct and incorrect) and the examinee‘s score
on the overall test.
(Here n1 = n2 = 10, the size of each group.)
[Table: ranges of the correlation coefficient and their descriptions.]
A strong and positive correlation suggests that students who get any one
question correct also have a relatively high score on the overall
examination.
1. High scorers=6/20=0.30
Low scorers=18/20=0.90
2. High scorers=0/20=0.00
Low scorers=20/20=1.00
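The two cases above can be computed with the proportion-difference form of the discrimination index; a sketch with group size 20, as in the examples:

```python
# Discrimination index: proportion correct in the high-scoring group
# minus proportion correct in the low-scoring group.

def discrimination(high_correct, low_correct, group_size):
    return high_correct / group_size - low_correct / group_size

d1 = discrimination(6, 18, 20)   # case 1 above
d2 = discrimination(0, 20, 20)   # case 2 above
print(round(d1, 2), round(d2, 2))  # -0.6 -1.0: both items discriminate negatively
```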
Alternative Approach
R_U = number of those in the high-scoring group who got the item correct.
R_L = number of those in the low-scoring group who got the item correct.
If the discrimination index is negative, it means that for some reason
students who scored low on the test were more likely to get the answer
correct.
Just as the key, or correct response option, must be definitely correct, the
distractors must be clearly incorrect (or clearly not the "best" option). In
addition to being clearly incorrect, the distractors must also be plausible.
That is, the distractors should seem likely or reasonable to an examinee
A =6/40 =.15
B =24/40 =.60
C =10/40 =.25
D =0/40 =.00
A good distractor will attract more examinees from the lower group than
the upper group. In this example D was a very poor distractor. It was
obvious to both good and poor students. In this example distractors A and
C are functioning effectively.
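Distractor analysis for the 40-pupil item can be sketched by computing each option's share of responses and flagging options that nobody chooses:

```python
# Distractor analysis: share of examinees selecting each option on the
# item above (key = B). A distractor chosen by nobody is not functioning.

responses = {"A": 6, "B": 24, "C": 10, "D": 0}
key = "B"
n = sum(responses.values())

for option, count in responses.items():
    share = count / n
    note = "key" if option == key else (
        "non-functioning" if count == 0 else "distractor")
    print(option, round(share, 2), note)
# D attracts nobody (share 0.0): replace it with a more plausible option.
```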
Interpretation
The analysis of response options shows that those who missed the item
were about equally likely to choose answer A and answer C. No students
chose answer D. Answer option D does not act as a distractor. Students
are not choosing between four answer options on this item, they are
really choosing between only three options, as they are not even
considering answer D. This makes guessing correctly more likely, which
hurts the validity of an item.
Example 2.
Alternatives (Options)
Group    n     A    B*   C    D    E
Upper    10    0    6    3    1    0
Lower    10    3    2    2    3    0
* = Correct answer.
LECTURE 13
Welcome to lecture 13
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Explain the mistakes teachers make in developing
classroom tests.
LECTURE 14
Welcome to lecture 14
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Explain the different procedures KNEC has put in place to provide
credible examinations.
Test Administration
As already discussed, tests determine the destiny of individuals, and
hence the conditions under which they are administered must be fair
and uniform. In this respect, fair administration of tests takes into
consideration:
Rehearsal in case of national examinations.
Provision of uniform instructions on the conduct of examinations.
Test Scoring
Scoring is one pillar of fairness in an examination, and a source of
unfairness if not well managed. There are two types of scoring systems
in use in Kenya.
Kenya.
Manual scoring. Used mainly in schools and in essay types of
questions in KNEC examinations. Subjectivity/bias can be high
under manual scoring.
Electronic scoring by use of Optical Mark Reader/ Scanner.
This is used by KNEC for scoring objective test items in KCPE and
KCSE. Objectivity is high in electronic marking/scoring. Objectivity
refers to consistency in test interpretation and scoring.
The conditions that promote fair scoring of a test include:
Moderation of the marking scheme.
Training of markers in the case of essay tests.
Coordination of markers and putting them in smaller teams.
Retirement of erratic and generous markers.
Test difficulty.
Because the assumption under testing is normality, scores are interpreted
in relation to the normal curve. However, the estimated (hypothetical)
distribution and the observed (actual) distribution of test scores may
differ. These are illustrated below using Physics and Geography test
scores.
[Figure: observed distribution of candidates who took the Geography Test.]
LECTURE 15
Welcome to lecture 15.
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Describe the terms:
Population
Statistic
Describe the application and interpretation of measures of central
tendency to test scores.
Describe the application and interpretation of measures of
variability to test scores.
Statistical Concepts
(i) What is a sample? The smaller group of people who actually
participate in the test is known as a sample. This is a sub-set of the
population and is represented by lower case n.
[Figure: the sample (n) as a subset of the population (N).]
Descriptive Statistics
Once a large set of scores has been collected, certain descriptive values
can be calculated. These are values that summarize or condense the
set of scores, giving it meaning. Descriptive values are used by the
teachers to evaluate individual performance of pupils and to describe the
group‘s performance or compare its performance with that of another
group.
Once you collect data from a large sample, you can do the following
things:
Organizing and graphing test scores.
Applying descriptive statistics.
Prepare a frequency distribution table of the test scores. You can present
the test scores individually or in grouped format, in the form of a
frequency table or histogram.
176
EPSC 228/311/0228
70 – 79 6
60 – 69 1
50 – 59 3
________
17
________
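The grouping step can be sketched in code. The scores below are hypothetical, invented only to produce a small table of the same shape; they are not the actual class scores behind the table above.

```python
# A minimal sketch of building a grouped frequency distribution table.
# The scores are hypothetical, invented for illustration only.
scores = [75, 72, 78, 70, 79, 74, 65, 55, 52, 58]

# Class intervals of width 10, listed from highest to lowest as in the table.
intervals = [(70, 79), (60, 69), (50, 59)]

for low, high in intervals:
    # Count the scores that fall inside this interval (inclusive).
    freq = sum(1 for s in scores if low <= s <= high)
    print(f"{low} - {high}: {freq}")
print(f"Total: {len(scores)}")
```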
Mode
The mode is the score most frequently received. It is used with nominal
data. In the ungrouped scores given above the mode is 75. For grouped
data the modal interval is 70 – 79.
A frequency distribution can be uni-modal (one mode), bi-modal (two
modes), tri-modal (three modes), or poly-modal (many modes).
Example: 2, 2, 2, 3, 4, 6, 6, 6, 7, 8 is bi-modal, with modes 2 and 6.
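A small sketch of finding the mode(s) with Python's standard library, using the example distribution above:

```python
from collections import Counter

# Example distribution from the text: 2, 2, 2, 3, 4, 6, 6, 6, 7, 8
scores = [2, 2, 2, 3, 4, 6, 6, 6, 7, 8]

counts = Counter(scores)
max_freq = max(counts.values())
# Keep every score that reaches the highest frequency, so the same code
# handles uni-modal, bi-modal and poly-modal distributions.
modes = sorted(s for s, f in counts.items() if f == max_freq)
print(modes)  # -> [2, 6]: the distribution is bi-modal
```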
Median
The median is the middle score: half the scores fall above the median and
half below. It cannot be calculated unless the scores are listed in
ascending or descending order. Hence the procedure for getting the median
of a distribution is as follows:
First arrange the scores in ascending or descending order.
Determine the position of the median, which is the (n + 1)/2-th score in
the ordered list; with an even number of scores, the median is the
average of the two middle scores.
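The procedure can be sketched as a small function; the even-n case, where the median is the average of the two middle scores, is included for completeness:

```python
def median(scores):
    """Middle score of an ordered list; average of the two middle
    scores when the number of scores is even."""
    ordered = sorted(scores)      # first arrange the scores in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd number of scores
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([9, 5, 1]))      # -> 5
print(median([5, 6, 4, 7]))   # -> 5.5
```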
Mean
The mean is the arithmetic average of the scores:

X̄ = ∑X / n

Where X̄ (X bar) is the mean, ∑X is the sum of the scores, and n is the
number of scores. The symbol ∑ (called sigma in Greek) means sum of;
hence ∑X means the sum of all scores. X represents individual scores and
n is the number of students or number of scores.

Mean (X̄) of 66, 65, 61, 59, 53 = 304/5 = 60.8
The mean is appropriate for interval or ratio data.
The disadvantage of the mean is that it is influenced by outliers.
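The mean computation, in code, with a quick look at how a single low outlier pulls it down:

```python
scores = [66, 65, 61, 59, 53]          # example scores from the text
mean = sum(scores) / len(scores)
print(mean)                            # -> 60.8

# One outlier pulls the mean sharply, which is its main disadvantage.
with_outlier = scores + [5]
print(sum(with_outlier) / len(with_outlier))  # -> 51.5
```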
Group 1 Group 2
9 5
5 6
1 4
For both groups the mean and the median are 5. If you simply report that
the mean and median for both groups are identical without showing the
variability of scores, another person could conclude that the two groups
have equal or similar ability. This is not true. Group 2 is more
homogeneous in performance than Group 1. A measure of variability is the
descriptive term that indicates this difference in the spread, scatter or
heterogeneity, of a set of scores. There are two such measures of
variability: the range and the standard deviation.
Range
The range is the easiest measure of variability to obtain and the one that
is used when the measure of central tendency is the mode or median.
The range is the difference between the highest and the lowest scores.
For example:
For Group 1: Range = 9 – 1 = 8
For Group 2: Range = 6 – 4 = 2
The range is neither a precise nor a stable measure, because it depends
on only two scores- the highest and the lowest.
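In code, the range is just the difference of the extremes:

```python
# Range = highest score - lowest score; it ignores every score in between.
group1 = [9, 5, 1]
group2 = [5, 6, 4]

print(max(group1) - min(group1))  # -> 8
print(max(group2) - min(group2))  # -> 2
```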
Standard Deviation
The standard deviation (symbolized S.D) is the measure of variability
used with the mean. It indicates the amount that all the scores differ or
deviate from the mean – the more the scores deviate from the mean, the
higher the standard deviation. The sum of the deviations of the scores
from the mean is always 0. There are two types of formulas that are
used to compute S.D.
Deviation formula.
Raw score formula.
The deviation formula illustrates what the S.D. is, but it is more
difficult to use by hand if the mean has a fraction. The raw score
formula is easier to use if you have only a simple calculator.
Let us use the scores: 7, 2, 7, 6, 5, 6, 2.
S.D. = √( ∑(X – X̄)² / (n – 1) )

Where S.D. is the standard deviation, X is each score, X̄ is the mean, and
n is the number of scores.
Some books, calculators, and computer programs will use the term n
rather than n-1 in the denominator of the standard deviation formula.
When the sample is large you can use n because a larger sample
approaches the population size.
Why n-1?
i. Use of n – 1 gives a good estimate of the population variance or S.D.;
that is, it gives an unbiased estimate of the population variance.
ii. We use n – 1 when the sample size is small in order to get an
unbiased estimate of the population variance.
Steps 2 – 3

X     X̄     (X – X̄)    (X – X̄)²
7     5        2           4
2     5       -3           9
7     5        2           4
6     5        1           1
5     5        0           0
6     5        1           1
2     5       -3           9
∑X = 35     ∑ = 0       ∑ = 28

Step 4
S.D. = √(28 / (7 – 1)) = √4.67 ≈ 2.16
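The steps of the deviation formula can be followed directly in code, using the same seven scores:

```python
import math

scores = [7, 2, 7, 6, 5, 6, 2]
n = len(scores)

mean = sum(scores) / n                             # step 1: 35 / 7 = 5
sum_sq_dev = sum((x - mean) ** 2 for x in scores)  # steps 2-3: -> 28
sd = math.sqrt(sum_sq_dev / (n - 1))               # step 4
print(round(sd, 2))  # -> 2.16
```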
The raw score formula is:

S.D. = √( (∑X² – (∑X)²/n) / (n – 1) )

Where ∑X² is the sum of the squared scores, ∑X is the sum of the scores,
and n is the number of scores.
X     X²
7     49
2      4
7     49
6     36
5     25
6     36
2      4
∑X = 35   ∑X² = 203

S.D. = √( (203 – 35²/7) / (7 – 1) ) = √(28/6) ≈ 2.16
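With the two running sums from the table, the raw score formula gives the same result as the deviation formula:

```python
import math

scores = [7, 2, 7, 6, 5, 6, 2]
n = len(scores)

sum_x = sum(scores)                    # sum of X  = 35
sum_x2 = sum(x * x for x in scores)    # sum of X² = 203

# Only the two sums are needed; no per-score deviations.
sd = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
print(round(sd, 2))  # -> 2.16
```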
How do you interpret the scores in relation to the mean and S.D?
In reporting your pupils' scores, you need to report both the mean and
the S.D.
A test norm allows meaningful interpretation of test scores.
A person's raw test score is meaningless unless evaluated in terms
of the standardized group norms. For example, if a student
receives a raw score of 78 out of 100 in history, does that mean
that the student is doing well?
The score of 78 can be interpreted only when the norms are consulted. If
the mean of the test norm is 80 and the standard deviation is 10, the
score of 78 can be evaluated as "typical" performance, indicating that
the student possesses an average knowledge of history.
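One common way to express how far a raw score sits from the norm mean is the standard (z) score, z = (raw − mean) / S.D. The z-score is not introduced in the text itself and is used here only as an illustration of comparing a score to its norms:

```python
def z_score(raw, norm_mean, norm_sd):
    # Distance of the raw score from the norm mean, in S.D. units.
    return (raw - norm_mean) / norm_sd

# History example: raw score 78, norm mean 80, S.D. 10.
print(z_score(78, 80, 10))  # -> -0.2, just below the mean: "typical"
```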
SELF-ASSESSMENT EXERCISE
Use raw score formula to compute the mean and the SD for the test
scores of the following two groups of students:
Group 1: 9, 5, 1
Group 2: 5, 6, 4
What does the S.D. tell you about these two groups?
For Group 1 you should get an SD of 4.
For Group 2 you should get an SD of 1.
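These answers can be checked quickly in code; with the n − 1 denominator, Group 1's summed squared deviations are 32 (giving √16 = 4) and Group 2's are 2 (giving √1 = 1):

```python
import math

def sd(scores):
    # Deviation formula with n - 1 in the denominator.
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

print(sd([9, 5, 1]))  # Group 1 -> 4.0
print(sd([5, 6, 4]))  # Group 2 -> 1.0
```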
Interpretation
[Figure: frequency distributions of test scores for Group 1 and Group 2,
plotted against score values 0 – 9]
For both groups, the test scores have the same mean but different
variability or spread of scores. Students in Group 1 have a larger S.D.
(SD = 4), indicating that they are more heterogeneous in ability.
Students in Group 2 have a smaller S.D. (SD = 1), indicating that they
are more homogeneous in ability.
**********************************************************