
Wolaita Sodo University

College of Education and Behavioral Studies

Department of Psychology
Assessment and Evaluation, PGDT 423
Outline
CHAPTER ONE: Definitions of basic terms
 Test
 Measurement
 Assessment
 Evaluation
CHAPTER TWO: Test development
CHAPTER THREE:
 Test administration
 Test scoring
 Test interpretation
CHAPTER FOUR: Reliability and validity of tests
CHAPTER FIVE: Item analysis
 Item difficulty, discriminating power, distracters, and item bank
Brainstorming question

Q. Define the terms:
 Test
 Assessment
 Measurement
 Evaluation
CHAPTER ONE
INTRODUCTION: DEFINITION OF BASIC TERMS
 Test: a measuring tool or instrument in education.
 It is the most commonly used method of making measurements in
education.
 It is a systematic procedure for measuring a sample of behavior by
posing a set of questions in a uniform manner.
 It is designed to measure any quality, ability, skill or knowledge.
 There is a right or wrong answer.
 Testing: the process of administering the test, scoring it and
interpreting the result.
 Measurement: the process of assigning numbers to the
characteristics of persons, objects or events.
 It is the process of quantifying an individual’s achievement,
personality, attitudes, habits and skills.
 It gives a numerical description of the degree to which an individual
possesses a characteristic.
 It uses a variety of instruments: tests, rating scales, etc.
 It quantifies how much the learner has learned.
 Assessment: the process of collecting information from
multiple sources.
 It includes:
 Rating scales,
 Observation,
 Portfolios,
 Interviews,
 Tests, etc.
 Assessment can be formal or informal.
 Assessment may be descriptive rather than judgmental in
nature.
 Its role is to increase students’ learning and development.
 It helps learners to diagnose their problems and to improve
the quality of their subsequent learning.
PRINCIPLES OF ASSESSMENT
 Assessment should be relevant.
 Assessment should be appropriate.
 Assessment should be fair.
 Assessment should be accurate (objective).
 Assessment should give attention to outcomes and processes.
 Assessment should incorporate feedback mechanisms.
GENERAL PURPOSES OF ASSESSMENT
 Assessment of Learning
1. summative in nature (at the end)
2. decisions/judgments about students’ achievement
 Assessment for Learning
1. formative in nature (ongoing)
2. while teaching and learning are in progress
3. checks the effectiveness of the teaching-learning process
 Assessment as Learning
1. It occurs when students are their own assessors.
 Evaluation is a process because it includes a series of steps:
 establishing objectives,
 classifying objectives,
 defining objectives,
 selecting indicators, and
 comparing data with objectives.
 It is concerned with making judgments on the worth or value of a
performance.
 It answers the question: how good, adequate, or desirable?
 It is a systematic process of determining the extent to which
instructional objectives are achieved.
Types of Evaluation
The different types of evaluation are:
Placement
Formative
Diagnostic
Summative
Criterion referenced
Norm referenced
A. Placement Evaluation
 This is a type of evaluation carried out in order to place students in
the appropriate group or class.
 It takes the form of a pretest or aptitude test.
 Finding the right person for the right course.
B. Formative Evaluation
This is a type of evaluation designed to help both the student
and teacher to pinpoint areas where the student has failed to
learn so that this failure may be corrected.
 It provides a feedback to the teacher and the student and thus
estimating teaching success.
e.g. weekly tests etc.

C. Diagnostic Evaluation
This is applied during instruction to find out the underlying
causes of students’ persistent learning difficulties.
D. Summative evaluation
This is the type of evaluation carried out at the end of the course
of instruction to determine the extent to which the instructional
objectives have been achieved.
It is called a summarizing evaluation because it looks at the
entire course of instruction or programme and can pass judgment
on both the teacher and students, the curriculum and the entire
system.
It is used for certification.
CONT….
Formative evaluation
 Evaluation is performed to determine how well students have
mastered various elements.
 Deals with only a segment.
 Tests can be administered after completion of the units.
 Immediate feedback.
 Diagnostic and progress tests are possible.
 Weaknesses and strengths of the students can be understood.
Summative evaluation
 Evaluation is performed simply to grade the students at the end of the
course.
 Deals with the whole in a detailed manner.
 Tests can be given after the completion of the course/program.
 Feedback is not possible immediately.
 Achievement examinations are possible.
 Success and failure of the students can be determined.
CRITERION-REFERENCED EVALUATION
 Evaluation of performance by judging an individual's behavior, performance, or
knowledge against specific criteria or standards.
 Based on a predetermined set of criteria. For instance,
 85% and up = A
 80% to 84.99% = A-
 75% to 79.99% = B+
 70% to 74.99% = B
 65% to 69.99% = B-
 60% to 64.99% = C+
 50% to 59.99% = C
 It determines individual performance in comparison to some standard or criterion.
 It enables us to describe what an individual can do, without reference to others’
performance.
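The cut-score scheme above can be sketched as a small lookup. This is a minimal illustration, not part of the slides; the names GRADE_CUTOFFS and letter_grade are hypothetical, and grades below 50% are not named on the slide.

```python
# Map a percentage score to a letter grade using the criterion table above.
# GRADE_CUTOFFS and letter_grade are illustrative names (assumptions).
GRADE_CUTOFFS = [
    (85.0, "A"), (80.0, "A-"), (75.0, "B+"), (70.0, "B"),
    (65.0, "B-"), (60.0, "C+"), (50.0, "C"),
]

def letter_grade(percent):
    """Return the grade whose lower cutoff the score meets or exceeds."""
    for cutoff, grade in GRADE_CUTOFFS:
        if percent >= cutoff:
            return grade
    return "F"  # assumed: the slide does not name grades below 50%

print(letter_grade(84.5))  # A-
print(letter_grade(50.0))  # C
```

Because the criteria are fixed in advance, the grade depends only on the individual's score, never on how classmates performed.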
Norm-referenced Evaluation
 Norm-referenced measures are designed to compare students’ performance according
to relative position in some known group.
 They determine individual performance in comparison to others.
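Relative position in a norm group is commonly reported as a percentile rank: the percentage of the group scoring below a given score. A minimal sketch, with hypothetical names (percentile_rank, norm_group); the half-weighting of ties is one common convention, not stated on the slides.

```python
# Percentile rank: percentage of the norm group scoring below a given score,
# counting half of any tied scores. All names here are illustrative.
def percentile_rank(score, norm_group):
    below = sum(1 for s in norm_group if s < score)
    ties = sum(1 for s in norm_group if s == score)
    return 100.0 * (below + 0.5 * ties) / len(norm_group)

group = [55, 60, 62, 70, 75, 80, 85, 90]
print(percentile_rank(75, group))  # 56.25: just over half the group scored lower
```

Note that the same raw score can yield a very different percentile rank in a different group, which is exactly what distinguishes norm-referenced from criterion-referenced evaluation.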
The Purpose of Assessment and Evaluation

i. Placement of students: bringing students appropriately into the
learning sequence, and classification or streaming of students
according to ability or subjects. Finding the right person for the
right place.

ii. Selecting students for courses – general, professional,
technical, commercial, etc.

iii. Certification: this helps to certify that a student has achieved
a particular level of performance.

iv. Stimulating learning: this can be motivating the student or
teacher, providing feedback, suggesting suitable practice, etc.

v. Improving teaching: by helping to review the effectiveness
of teaching arrangements.

vi. For guidance and counseling services.

vii. For modification of the curriculum.

viii. For modification of teaching methods.

ix. For the promotion of students.

x. For reporting students’ progress to their parents.

xi. For the award of scholarships and merit.

xii. For the admission of students into educational institutions.
Continuous assessment
Continuous assessment (CA): a process of gathering valid
and reliable information about students’ learning performance by
using a variety of methods on an ongoing basis.
It is more likely to be formative, process-oriented, informal, and
learner-oriented. E.g.
1. Daily work such as essays, quizzes, presentations and
participation in class.
2. Projects/term papers.
3. Practical work (e.g. laboratory work, fieldwork).
 Continuous assessment occurs frequently during the school year and is
part of regular teacher-pupil interactions.
 CA includes recording, reflecting on and reporting the findings of
assessment by giving positive, supportive and motivational feedback
to the learner, educator, parents or other stakeholders.
 The main aim of CA is to provide students with maximum
opportunities to learn, demonstrate what they have acquired and
actively participate in the teaching-learning process.
Why continuous assessment?
 It enables us to obtain an overall picture of the learning performance
and behavior of students.
 It enables us to obtain valid and reliable information about the learning
performance of students.
 It provides students with maximum opportunities to actively
participate and demonstrate the knowledge, skills and attitudes they
have acquired during instruction.
 It improves the teachers’ teaching skills.
 The pressure of continuous assessment can ensure optimum
performance of the teachers as well.
Methods of collecting data in continuous assessment
There are two main methods of collecting data about students
learning: Formal and Informal methods
Formal method
 Formal assessment is the procedure for gathering information
about learners in a planed manner with respect to instructional
objectives.
E.g. include:
written classroom tests,
performance assessment tasks,
home works,
laboratory works,
project works, etc.
Informal method
Informal assessments are spontaneous forms of assessment that can
easily be incorporated into day-to-day classroom activities and that
measure the students’ performance and progress.
Informal assessments are content and performance driven.
They can be practiced during and outside the classroom.
They are not necessarily carefully planned.
Informal assessments include:
 observing learners’ work,
 interviewing and questioning.
ASSESSMENT STRATEGIES
 Among the various assessment strategies that can be used by
classroom teachers are:
 Classroom presentations;
 Conferences;
 Exhibitions/demonstrations;
 Interviews;
 Observation;
 Performance tasks;
 Portfolios;
 Questions and answers;
 One-minute paper;
 Muddiest point;
 Student-generated test questions;
 Tests, etc.
ASSESSMENT IN LARGE CLASSES
 The existing educational literature has identified various
assessment issues associated with large classes. They include:
 Surface learning approach: traditionally, teachers rely on time-
efficient and exam-based assessment methods for assessing large
classes, such as multiple-choice and short-answer questions.
 Higher-level learning such as critical thinking and analysis is
often not fully assessed.
 Feedback is often inadequate: with a large class, teachers may
not have time to give detailed and constructive feedback to every
student.
 Inconsistency in marking
 Difficulty in monitoring cheating and plagiarism
 Lack of interaction and engagement
 There are a number of ways to make the assessment of large
numbers of students more effective whilst still supporting
effective student learning. These include:
 Front ending: by putting in an increased effort at the beginning,
setting up the students for the work they are going to do, the
work submitted can be improved.
 Making use of in-class assignments
 Self- and peer-assessment
 Group assessments
 Changing the assessment method, or at least shortening it.

Challenges for implementing continuous assessment
1. Large class size
2. Teaching and learning resources
3. Knowledge and skill of the teacher
4. Lack of time
5. Students’ attitudes
6. Teachers’ attitudes
DOMAINS OF THE TAXONOMY OF INSTRUCTIONAL
OBJECTIVES
 Instructional objectives are statements of the desired changes
in behavior resulting from a specific teaching-learning activity.
 Desired end results or goals are the expected or anticipated end results.
 They are the results sought by the learner at the end of the educational
program, i.e.
 what the students should be able to do at the end of a learning
period that they could not do beforehand.
CLASSIFICATION OF EDUCATIONAL OBJECTIVES
 In this taxonomy, Bloom et al. (1956) divided educational
objectives into three domains.
 These are the cognitive domain, the affective domain and the
psychomotor domain.
 Each domain is further categorized into hierarchical levels.
COGNITIVE DOMAIN
This involves those objectives that deal with the development of intellectual abilities and
skills.
These have to do with the mental abilities of the brain (knowledge, comprehension,
application, analysis, synthesis and evaluation).
AFFECTIVE DOMAIN
The affective domain has to do with feelings and emotions.
It is concerned with interests, attitudes, appreciation, emotional biases and values
(receiving, responding, valuing, organization and characterization).
PSYCHOMOTOR DOMAIN
The psychomotor domain has to do with motor skills or abilities.
It deals with activities which involve the use of the hand or the whole body
(imitation, manipulation, precision, articulation and naturalization).
Consider the skills in running, walking, swimming, jumping, eating, playing, throwing,
etc.
CHAPTER TWO
Test Development – Planning the Test
 Most classroom tests are developed for one or more of the following
purposes:
 To establish a basis for assigning grades.
 To determine how well each student has achieved the course
objectives.
 To diagnose student problems for remedial action.
 To determine where instruction needs improvement.
 Developing a good test is like target shooting.
 Hitting the target requires planning.
 Developing a good test requires planning: you must determine the
purpose for the test, and carefully write appropriate test items to
achieve that purpose.
Planning a good classroom test
 A good test cannot just happen!!
 A good test is designed carefully and evaluated empirically
to ensure that it generates accurate and useful information.
Some preliminary considerations
 Determining the objectives of testing
 Selecting appropriate item types
 Preparing test specifications
Determining the objectives of testing
 It must be decided whether the test will be used to measure entry
performance or the previous knowledge acquired by the student on the
subject.
Checking the quality of objectives
 Do the objectives appropriately reflect all the intended outcomes?
 Are they observable and measurable, and are the outcomes clearly defined?
 Are they attainable by the intended learners in the time available?
 Do they reflect the course and curriculum aims?
Selecting appropriate item types
 The mode of item presentation must be considered: oral, paper-and-
pencil, etc.
 true-false
 completion
 multiple choice
 matching and
 essay
 Studies show that arranging items from easy to hard will
yield higher scores than arranging from hard to easy.
Table of specification, TOS (Blueprint)
 It is a two-way chart (columns and rows).
 It represents the learning outcomes to be tested, the percentage of
items, item placement, the type of test, and the number of items.
 The blueprint represents the master plan and should readily
guide you in item writing and review.
A table of specifications is developed before the test is written.
A table of specifications benefits students in two ways.
 First, it improves the validity of teacher-made tests.
 Second, it can improve student learning as well.
Preparation of the Test Blueprint
 Test blueprint is a table showing the number of items that will be
asked under each topic of the content and the process objective.
 This is why it is often called a Specification Table.
 There are two dimensions to the test blueprint, the content and the
process objectives.
 The content consists of the series of topics from which the competence of the
pupils is to be tested.
 These are usually listed on the left hand side of the table.
 The process objectives or mental processes are usually listed on the top-row of
the table.
 The process objectives are derived from the behavioral objectives stated for the
course initially.
Instructional Objectives

Contents       Knowledge  Comprehension  Application  Analysis  Synthesis  Evaluation  Total Percent
Air pressure                                                                           24%
Wind                                                                                   16%
Temperature                                                                            28%
Rainfall                                                                               20%
Clouds                                                                                 12%
Total Percent  28%        32%            16%          16%       4%         4%          100%
Weighting of the Content and Process Objectives
 The proportion of test items on each topic depends on the
emphasis placed on it during teaching and the amount of time
spent.
 Also, the proportion of items on each process objectives
depends on how important you view the particular process skill
to the level of students to be tested.
• However, it is important that you make the test a balanced one
in terms of the content and the process objectives you have been
trying to achieve through your series of lessons.
Cont….
 Percentages are usually assigned to the topics of the content and the
process objectives such that each dimension will add up to 100%.

 After this, you should decide on the type of test you want to use and
this will depend on the process objective to be measured, the
content and your own skill in constructing the different types of
tests.
Cont….
Determination of the Total Number of Items
 At this stage, you consider the time available for the test, types of
test items to be used (essay or objective) and other factors like the
age, ability level of the students and the type of process objectives
to be measured.

 When this decision is made, you then proceed to determine the total
number of items for each topic and process objectives as follows:
 To obtain the number of items per topic, you multiply the percentage of
each by the total number of items to be constructed and divide by 100.

 This you will record in the column in front of each topic in the extreme
right corner of the blueprint.

 In the table below, 24% was assigned to Air Pressure. The total number
of items is 25 hence 6 items for the topic (24% of 25 items = 6 items).
 To obtain the number of items per process objective, we also
multiply the percentage of each by the total number of items for
test and divide by 100.
 These will be recorded in the bottom row of the blueprint under
each process objective. In the table below:
I. The percentage assigned to comprehension is 32% of the
total number of items which is 25. Hence, there will be 8
items for this objective (32% of 25 items).
II. To decide the number of items in each cell of the blueprint,
you simply multiply the total number of items in a topic by
the percentage assigned to the process objective in each column
and divide by 100.
 This procedure is repeated for all the cells in the blueprint.
For example, to obtain the number of items on Wind under
Knowledge, you multiply 4 (the items on Wind) by 28% and
divide by 100, i.e. 1.12 ≈ 1.
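The arithmetic just described can be sketched in a few lines. The names (TOTAL_ITEMS, topic_pct, process_pct, cell) are hypothetical, and simple rounding may differ by an item or two from a blueprint adjusted by teacher judgment.

```python
# Distribute 25 items across topics and blueprint cells by percentage weight:
# items per topic = topic % x total / 100; items per cell = process % x topic items / 100.
TOTAL_ITEMS = 25
topic_pct = {"Air pressure": 24, "Wind": 16, "Temperature": 28,
             "Rainfall": 20, "Clouds": 12}
process_pct = {"Knowledge": 28, "Comprehension": 32, "Application": 16,
               "Analysis": 16, "Synthesis": 4, "Evaluation": 4}

items_per_topic = {t: round(p * TOTAL_ITEMS / 100) for t, p in topic_pct.items()}

cell = {(t, o): round(items_per_topic[t] * q / 100)
        for t in topic_pct for o, q in process_pct.items()}

print(items_per_topic["Air pressure"])  # 24% of 25 = 6
print(cell[("Wind", "Knowledge")])      # 28% of 4 = 1.12, rounded to 1
```

Because each dimension of percentages sums to 100%, the per-topic counts add back up to the intended total of items.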
Instructional Objectives

Contents      Knowledge  Comprehension  Application  Analysis  Synthesis  Evaluation  Total  Percent
Air pressure  2          2              1            1         -          -           6      24%
Wind          1          1              1            1         -          -           4      16%
Temperature   2          2              1            1         -          1           7      28%
Rainfall      1          2              1            -         1          -           5      20%
Clouds        1          1              -            1         -          -           3      12%
Total         7          8              4            4         1          1           25     100%
Percent       28%        32%            16%          16%       4%         4%
 There are also other ways of developing a test blueprint.
 One of these shows the distribution of test items
among the content areas and the type of test items to be
developed from each content area.
For example, the table of specification that we have seen earlier can
be prepared in the following way.
Item Types

Contents      True/False  Matching  Short Answer  Multiple Choice  Total  Percent
Air pressure  1           1         1             3                6      24%
Wind          1           1         1             1                4      16%
Temperature   1           2         1             3                7      28%
Rainfall      1           1         1             2                5      20%
Clouds        1           -         1             1                3      12%
Total         5           5         5             10               25     100%
Percent       20%         20%       20%           40%
Item Writing
General guidelines for item writing:
Keep the test blueprint in mind.
Select the type of test item that measures the intended learning outcomes.
Generate more items than specified in the table of specification.
Refrain from providing unnecessary clues to the correct answer.
Eliminate any systematic pattern for answers that would allow students to guess
answers correctly.
Make the instructions for each type of question simple and brief.
Write items that require specific understanding or ability developed in that
subject.
Do not suggest the answer to one question in the body of another question.
 Determine an appropriate difficulty level for students.
 Give enough time to allow an average student to complete the task.
 Build in a good scoring guide at the point of writing the test items.
 Exclude extraneous or irrelevant information.
 Use simple, precise, and unambiguous wording.
 Show the mark or point value for each question.
 Have the test critiqued by one or more colleagues.
Types of classroom tests
1. Objective tests
2. Subjective tests
1. Objective test items
 Objective tests are highly structured and require the test taker
to select the correct answer from several alternatives or to
supply a word or short phrase to answer a question.
 They are called objective because they have a single right or best
answer that can be determined in advance.
Types of Objective test
 The objective test can be classified into:
 Those that require the examinee to supply the answer to the test
items (free-response type), and
 Those that require the examinee to select the answer from a given
number of alternatives (fixed-response type).
 The free-response type consists of short-answer and
completion items, sometimes called supply-type items.
 The fixed-response type is commonly further divided into
true-false (alternative-response), matching and multiple-choice
items, sometimes called selection-type test items.
Short Answer/Completion Test Items
 Short-answer items and completion test items are essentially the
same: both can be answered by a word, phrase, number or formula.
 They differ in the way the problem is presented.
 The short-answer type uses a direct question,
 whereas the completion test item consists of an incomplete
statement the student is required to complete.
Short answer item: In which year did the Ethiopians defeat the Italian invaders at
Adwa?

Completion item: The Ethiopian forces defeated the Italian invaders at Adwa in the
year _____
Advantages of short answer/completion items
 Short-answer test items are among the easiest to construct.
 They reduce the possibility that students will obtain the correct
answer by guessing.
Disadvantages of short answer/completion items
 They are unsuitable for assessing complex learning outcomes.
 They can be difficult to score; this is especially true where the item is not
clearly phrased to require a definitely correct answer, or where the
student’s spelling ability interferes.
Suggestions to make good short-answer type
• Word the item so that the required answer is both brief and
specific.
Examples
o Poor: An animal that eats the flesh of other animals is
______________.
o Better: An animal that eats the flesh of other animals is called
_______________ (carnivorous).
• Do not take statements directly from textbooks to use as a basis
for short-answer items.
Examples
o Poor: Chlorine is _____________.
o Better: Chlorine belongs to a group of elements which
combine with metals to form salt. It is therefore, called a
________ (halogen).
 A direct question is generally more desirable than an incomplete
statement.
Examples
– Poor: Yuri Gagarin made his orbital flight around the earth in
_____(1961)(but the answer could be also in space shuttle)
– Better: When did Yuri Gagarin make his orbital flight around the
earth? (1961)
 If the answer is to be expressed in numerical units, indicate the type of
answer wanted.
Examples
– Poor: Sound travels _________________ in a second.
– Better: Sound travels ________________ meter in a second.
 When completion items are used, do not use too many blanks.
______________________
______________________
_____________________
The Alternative Response Test Item
 The alternative-response test item is commonly called the true-false
test item.
 It consists of a declarative statement to which the examinee is
asked to give one of two options.
 The two options could be true or false, right or wrong.
Advantages of true/false items
 It is easy to construct alternative-response items, but the validity and reliability of such
items depend on the skill of the item constructor.
 They do not require much time from the student for answering.
 They allow a teacher to cover a wide range of content by using a large number of such items.
 They can be scored quickly, reliably, and objectively by anybody using an answer
key.
 If carefully constructed, true/false test items also have the advantage of measuring higher
mental processes of understanding, application and interpretation.
Disadvantages of true/false items
 When they are used exclusively, they tend to promote
memorization of factual information: names, dates, definitions.
 They can often lead a teacher to favor testing of trivial knowledge.
 They encourage students to guess.
 They can often lead a teacher to write ambiguous statements due to the
difficulty of writing statements which are clearly true or false.
 They do not discriminate between students of varying ability as well as
other test items.
 They can often include more irrelevant clues than do other item types.
Suggestions to construct good quality true/false test items
 Don’t use all-inclusive words such as “all, always, never, none,
no”, etc. within the framework of a true-false test item.
Example
– Poor: The set of integers includes the set of all natural
numbers. T/F
– Better: The set of integers includes the set of natural
numbers. T/F
 Don’t use indefinite terms such as “greatly, usually,
frequently, and sometimes” etc.
Examples
– Poor: Validity is usually of more concern to the tester than
reliability in testing. T/F
– Better: Validity is adjudged more important than reliability
in testing. T/F
Suggestions to construct good quality true/false test items

 Avoid negative statements, and never use double negatives.


In Right-Wrong or True-False items, negatively phrased
statements make it needlessly difficult for students to decide
whether that statement is accurate or inaccurate.
Examples
– Poor: There is no odd item in the following elements
“sodium and potassium.” T/F
– Better: The elements, “sodium and potassium” have nothing
in common. T/F
 Restrict single-item statements to single concepts.
Don’t use more than one single idea per item.
Examples
– Poor: From the construction stage, validity and reliability of
a test could be ensured by the use of a test blueprint. T/F
– Better: A test blueprint enhances content validity from the
construction stage. T/F
Suggestions to construct good quality true/false test items
 Use an approximately equal number of items reflecting the
two categories tested; i.e., the number of true statements and false
statements should be approximately equal (40 – 60%).
 Make statements representing both categories equal in length.
 Don’t allow a distinctive difference in length between a
true statement and a false one.
Matching Items
 Matching test items usually consist of two parallel columns.
 One column contains a list of words, numbers, symbols or other
stimuli (premises) to be matched to a word, sentence, phrase or
other possible answer from the other column (responses).
 The examinee is directed to match the responses to the appropriate
premises.
 Usually, the two lists have some sort of relationship.
 Sometimes the premises and responses lists are an imperfect
match, with more entries in one of the two columns, and the
directions indicate what is to be done.
 For instance, the examinee may be required to use a response
more than once, once, or not at all.
 This deliberate procedure is used to prevent examinees from
matching the final pair of items on the basis of elimination.
Merits of Matching Items
 It is possible to measure a large amount of related factual
material in a relatively short time.
 The guess factor can be controlled by skillfully constructing the
items such that the correct response for each premise must also
serve as a plausible response for the other premises.
 The scoring is simple and objective.
Limitations of Matching Items
 They are restricted to the measurement of factual information.
 Many topics are unique and cannot be conveniently grouped in
homogeneous matching clusters, and it is sometimes difficult to get
homogeneous clusters of premises and responses.
 They require extreme care during construction in order to avoid
encouraging serial memorization rather than association.
Suggestions for constructing good matching items
 Use fairly brief lists, placing the shorter entries on the right.
Column A (premises)                                Column B (responses)
1. A special date that you spent with your friends A) Echoic memory
2. Watching a sport program on TV                  B) Iconic memory
3. Radio information                               C) Semantic memory
                                                   D) Episodic memory
 Employ homogeneous lists.
 Include more responses than premises.
 List responses in a logical order.
 Describe the basis for matching and the number of times a
response can be used (“Each response in the list at the right
may be used once, more than once, or not at all.”)
 Try to place all premises and responses for any matching item
on a single page.
Multiple-Choice Items
 They can effectively measure many of the simple learning outcomes.
 They can also measure a variety of complex cognitive learning outcomes.
 A multiple-choice item consists of a problem (item stem) and a list of
suggested solutions (alternatives, choices or options).
 The correct response is called the key answer; the remaining
alternatives are called distractors.
 A direct-question (best-answer) multiple-choice item
• Which of the following European countries has suffered most from
the consequences of the Second World War?
A. Germany B. Britain C. France D. Russia/USSR
 An incomplete-statement (correct-answer) multiple-choice item
• The Second World War started in the year __
A. 1936 B. 1939 C. 1941 D. 1945
Advantages of multiple-choice items
 They have widespread applicability to the assessment of cognitive skills and
knowledge.
 Cleverly constructed multiple-choice items can present very high-level
cognitive challenges to students.
 They are easy to score.
 They are objective to score.
Disadvantages of multiple-choice items
 They measure problem-solving behavior at the verbal level only.
 They are very difficult and time-consuming to construct.
Suggestions for Good Multiple-choice items
 The question or problem in the stem must be self-contained.
Examples
• Poor: - Light:-
A. is reflected most when reaching the surface of colored object.
B. is transmitted most through a translucent body.
C. is transmitted most through transparent body.
D. traveling in a block of a glass refracts into the air before striking the
boundary surface.
• Better: - Through which of the following objects is light
transmitted most?
A. Translucent
B. Transparent
C. Opaque
D. Black
Suggestions for Good Multiple-choice items
 The item stem should be free from irrelevant material and be clear.
Examples
• Poor: As animals need food, locomotion is vital for them and many aspects of
their structure are related to this behavior. Which structures are used by
paramecium for this purpose?
A) Nematocyts
B) Flagella
C) Pseudopodia
D) Cilia
• Better: Which one of the following structures is used by paramecium for
movement?
A) Nematocyts
B) Flagella
C) Pseudopodia
D) Cilia
Suggestions for Good Multiple-choice items
 Use a negatively stated item stem only when significant learning outcomes require it.
Negative item stems should be avoided as much as possible. However, for learning
outcomes that involve safety or avoiding harmful conditions, you can use negatives. If you
use negatively stated items, highlight the negative word by capitalizing, bolding or underlining it.
• Examples
• Poor: Which one of the following is not located north of Addis Ababa?
A)Fiche
B) Gohatsion
C) Shashemene
D) Debrebrihan
• Better: Which one of the following is located south of Addis Ababa?
A) Fiche
B) Gohatsion
C) Shashemene
D) Debrebrihan
• Poor: Which one of the following is not a safe driving practice on ice roads?
A) Accelerating slowly
B) Jamming on the brakes
C) Holding the wheel firmly
D) Slowing down gradually
• Better: All of the following are safe driving practices on ice roads except:
A) accelerating slowly
B) jamming on the brakes
C) holding the wheel firmly
D) slowing down gradually
Suggestions for Good Multiple-choice items
 Each alternative must be grammatically consistent with the item’s
stem
Examples
• Poor: An electric transformer can be used:-
A) to increase the voltage of alternating current.
B) for storing up electricity.
C) it converts electrical energy into mechanical energy.
D) alternating current is charged to direct current.
• Better: An electric transformer can be used to:-
A) increase the voltage of altering current.
B) store up electricity.
C) convert electrical energy into mechanical energy.
D) change alternating current to direct current.
Suggestions for Good Multiple-choice items
 All distracters should be plausible (look like the answer).
Examples
• Poor: Who discovered the North Pole?
A) Christopher Columbus
B) Ferdinand Magellan
C) Marco Polo
D) Robert Peary
• Better: Who discovered the North Pole?
A) Roald Amundsen
B) Richard Byrd
C) Robert Scott
D) Robert Peary
Suggestions for Good Multiple-choice items
 Verbal association between the stem and the correct answer should
be avoided.
Examples
• Poor: A four sided figure whose opposite sides are parallel is:-
A) a trapezoid
B) an octagon
C) a parallelogram
D) a hexagon
• Better: A four sided figure whose opposite sides are parallel is:
A) a trapezoid
B) an octagon
C) a rhombus
D) a hexagon
Suggestions for Good Multiple-choice items
 An item should contain only one correct or clearly best answer.
 Items used to measure understanding should contain some novelty,
but not too much novelty.
 The relative length of the alternatives should be approximately
equal. Longer alternative tend to be the correct answer that it may
provide a clue.
 Never use “all of the above” as an answer choice, but use “none of
the above” to make items more demanding.
2. Subjective test items
Essay items
Essay tests consist of questions (items) designed to elicit free
responses from the learners.
You should use essay questions in the measurement of complex
achievement.
Essay questions should also be used to measure those learning
outcomes that cannot be measured by objective test items.
Students have the freedom to express or state the answers in their
own words.
Classification of Essay Items
 There are two types of essay items. These are:
1. Extended response
 In this type, questions are framed so that the student is not limited in the
extent to which he or she discusses the issues raised or the question
asked.
Example
 Describe the sampling technique used in research studies.
 Explain the various ways of preventing car accident in Ethiopia.
2. Restricted response
In this type, the questions are structured so that the students are limited: the scope of the
response is defined and restricted.
The answers given are to some extent controlled.
Example
Give three advantages and two disadvantages of essay tests.
State four uses of tests in education.
Advantages of Essay Items
 Essay items are an effective way to measure higher-level cognitive objectives.
 They are less time-consuming to construct.
 Students do not memorize facts, but try to get a broad understanding of
complex ideas, to see relationships, etc.
 They present a more realistic task to the student.
Disadvantages of Essay Items
 Only a few questions can be included, so the sample of content is limited.
 They require a long time to read and score.
 They are difficult to score objectively and reliably.
 Research shows that a number of factors can bias the scoring.
 Scores are influenced by quality of handwriting, neatness, spelling,
grammar, vocabulary, etc.
Suggestions for the construction of good essay items
 Restrict the use of essay questions to those learning outcomes that
cannot be measured satisfactorily by objective items.
 For each question, specify the point value.
 Employ more questions requiring shorter answers rather than
fewer questions requiring longer answers.
 Don’t employ optional questions.
Arrangement of test items
 True-false items should be grouped together,
 Matching items
 Short answer or completion items
 Multiple choice items
 Essay items
Advantages of this arrangement:
to motivate students
to retain the same mental set
to facilitate scoring
Unit Three
Test Administration, Scoring and Interpretation
Test Administration
Test administration refers to the procedure of actually presenting the
learning task.
This procedure is as important as the process of preparing the test.
Validity and reliability of test scores can be greatly reduced when a test is
poorly administered.
While administering a test, all examinees must be given a fair chance to
demonstrate their achievement.
Administering Tests
 Conducive physical conditions involve:
 Adequate work place
 Light
 Ventilation/fresh air
 Free from noise
 Conducive psychological conditions involve:
 Creating good motivation
 Creating positive attitude,
 Other good internal states of the examinees.
 It is also concerned with selecting convenient and accurate
procedures for scoring the results.
Common guidelines for administering tests
 Consider the suitability of the testing place for the students.
 Make sure that the students understand the directions.
 Keep time accurately and inform students of the time left at
regular time intervals.
 Do not talk unnecessarily before the test; avoid threatening and
warning students.
 Remind students to check their copies.
 Avoid giving hints to pupils who ask about individual items.
 Collect tests uniformly.
 Discourage cheating.
Techniques to prevent cheating
 Take special precautions (safety measures).
 Have students clear the tops of their desks.
 Proctor the testing session carefully (e.g., walk around).
 Use a special seating arrangement.
 Use two or more forms of the test (coded).
 Educate students that cheating is unnecessary.
 Maintain proper control during the examination time.
Scoring the Test
 Scoring is an activity to assign a value that can represent the
degree of correct response to an assessment task.
 In the evaluation of classroom learning outcomes, marking
schemes are prepared alongside the construction of the test
items in order to score the test objectively.
Rigor of Scoring
Scoring Objective Test
1. Manual Scoring
 In this method of scoring the answer to test items are scored by direct
comparison of the examinees answer with the marking key.
2. Machine Scoring
Usually for a large number of examinees, a specially prepared answer
sheets are used to answer the questions.
The answers are normally shaded at the appropriate places assigned to
the various items.
These special answer sheets are then machine scored with computers
and other possible scoring devices using certified answer key prepared
for the test items.
Scoring an essay items
 There are two common methods of scoring essay questions.
1. The Point or Analytic Method
 Each answer is compared with already prepared ideal
marking scheme (scoring key).
 It is preferable to the global scoring approach because it helps
to minimize Halo effects and leniency error.
2. The Global/Holistic Rating Method
 The examiner first sorts the response into categories of
varying quality based on his general or global impression on
reading the response and then correct it.
Additional strategies used to minimize bias in scoring essay items
 Pay attention only to the significant and relevant aspects of the
answer.
 Apply uniform standards to all papers.
 Give comments and the correct answer for a question.
 Have a positive attitude towards each student.
 Keep all professional ethics.
Kinds of scores
 Nominal
 Categorical data
 Ordinal
 Order
 Satisfaction,
 Happiness and discomfort
 Interval
 Gives us the order of values + the ability to quantify the difference between each one.
 Temperature and time
 Ratio
 Ratio scales give us the most information: order, interval values, plus the ability to calculate ratios, since
a “true zero” can be defined.
 Absolute zero
 Height and weight
Methods of Interpretation of Test results
 Test interpretation is a process of assigning meaning and
usefulness to the scores obtained from classroom test.
1. Criterion-Referenced (Absolute Grading) Interpretation
 Criterion-referenced interpretation is the interpretation of test
score based on the conversion of the score into a description of the
specific tasks that the learner can perform.
2. Norm-Referenced (Relative) Interpretation
 Is the interpretation of test score based on the conversion of the
score into some type of derived score that indicates the learner’s
relative position in a clearly defined referenced group.
Measures of Central Tendency
 It is often important to summarize characteristics of a
distribution of test scores.
 Central tendency tells us about the shape and nature of the distribution.
 There are three basic measures of central tendency – the mean,
the median and the mode.
The Mean
 The mean is the most widely used measure of central tendency.
 It is the average of a set of scores, computed simply by adding together
all scores and dividing by the number of scores:
Mean = ΣX / N
 Calculate the mean of the following data Math’s Class: 82, 93, 86, 97, 82
 Sum the scores (X): 82 + 93 + 86 + 97 + 82 = 440
 Divide the sum (X = 440) by the number of scores (N = 5):
440 / 5 = 88
Mean = 88
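As a quick check, the calculation above can be sketched in Python (the scores are the maths-class example from the slide; the variable names are ours):

```python
# Compute the mean: sum all scores and divide by the number of scores.
scores = [82, 93, 86, 97, 82]

mean = sum(scores) / len(scores)  # 440 / 5

print(mean)  # 88.0
```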
The Median
 In some circumstances, the mean may not be the best indicator of
students’ performance.
 In this case the median is usually considered a better indicator of
student performance.
 Median – This is the value of a variable such that half of the observations
are above and half are below this value i.e. this value divides the distribution
into two groups of equal size.
 When the number of observations is odd, the median is simply equal to the
middle value.
 When the number of observations is even, we take the median to be the
average of the two values in the middle of the distribution.
Example 1   Example 2   Example 3   Example 4
Scores      Scores      Scores      Scores
50          50          49          50
48          49          48          49
48          48          48          47
47          46          47          47
45          46          45          45
44          43          44          45
43          43          43          45
42          42          42          44
42          41          42          42
41          41          41          41
                        38          41
Find the median of this data: 90, 70, 10, 100, 20, 50
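The odd/even rule above can be sketched in Python, here applied to the exercise data (a minimal sketch; the function name is ours):

```python
def median(values):
    """Middle value of the sorted scores; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:          # odd number of observations: take the middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average the middle pair

# Sorted: 10, 20, 50, 70, 90, 100 -> median is (50 + 70) / 2
print(median([90, 70, 10, 100, 20, 50]))  # 60.0
```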
Mode
This is the most frequently occurring value in the distribution.
For example, the following test scores, 7, 7, 7, 20, 23, 23, 24, 25,
26 have a mode of 7.
A set of scores may sometimes have two modes (bimodal) or more
than two modes (multimodal).
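A minimal Python sketch of finding the mode(s), using the score set from the example (the function name is ours; returning a list covers the bimodal/multimodal cases):

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency (one mode, bimodal, or multimodal)."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

print(modes([7, 7, 7, 20, 23, 23, 24, 25, 26]))  # [7]
```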
The Mode
• The mode is the score that occurs most frequently in a set of data.
[Bar chart: frequency of each score on Exam 1 (75–95); the tallest bar marks the mode.]
Bimodal Distributions
• When a distribution has two “modes,” it is called bimodal.
[Bar chart: frequency distribution of Exam 1 scores (75–95) showing two peaks.]
Multimodal Distributions
• If a distribution has more than 2 “modes,” it is called multimodal.
[Bar chart: frequency distribution of Exam 1 scores (75–95) showing several peaks.]
Relations Between the Measures of Central Tendency
 In symmetrical distributions, the median, mode and mean are equal.
– For normal distributions, mean = median = mode.
 In positively skewed distributions, the mean is greater than the median.
 When the distribution has some extremely high scores.
 In negatively skewed distributions, the mean is smaller than the median.
 When the distribution has some very low scores.
Comparison of mean, median and mode
 Mode
– Good for nominal variables.
– Good if you need to know most frequent observation.
– Quick and easy.
 Median
– Good for “bad” distributions.
 Good for distributions with arbitrary ceiling or floor.
 Mean
– Generally preferred except for “bad” distribution.
– Most commonly used statistic for central tendency.
Measure of Variability/dispersion
 Measures of dispersion are descriptive statistics that describe how
similar a set of scores are to each other.
– The more similar the scores are to each other, the lower the
measure of dispersion will be.
– The less similar the scores are to each other, the higher the
measure of dispersion will be.
 Which of two distributions of scores has the larger dispersion?
[Two histograms over scores 1–10, frequencies 0–125: one spread out, one concentrated.]
 The upper distribution has more dispersion because the scores are
more spread out; that is, they are less similar to each other.
 The three most commonly used measures of variability are the range,
variance, and standard deviation.
The Range
 It is the simplest and crudest measure of variability calculated by
subtracting the lowest score from the highest score.
 For example, if the scores of 10 students in a certain test are: 5, 7, 8, 10,
12, 13, 14, 15, 17, 19, then the range will be 19 − 5 = 14.
 The range is used when
– you have ordinal data or
– you are presenting your results to people with little or no
knowledge of statistics.
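The range calculation from the example can be sketched in one line of Python (variable names are ours):

```python
# The range is the highest score minus the lowest score.
scores = [5, 7, 8, 10, 12, 13, 14, 15, 17, 19]

score_range = max(scores) - min(scores)

print(score_range)  # 19 - 5 = 14
```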
Variance
 Population Variance:
σ² = Σ(X − X̄)² / N
 The variance is equal to the average squared deviation from the
mean.
 To compute, take each score and subtract the mean and
square the result. Then find the average over scores. That is
variance.
 High variance means that most scores are far away from the
mean.
 Low variance indicates that most scores cluster tightly about
the mean.
Computing the Variance
(N = 5)
X       X̄      X − X̄     (X − X̄)²
5       15      −10        100
10      15      −5         25
15      15      0          0
20      15      5          25
25      15      10         100
Total:  75                 250
Mean:   15
Variance = 250 / 5 = 50
The Standard Deviation
 The most useful measure of variability, or spread of scores is the
standard deviation.
 It is essentially an average of the degree to which a set of scores
deviates from the mean.
 If the Standard Deviation is large, it means the numbers are
spread out from their mean. Common for heterogeneous group/s.
 If the Standard Deviation is small, it means the numbers are
close to their mean. Common for homogeneous group/s.
 The procedure for calculating a Standard Deviation involves the
following steps:
 Compute the mean.
 Subtract the mean from each individual’s score.
 Square each of these deviation scores.
 Find the sum of the squared deviations, Σ(X − X̄)².
 Divide the sum obtained in step 4 by N, the number of students, to
get the variance.
 Find the square root of the result of step 5. This number is the
standard deviation (SD) of the scores.
 The formula for the standard deviation (SD) is:
SD = √( Σ(X − X̄)² / N )
Standard deviation = √variance
Variance = (standard deviation)²
Computing the Standard Deviation
(N = 5)
X       X̄      X − X̄     (X − X̄)²
5       15      −10        100
10      15      −5         25
15      15      0          0
20      15      5          25
25      15      10         100
Total:  75                 250
Mean:   15
Variance = 250 / 5 = 50
SD = √50 ≈ 7.07
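The variance and standard deviation steps above can be sketched in Python (population formulas, dividing by N; function names are ours):

```python
def variance(scores):
    """Population variance: average squared deviation from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / len(scores)

def standard_deviation(scores):
    """Population standard deviation: square root of the variance."""
    return variance(scores) ** 0.5

scores = [5, 10, 15, 20, 25]
print(variance(scores))            # 250 / 5 = 50.0
print(standard_deviation(scores))  # sqrt(50), about 7.07
```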
Measures of Relationship
 If we have two sets of scores from the same group of people, it is often
desirable to know the degree to which the scores are related.
 For example, we may be interested in the relationship between the test
scores of students for the English Subject and their overall scores of
other subjects.
 The degree of relationship is expressed in terms of coefficient of
correlation.
 The value ranges from -1.00 to +1.00.
 A perfect positive correlation is indicated by a coefficient of +1.00 and
a perfect negative correlation by a coefficient of -1.00.
 A correlation of 0 indicates no relationship between the two sets of
scores. E.g. the relationship between your shoe size and your salary.
 Obviously, the larger the coefficient (positive or negative), the higher
the degree of relationship expressed.
 A positive correlation indicates that the variables increase together.
 A negative correlation indicates that as one variable increases, the other
decreases.
 There are two common measures of relationship expressed as
correlation coefficients.
1. Pearson Product-moment correlation coefficient
 The most commonly used and most useful correlation coefficient.
 It is indicated by the symbol r.
 The Pearson correlation evaluates the linear relationship between two continuous
variables.
The formula for obtaining the coefficient of correlation is:
r = [ΣXY/N − (ΣX/N)(ΣY/N)] / √{[ΣX²/N − (ΣX/N)²][ΣY²/N − (ΣY/N)²]}

An alternative raw-score formula is given after the worked example.
 The following steps serve as a guide for computing a Pearson
product-moment correlation coefficient.
 Begin by writing the pairs of scores to be studied in two columns.

 Make certain that the pair of scores for each student is in the same
row.
 Square each of the entries in the X column and enter the result in
the X2 column.
 Square each of the entries in the Y column and enter the result
in the Y2 column.
 In each row, multiply the entry in the X column by the entry in
the Y column, and enter the result in the XY column.
 Add the entries in each column to find the sum of (∑) each
column.
 Substitute the obtained values in the formula.
cont….
Student   Score X   Score Y   X²      Y²      XY
1         20        24        400     576     480
2         18        21        324     441     378
3         16        23        256     529     368
4         14        20        196     400     280
5         14        18        196     324     252
6         12        14        144     196     168
7         11        16        121     256     176
8         10        12        100     144     120
9         8         10        64      100     80
10        7         12        49      144     84
Sum       130       170       1850    3110    2386
          (∑X)      (∑Y)      (∑X²)   (∑Y²)   (∑XY)
N = 10
cont….
ΣX/N = 130/10 = 13          ΣX²/N = 1850/10 = 185
ΣY/N = 170/10 = 17          ΣY²/N = 3110/10 = 311
                            ΣXY/N = 2386/10 = 238.6

Substituting into the formula:

r = (238.6 − 13 × 17) / √[(185 − 13²)(311 − 17²)]
  = 17.6 / √[(185 − 169)(311 − 289)]
  = 17.6 / √(16 × 22)
  = 17.6 / 18.76 ≈ 0.94
The alternative formula for obtaining the coefficient of correlation is:

r = [N ΣXY − (ΣX)(ΣY)] / √{[N ΣX² − (ΣX)²][N ΣY² − (ΣY)²]}

r = [10(2386) − (130)(170)] / √{[10(1850) − 130²][10(3110) − 170²]}
  = (23860 − 22100) / √[(18500 − 16900)(31100 − 28900)]
  = 1760 / √(1600 × 2200)
  = 1760 / 1876 ≈ 0.94
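The raw-score computation can be sketched in Python, using the X and Y scores from the worked table (the function name is ours):

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation via the raw-score formula:
    r = [N*SumXY - SumX*SumY] / sqrt([N*SumX2 - (SumX)^2][N*SumY2 - (SumY)^2])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy
    denominator = ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
    return numerator / denominator

x = [20, 18, 16, 14, 14, 12, 11, 10, 8, 7]
y = [24, 21, 23, 20, 18, 14, 16, 12, 10, 12]
print(round(pearson_r(x, y), 2))  # 0.94
```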
Cont…
2. Spearman Rank Difference Correlation
 Spearman correlation is often used to evaluate relationships
involving ordinal variables.
 It is easier to compute with a small number of cases than the
Pearson product-moment correlation coefficient.
Cont…
rs = 1 − (6Σd²) / (n(n² − 1))
• Where
– Σ = Sum of
– d = Difference in rank
– n = Number in group
Cont…
Students   English (mark)   Maths (mark)   Rank (English)   Rank (Maths)   d    d²
1          56               66
2          75               70
3          45               40
4          71               60
5          62               65
6          64               56
7          58               59
8          80               77
9          76               67
10         61               63
Cont…
Students   Maths (mark)   Rank (English)   Rank (Maths)   d    d²
1          66             9                4              5    25
2          70             3                2              1    1
3          40             10               10             0    0
4          60             4                7              3    9
5          65             6                5              1    1
6          56             5                9              4    16
7          59             8                8              0    0
8          77             1                1              0    0
9          67             2                3              1    1
10         63             7                6              1    1
                                                Σd² = 54
Where d = difference between ranks and d² = difference squared.
We then calculate 6Σd² = 6 × 54 = 324 and n(n² − 1) = 10(10² − 1) = 990.
Substituting these into the main equation:
rs = 1 − 324/990 ≈ 1 − 0.33 = 0.67
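The ranking and rank-difference steps can be sketched in Python, using the English and Maths marks from the table (function names are ours; the simple ranking assumes no tied scores, as in this data):

```python
def ranks(values):
    """Rank scores from highest (rank 1) to lowest; assumes no tied scores."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(xs, ys):
    """Spearman rank correlation: rs = 1 - 6*Sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]
print(round(spearman_rs(english, maths), 2))  # Sum(d^2) = 54 -> 1 - 324/990, about 0.67
```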
Interpretation of correlation coefficient
Chapter Four
Reliability and Validity
Reliability
Reliability is the ability of a test to provide a consistent score
on repeated measurement.
A reliable test is one that yields consistent scores when a person
takes two alternate forms of the test or when an individual takes the
same test on two or more different occasions.
Reliability can be considered as the degree to which test scores are
free from errors of measurement.
Factors that affect reliability
Error of measurement
Lack of consistency in scoring, directions, and use of the answer sheet
affects reliability.
The overall test should be structured to minimize error of
measurement.
Test length
 The longer the test, the greater the reliability, other factors held
constant.

Saturday, September 17, 2022
 Item difficulty
 Items with moderate difficulty enhance reliability.

 Sample heterogeneity: the more heterogeneous the group of examinees, the
higher the reliability.

 Guessing: As the probability of correctly guessing answers increases, the
reliability coefficient decreases.
 Reliability is a necessary, but not a sufficient condition for validity.
 General guideline for interpreting reliability coefficients
Reliability coefficient value   Interpretation
.90 and up                      excellent
.80–.89                         good
.70–.79                         adequate
Below .70                       may have limited applicability
Validity
Refers to a test's accuracy. A test is valid when it measures what it
is intended to measure.
Types of validity
Content validity
 Does the test emphasize what you have taught?
A test has content validity to the extent that it adequately samples
the content or behavior domain that it is designed to measure.
Representative sampling means selecting items in proportion
to their emphasis or importance.
Criterion-related validity
 Test score is related to some criterion.
 Predictive-future performance-job or academic success.

Construct validity
 Relates to whether the test is an adequate measure of the underlying
construct.
Face validity
 Refers simply to whether or not a test "looks like" it measures what it is
intended to measure.
Factors influencing validity
 Unclear direction
 Too difficult vocabulary and sentence structure
 Inappropriate level of difficulty of the test items
 Poorly constructed test items
 Ambiguity
 Too short test
 Identifiable pattern of answers
 Factors in test administration and scoring
CHAPTER Five
Item Analysis
 Item analysis is the process of “testing the item” to determine
specifically whether the item is functioning properly in
measuring what the entire test is measuring.
 Item analysis begins after the test has been administered and
scored.
 It involves detailed and systematic examination of the testees’
responses to each item to determine the
 difficulty level,
 discriminating power of the item, and
 effectiveness of each option
General Purpose for Item Analysis
To select the best available items for future use and keep them
in the item bank.
To find out structural or content defects in the items.
To detect learning difficulties of the class as a whole.
To identify individual students' areas of weakness in
need of remediation.
The Process of Item Analysis
The item analysis procedures
(example of 40 test takers)
Arrange the 40 test papers by ranking them in order from the highest to the
lowest score.
Select the best 10 papers (upper 25% of 40 testees) with the highest total
scores and the least 10 papers (lower 25% of 40 testees) with the lowest total
scores.
Drop the middle 20 papers (the remaining 50% of the 40 testees) because
they will no longer be needed in the analysis.
Draw a table in readiness for the tallying of responses for item analysis.
For each of the 10 test items, tabulate the number of testees in the upper and
lower groups who got the answer right or who selected each alternative (for
multiple choice items).
 Compute the difficulty of each item (percentage of testees who got the
item right).
Compute the discriminating power of each item (difference between
the number of testees in the upper and lower groups who got the item
right).
 Evaluate the effectiveness of the distracters in each item
(attractiveness of the incorrect alternatives, for multiple-choice test items).
Computing Item Difficulty
•The difficulty of an item may be defined as the proportion of the
examinees that marked the item correctly.
•The difficulty index P for each of the items is obtained by using the
formula:
Item Difficulty (P) = [No of testees who got item right (T) / Total No of testees responding to item (N)] × 100

i.e. P = (T / N) × 100

Thus for item 1 in table 3.1,
P = (14 / 20) × 100 = 0.7 × 100 = 70%
 The difficulty level of items should not be “too easy” or “too difficult”
Item difficulty interpretation
P-Value               Percent Range   Interpretation
≥ 0.75                75–100          Easy
≤ 0.25                0–25            Difficult
between .25 and .75   26–74           Average
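The difficulty index and its interpretation bands can be sketched in Python (function names are ours; the bands follow the table above, with P expressed as a percent):

```python
def item_difficulty(num_correct, num_responding):
    """Percentage of testees who got the item right: P = (T / N) * 100."""
    return num_correct / num_responding * 100

def interpret_difficulty(p):
    """Interpretation bands from the table above (P as a percent)."""
    if p >= 75:
        return "Easy"
    if p <= 25:
        return "Difficult"
    return "Average"

p = item_difficulty(14, 20)  # item 1 from table 3.1
print(p, interpret_difficulty(p))  # 70.0 Average
```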
Computing Item Discriminating Power (D)
Item discrimination power is an index which indicates how
well an item is able to distinguish between the high achievers
and low achievers given what the test is measuring.

A test item is regarded as having positive discriminating power if
the examinees who rank higher in ability answer it correctly
more often than those who rank lower in ability do.
 The higher the discriminating index, the better is an item in
differentiating between high and low achievers.
 Usually, if item discriminating power is a:
 Positive value when a larger proportion of those in the high
scoring group get the item right compared to those in the low
scoring group.
 Negative value when more testees in the lower group than in the
upper group get the item right.
 Zero value when an equal number of testees in both groups get
the item right; and
 1.00 when all testees in the upper group get the item right and all
the testees in the lower group get the item wrong.
 -1.00 when all testees in the upper group get the item wrong and
all the testees in the lower group get the item right.
It is obtained from this formula:

Item Dp (D) = [No of high scorers who got item right (H) − No of low scorers who got item right (L)] / Total No of testees in upper group (n)

That is,
D = (H − L) / n

Hence, for item 1 in table 3.1 the item discriminating power D is
obtained:
D = (H − L) / n = (10 − 4) / 10 = 6/10 = 0.60
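A minimal Python sketch of the discrimination index (the function name is ours; a negative result would indicate a badly functioning item, as described above):

```python
def discrimination_power(high_correct, low_correct, group_size):
    """D = (H - L) / n, where n is the number of testees in the upper group."""
    return (high_correct - low_correct) / group_size

d = discrimination_power(10, 4, 10)  # item 1 from table 3.1
print(d)  # 0.6
```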
Evaluating the Effectiveness of Distracters
The distraction power of a distractor is its ability to differentiate
between those who do not know and those who know what the item
is measuring.
A good distracter attracts more testees from the lower group
than the upper group.

Formula:

Option Distracter Power (Do) = [No of low scorers who marked option (L) − No of high scorers who marked option (H)] / Total No of testees in upper group (n)

That is,
Do = (L − H) / n
For item 1 in table 3.1, the effectiveness of the distracters is:
For option A: Do = (L − H) / n = (2 − 0) / 10 = 0.20
           B: the correct option, starred (*)
           C: Do = (1 − 0) / 10 = 0.10
           D: Do = (3 − 0) / 10 = 0.30
           E: Do = (0 − 0) / 10 = 0.00
Options with positive distraction power are good distracters, while those with
negative values must be changed or revised, and those with zero should be
improved because they are not good (they failed to distract the
low achievers).
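The distracter evaluation can be sketched in Python, using the option counts from item 1 of table 3.1 (the function name and data layout are ours):

```python
def distracter_power(low_marked, high_marked, group_size):
    """Do = (L - H) / n: positive when the option pulls more low scorers than high scorers."""
    return (low_marked - high_marked) / group_size

# Item 1 from table 3.1 (option B is the keyed answer, so it is skipped).
for option, low, high in [("A", 2, 0), ("C", 1, 0), ("D", 3, 0), ("E", 0, 0)]:
    print(option, distracter_power(low, high, 10))
```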
Item Banking
 Items and tasks are recorded in a file as they are
constructed.
 Such a file is especially valuable in areas of complex
achievement, where the construction of test items and
assessment tasks is difficult and time consuming.
Thank You
For Your
Attention!