Chapter 6
ITEM ANALYSIS AND VALIDATION

LEARNING OUTCOMES
• Explain the meaning of item analysis, item validity, reliability, item difficulty, and discrimination index
• Determine the validity and reliability of given test items
INTRODUCTION
The teacher normally prepares a draft of the test. Such a draft
is subjected to item analysis and validation in order to ensure that
the final version of the test will be useful and functional. First,
the teacher tries out the draft test on a group of students with
characteristics similar to those of the intended test takers (try-out
phase). From the try-out group, each item will be analyzed in terms
of its ability to discriminate between those who know and those who
do not know, and also in terms of its level of difficulty (item analysis
phase). The item analysis will provide information that will allow the
teacher to decide whether to revise or replace an item (item revision
phase). Finally, the final draft of the test is subjected to validation
if the intent is to make use of the test as a standard test for the
particular unit or grading period. We shall be concerned with these
concepts in this chapter.
6.1. Item Analysis
There are two important characteristics of an item that will
be of interest to the teacher. These are: (a) item difficulty, and
(b) discrimination index. We shall learn how to measure these
characteristics and apply our knowledge in making a decision about
the item in question.
The difficulty of an item, or item difficulty, is defined as the
number of students who are able to answer the item correctly
divided by the total number of students. Thus:

    item difficulty = (number of students with correct answer) / (total number of students)
The item difficulty is usually expressed as a percentage.
Example: What is the item difficulty index of an item if
25 students are unable to answer it correctly while 75
answered it correctly?
Here, the total number of students is 100, hence, the item
difficulty index is 75/100 or 75%.
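For teachers who tally scores electronically, the computation can be sketched in a few lines of Python. This is a minimal illustration of the formula above; the function name item_difficulty is ours, not a standard one.

    def item_difficulty(num_correct, total_students):
        """Item difficulty: proportion of students answering the item correctly."""
        return num_correct / total_students

    # Worked example from the text: 75 of 100 students answered correctly.
    print(item_difficulty(75, 100))  # 0.75, i.e. 75%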
One problem with this type of difficulty index is that it may not
actually indicate that the item is difficult (or easy). A student who
does not know the subject matter will naturally be unable to answer
the item correctly even if the question is easy. How do we decide on
the basis of this index whether the item is too difficult or too easy?
The following arbitrary rule is often used in the literature:
    Range of Difficulty Index    Interpretation      Action
    0 - 0.25                     Difficult           Revise or discard
    0.26 - 0.75                  Right difficulty    Retain
    0.76 and above               Easy                Revise or discard
Difficult items tend to discriminate between those who know and
those who do not know the answer. Conversely, easy items cannot
discriminate between these two groups of students. We are therefore
interested in deriving a measure that will tell us whether an item can
discriminate between these two groups of students. Such a measure
is called an index of discrimination.
An easy way to derive such a measure is to measure how
difficult an item is with respect to those in the upper 25% of the
class and how difficult it is with respect to those in the lower 25% of
the class. If the upper 25% of the class found the item easy yet the
lower 25% found it difficult, then the item can discriminate properly
between these two groups. Thus:

    Index of discrimination = DU - DL

Example: Obtain the index of discrimination of an item if the
upper 25% of the class had a difficulty index of 0.60 (i.e. 60% of the
upper 25% got the correct answer) while the lower 25% of the class
had a difficulty index of 0.20.

Here, DU = 0.60 while DL = 0.20, thus the index of discrimination
= 0.60 - 0.20 = 0.40.
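The same computation can be sketched in Python, continuing the illustrative style above. The function name and the group sizes of 20 are our assumptions for the example.

    def discrimination_index(upper_correct, upper_size, lower_correct, lower_size):
        """Index of discrimination: DU - DL, where DU and DL are the
        difficulty indices of the upper and lower 25% groups."""
        du = upper_correct / upper_size
        dl = lower_correct / lower_size
        return du - dl

    # Worked example from the text: DU = 0.60, DL = 0.20.
    print(discrimination_index(12, 20, 4, 20))  # 0.40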
Theoretically, the index of discrimination can range from -1.0 (when
DU = 0 and DL = 1) to 1.0 (when DU = 1 and DL = 0). When the
index of discrimination is equal to -1, then this means that all of
the lower 25% of the students got the correct answer while all of
the upper 25% got the wrong answer. In a sense, such an index
discriminates correctly between the two groups but the item itself
is highly questionable. Why should the bright ones get the wrong
answer and the poor ones get the right answer? On the other hand,
if the index of discrimination is 1.0, then this means that all of the
lower 25% failed to get the correct answer while all of the upper
25% got the correct answer. This is a perfectly discriminating item
and is the ideal item that should be included in the test. From these
discussions, let us agree to discard or revise all items that have a
negative discrimination index, for although they discriminate correctly
between the upper and lower 25% of the class, the content of the item
itself may be highly dubious. As in the case of the index of difficulty,
we have the following rule of thumb:
    Index Range      Interpretation                               Action
    -1.0 - -0.50     Can discriminate but item is questionable    Discard
    -0.55 - 0.45     Non-discriminating                           Revise
    0.46 - 1.0       Discriminating item                          Include
Example: Consider a multiple choice type of test for which the
following data were obtained:

    Item    Option        A     B*    C     D
    1       Total         0     40    20    20
            Upper 25%     0     15     5     0
            Lower 25%     0      5    10     5

The correct response is B. Let us compute the difficulty index and
the index of discrimination:

Difficulty index = no. of students getting correct response / total
                 = 40/100 = 40%, within the range of a "good item"
The discrimination index can similarly be computed:

DU = no. of students in upper 25% with correct response / no. of students in the upper 25%
   = 15/20 = .75 or 75%

DL = no. of students in lower 25% with correct response / no. of students in the lower 25%
   = 5/20 = .25 or 25%

Discrimination index = DU - DL = .75 - .25 = .50 or 50%

Thus, the item also has a "good discriminating power".
It is also instructive to note that the distracter A is not an
effective distracter since this was never selected by the students.
Distracters C and D appear to have good appeal as distracters.
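Putting the two indices together with a simple distracter check, the whole analysis of this example can be sketched in Python. The option tallies below mirror the reconstructed table above and are illustrative only.

    # Option tallies for the item (B is keyed correct); counts are illustrative.
    total = {"A": 0, "B": 40, "C": 20, "D": 20}
    upper = {"A": 0, "B": 15, "C": 5, "D": 0}   # upper 25% (20 students)
    lower = {"A": 0, "B": 5, "C": 10, "D": 5}   # lower 25% (20 students)
    key = "B"
    n_examinees = 100

    difficulty = total[key] / n_examinees                  # 40/100 = 0.40
    du = upper[key] / sum(upper.values())                  # 15/20 = 0.75
    dl = lower[key] / sum(lower.values())                  # 5/20  = 0.25
    print(f"difficulty = {difficulty:.2f}, discrimination = {du - dl:.2f}")

    # A distracter chosen by no one (like A here) is not doing its job.
    ineffective = [opt for opt, n in total.items() if opt != key and n == 0]
    print("ineffective distracters:", ineffective)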
Basic Item Analysis Statistics
The Michigan State University Measurement and Evaluation
Department reports a number of item statistics which aid in
evaluating the effectiveness of an item. The first of these is the index
of difficulty, which MSU (http://www.msu.edu/dept/) defines as the
proportion of the total group who got the item wrong. Thus a high
index indicates a difficult item and a low index indicates an easy
item. Some item analysts prefer an index of difficulty which is the
proportion of the total group who got an item right. This index may
be obtained by marking the PROPORTION RIGHT option on the
item analysis header sheet. Whichever index is selected is shown
on the item analysis report. For classroom achievement tests, most
test constructors desire items with indices of difficulty no lower than
20 nor higher than 80, with an average index of difficulty from 30
or 40 to a maximum of 60.
The INDEX OF DISCRIMINATION is the difference between
the proportion of the upper group who got an item right and the
proportion of the lower group who got the item right. This index is
dependent upon the difficulty of an item. It may reach a maximum
value of 100 for an item with an index of difficulty of 50, that
is, when 100% of the upper group and none of the lower group
answer the item correctly. For items of less than or greater than
50 difficulty, the index of discrimination has a maximum value of
less than 100.
More Sophisticated Discrimination Index
Item discrimination refers to the ability of an item to differentiate
among students on the basis of how well they know the material
being tested. Various hand calculation procedures have traditionally
been used to compare item responses to total test scores using
high and low scoring groups of students. Computerized analyses
provide a more accurate assessment of the discrimination power of
items because they take into account the responses of all students,
not just high and low scoring groups.
The item discrimination index provided by ScorePak® is a
Pearson Product Moment correlation between student responses
to a particular item and total scores on all other items on the test.
This index is the equivalent of a point-biserial coefficient in this
application. It provides an estimate of the degree to which an
individual item is measuring the same thing as the rest of the items.

Because the discrimination index reflects the degree to which
an item and the test as a whole are measuring a unitary ability or
attribute, values of the coefficient will tend to be lower for tests
measuring a wide range of content areas than for more homogeneous
tests. Item discrimination indices should always be interpreted in the
context of the type of test which is being analyzed. Items with low
discrimination indices are often ambiguously worded and should
be examined. Items with negative indices should be examined to
determine why a negative value was obtained. For example, a
negative value may indicate that the item was mis-keyed, so that
students who knew the material tended to choose an unkeyed, but
correct, response option.
Tests with high internal consistency consist of items with mostly
positive relationships with total test scores. In practice, values of the
discrimination index will seldom exceed .50 because of the differing
shapes of item and total score distributions. ScorePak® classifies
item discrimination as "good" if the index is above .30; "fair" if it is
between .10 and .30; and "poor" if it is below .10.
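A rough equivalent of this item-total correlation can be computed with the Python standard library (statistics.correlation requires Python 3.10 or later). This is our sketch of the general technique described above, not ScorePak®'s actual code, and the data are invented.

    import statistics

    def item_total_correlation(item_scores, total_scores):
        """Pearson correlation between responses to one item (scored 0/1)
        and total scores on all OTHER items, as described above."""
        rest_scores = [t - i for i, t in zip(item_scores, total_scores)]
        return statistics.correlation(item_scores, rest_scores)

    # Illustrative data: 8 students, one item, hypothetical test totals.
    item = [1, 1, 0, 1, 0, 0, 1, 0]
    totals = [9, 8, 4, 7, 5, 3, 8, 2]
    print(round(item_total_correlation(item, totals), 2))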
A good item is one that has good discriminating ability and has
a sufficient level of difficulty (not too difficult nor too easy). In the
two tables presented for the levels of difficulty and discrimination,
there is a little area of intersection where the two indices coincide
(between 0.56 and 0.67), which represents the good items in a test.
(Source: Office of Educational Assessment, University of Washington,
http://www.washington.edu/oea/services/scanning_scoring/scoring/
item_analysis.html)
At the end of the Item Analysis report, test items are listed
according to their degrees of difficulty (easy, medium, hard) and
discrimination (good, fair, poor). These distributions provide a quick
overview of the test, and can be used to identify items which are not
performing well and which can perhaps be improved or discarded.
SUMMARY

The item-analysis procedure for norm-referenced tests provides the following information:
1. The difficulty of the item
2. The discriminating power of the item
3. The effectiveness of each alternative
Benefits derived from Item Analysis
1. It provides useful information for class discussion of the test.
2. It provides data which help students improve their learning.
3. It provides insights and skills that lead to the preparation of better tests
in the future.
Index of Difficulty

    P = (Ru + RL) / T x 100

Where:
    P  - percentage who answered the item correctly (index of difficulty)
    Ru - the number in the upper group who answered the item correctly
    RL - the number in the lower group who answered the item correctly
    T  - the total number who tried the item

Example: with Ru = 6, RL = 2, and T = 20,

    P = 8/20 x 100 = 40%

The smaller the percentage figure, the more difficult the item.

Index of Item Discriminating Power

Estimate the item discriminating power using the formula below:

    D = (Ru - RL) / (T/2) = (6 - 2) / 10 = 0.40

The discriminating power of an item is reported as a decimal
fraction; maximum discriminating power is indicated by an index of 1.00.
Maximum discrimination is usually found at the 50 percent level of
difficulty.

As a rule of thumb for the difficulty index (expressed as a proportion):

    0.00 - 0.20 = Very difficult
    0.21 - 0.80 = Moderately difficult
    0.81 - 1.00 = Very easy
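The summary formulas translate directly into Python. Here is a minimal sketch using the worked numbers above (Ru = 6, RL = 2, T = 20); the function names are ours.

    def index_of_difficulty(ru, rl, t):
        """P = (Ru + RL) / T x 100."""
        return (ru + rl) / t * 100

    def discriminating_power(ru, rl, t):
        """D = (Ru - RL) / (T/2)."""
        return (ru - rl) / (t / 2)

    print(index_of_difficulty(6, 2, 20))   # 40.0 (percent)
    print(discriminating_power(6, 2, 20))  # 0.4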
6.2. Validation
After performing the item analysis and identifying the items
which need revision, the next step is to validate the instrument.
The purpose of validation is to determine the characteristics
of the whole test itself, namely, the validity and reliability of
the test. Validation is the process of collecting and analyzing
evidence to support the meaningfulness and usefulness of the
test.
Validity. Validity is the extent to which a test measures
what it purports to measure, or as referring to the appropriateness,
correctness, meaningfulness and usefulness of the specific
decisions a teacher makes based on the test results. These two
definitions of validity differ in the sense that the first definition
refers to the test itself while the second refers to the decisions
made by the teacher based on the test. A test is valid when it is
aligned to the learning outcome.
A teacher who conducts test validation might want to
gather different kinds of evidence. There are essentially three
main types of evidence that may be collected: content-related
evidence of validity, criterion-related evidence of validity and
construct-related evidence of validity. Content-related evidence of
validity refers to the content and format of the instrument. How
appropriate is the content? How comprehensive? Does it logically
get at the intended variable? How adequately does the sample of
items or questions represent the content to be assessed?
Criterion-related evidence of validity refers to the
relationship between scores obtained using the instrument and
scores obtained using one or more other tests (often called the
criterion). How strong is this relationship? How well do such
scores estimate present performance or predict future performance
of a certain type?
In gathering content-related evidence, the teacher typically
asks experts to review the test: they check each item that does not
measure any of the stated objectives, and each objective that is not
measured by any item. The teacher then rewrites any item so
checked and resubmits it to the experts, and/or writes new items to
cover those objectives not heretofore covered by the existing test.
This continues until the experts approve of all items and also until
the experts agree that all of the objectives are sufficiently covered
by the test.
In order to obtain evidence of criterion-related validity, the
teacher usually compares scores on the test in question with the
scores on some other independent criterion test which presumably
already has high validity. For example, if a test is designed to
measure mathematics ability of students and it correlates highly
with a standardized mathematics achievement test (external
criterion), then we say we have high criterion-related evidence
of validity. In particular, this type of criterion-related validity is
called its concurrent validity. Another type of criterion-related
validity is called predictive validity wherein the test scores in
the instrument are correlated with scores on a later performance
(criterion measure) of the students. For example, the mathematics
ability test constructed by the teacher may be correlated
with their later performance in a division-wide mathematics
achievement test.
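In practice, the concurrent validity coefficient is simply a Pearson correlation between the two sets of scores. A minimal Python sketch, with scores invented purely for illustration:

    import statistics

    # Hypothetical paired scores: teacher-made test vs. standardized criterion.
    teacher_test = [35, 42, 28, 50, 39, 44, 31, 47]
    criterion = [60, 70, 52, 88, 66, 75, 55, 80]

    r = statistics.correlation(teacher_test, criterion)
    print(f"concurrent validity coefficient: r = {r:.2f}")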
Apart from the use of the correlation coefficient in measuring
criterion-related validity, Gronlund suggested using the so-called
expectancy table. This table is easy to construct and consists of
the test (predictor) categories listed on the left hand side and
the criterion categories listed horizontally along the top of the
chart. For example, suppose that a mathematics achievement test
is constructed and the scores are categorized as high, average,
and low. The criterion measure used is the final average grades
of the students in high school: Very Good, Good, and Needs
Improvement. The two-way table lists down the number of
students falling under each of the possible pairs of (test, grade),
as shown below:

                    Grade Point Average
    Test Score    Very Good    Good    Needs Improvement
    High              20         10            5
    Average           10         25            5
    Low                5         10           14
The expectancy table shows that there were 20 students getting
high test scores who were subsequently rated Very Good in terms of
their final grades; 25 students got average scores and were
subsequently rated Good in their finals; and finally, 14 students
obtained low test scores and were later graded as Needs Improvement.
The evidence for this particular test tends to indicate that students
getting high scores on it would be graded Very Good; students getting
average scores on it would be rated Good later; and students getting
low scores on the test would be graded as Needs Improvement later.
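An expectancy table is just a cross-tabulation of (test category, grade category) pairs, so it can be built without special software. A minimal Python sketch, with hypothetical per-student records chosen to match the counts above:

    from collections import Counter

    # One (test category, grade category) pair per student; counts mirror the table.
    records = (
        [("High", "Very Good")] * 20 + [("High", "Good")] * 10 + [("High", "Needs Improvement")] * 5
        + [("Average", "Very Good")] * 10 + [("Average", "Good")] * 25 + [("Average", "Needs Improvement")] * 5
        + [("Low", "Very Good")] * 5 + [("Low", "Good")] * 10 + [("Low", "Needs Improvement")] * 14
    )

    counts = Counter(records)
    rows = ["High", "Average", "Low"]
    cols = ["Very Good", "Good", "Needs Improvement"]
    print(f"{'Test Score':<12}" + "".join(f"{c:>20}" for c in cols))
    for row in rows:
        print(f"{row:<12}" + "".join(f"{counts[(row, c)]:>20}" for c in cols))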
We will not be able to discuss the measurement of construct-
related validity in this book since the methods to be used require
sophisticated statistical techniques falling in the category of factor
analysis.
6.3. Reliability
Reliability refers to the consistency of the scores obtained — how
consistent they are for each individual from one administration of
an instrument to another and from one set of items to another. We
already gave the formula for computing the reliability of a test: for
internal consistency; for instance, we could use the split-half method
or the Kuder-Richardson formulae (KR-20 or KR-21)
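As an illustration of the internal-consistency approach, here is a minimal KR-20 sketch in plain Python (items scored 0/1; the data are invented for the example):

    def kr20(item_matrix):
        """Kuder-Richardson formula 20 for dichotomously scored items.
        item_matrix[s][i] is 1 if student s answered item i correctly, else 0."""
        n_items = len(item_matrix[0])
        n_students = len(item_matrix)
        # Proportion passing each item (p); q = 1 - p.
        p = [sum(row[i] for row in item_matrix) / n_students for i in range(n_items)]
        sum_pq = sum(pi * (1 - pi) for pi in p)
        # Sample variance of the total scores.
        totals = [sum(row) for row in item_matrix]
        mean = sum(totals) / n_students
        var = sum((t - mean) ** 2 for t in totals) / (n_students - 1)
        return (n_items / (n_items - 1)) * (1 - sum_pq / var)

    scores = [
        [1, 1, 1, 0, 1],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0],
    ]
    print(round(kr20(scores), 2))  # about 0.77 for this toy data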
Reliability and validity are related concepts. If an instrument
is unreliable, it cannot yield valid outcomes. As reliability improves,
validity may improve (or it may not). However, if an instrument is
shown scientifically to be valid, then it is almost certain that it is also
reliable.

The following table is a standard followed almost universally in
educational tests and measurement:

    Reliability      Interpretation
    .90 and above    Excellent reliability; at the level of the best
                     standardized tests.
    .80 - .90        Very good for a classroom test.
    .70 - .80        Good for a classroom test; in the range of most.
                     There are probably a few items which could be
                     improved.
    .60 - .70        Somewhat low. This test needs to be supplemented
                     by other measures (e.g., more tests) to determine
                     grades. There are probably some items which could
                     be improved.
    .50 - .60        Suggests need for revision of test, unless it is quite
                     short (ten or fewer items). The test definitely needs
                     to be supplemented by other measures (e.g., more
                     tests) for grading.
    .50 or below     Questionable reliability. This test should not
                     contribute heavily to the course grade, and it needs
                     revision.