ED 106 – ASSESSMENT IN LEARNING 1 AY 2022-2023

Chapter 6

ITEM ANALYSIS AND VALIDATION

LEARNING OUTCOMES
• Explain the meaning of item analysis, item validity, reliability, item
difficulty, and discrimination index
• Determine the validity and reliability of given test items

Introduction
The teacher normally prepares a draft of the test. Such a draft is subjected to item analysis and
validation to ensure that the final version of the test is useful and functional. First, the
teacher tries out the draft test on a group of students with characteristics similar to those of the intended
test takers (try-out phase). From the try-out group, each item is analyzed in terms of its ability to discriminate
between those who know and those who do not know, and also its level of difficulty (item analysis phase).
The item analysis provides information that allows the teacher to decide whether to revise or
replace an item (item revision phase). Finally, the final draft of the test is subjected to validation if
the intent is to use the test as a standard test for the particular unit or grading period. We shall be
concerned with these concepts in this chapter.

6.1. Item Analysis

There are two important characteristics of an item that will be of interest to the teacher. These
are: a) item difficulty, and b) discrimination index. We shall learn how to measure these characteristics and
apply our knowledge in making a decision about the item in question.

The difficulty of an item or item difficulty is defined as the number of students who are able to
answer the item correctly divided by the total number of students. Thus:

Item difficulty = number of students with correct answer/ total number of students

The item difficulty is usually expressed in percentage.

Example: What is the item difficulty index of an item if 25 students are unable to
answer it correctly while 75 answered it correctly?

Here, the total number of students is 100; hence, the item difficulty index is 75/100 or 75%.

One problem with this type of difficulty index is that it may not actually indicate that the item is
difficult (or easy). A student who does not know the subject matter will naturally be unable to answer the
item correctly even if the question is easy. How do we decide on the basis of this index whether the item is
too difficult or too easy? The following arbitrary rule is often used in the literature:

Range of Difficulty Index    Interpretation      Action
0 - 0.25                     Difficult           Revise or discard
0.26 - 0.75                  Right difficulty    Retain
0.76 - above                 Easy                Revise or discard

gemmafagustin 1
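The difficulty index and the rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the treatment of values that fall between the tabulated bands (for example, between 0.25 and 0.26) is an assumption, since the table does not say how to classify them.

```python
from typing import Tuple


def item_difficulty(num_correct: int, num_students: int) -> float:
    """Proportion of students who answered the item correctly."""
    if num_students <= 0:
        raise ValueError("num_students must be positive")
    return num_correct / num_students


def interpret_difficulty(p: float) -> Tuple[str, str]:
    """Return (interpretation, action) following the rule-of-thumb table.

    Assumption: values between two bands (e.g. 0.255) are assigned
    to the higher band, because the table leaves them unspecified.
    """
    if p <= 0.25:
        return ("Difficult", "Revise or discard")
    if p <= 0.75:
        return ("Right difficulty", "Retain")
    return ("Easy", "Revise or discard")


# Worked example from the text: 75 of 100 students answered correctly.
p = item_difficulty(75, 100)
print(p, interpret_difficulty(p))
```

With the numbers from the worked example, the index of 0.75 sits at the upper edge of the "Right difficulty" band.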

Difficult items tend to discriminate between those who know and those who do not know the
answer. Conversely, easy items cannot discriminate between these two groups of students. We are
therefore interested in deriving a measure that will tell us whether an item can discriminate between these
two groups of students. Such a measure is called an index of discrimination.

An easy way to derive such a measure is to measure how difficult an item is with respect to
those in the upper 25% of the class and how difficult it is with respect to those in the lower 25% of the
class. If the upper 25% of the class found the item easy yet the lower 25% found it difficult, then the item
can discriminate properly between these two groups. Thus:

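Index of discrimination = DU - DL

where DU is the difficulty index of the item for the upper 25% of the class and DL is its difficulty
index for the lower 25% of the class.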

Example: Obtain the index of discrimination of an item if the upper 25% of the class had a
difficulty index of 0.60 (i.e., 60% of the upper 25% got the correct answer) while the lower 25% of the
class had a difficulty index of 0.20.

Here, DU = 0.60 while DL = 0.20, thus index of discrimination = .60 - .20 = .40.

Theoretically, the index of discrimination can range from -1.0 (when DU = 0 and DL = 1) to 1.0 (when DU = 1 and DL = 0). When the index
of discrimination is equal to -1, then this means that all of the lower 25% of the students got the correct
answer while all of the upper 25% got the wrong answer. In a sense, such an index discriminates correctly
between the two groups but the item itself is highly questionable. Why should the bright ones get the
wrong answer and the poor ones get the right answer? On the other hand, if the index of discrimination is
1.0, then this means that all of the lower 25% failed to get the correct answer while all of the upper 25%
got the correct answer. This is a perfectly discriminating item and is the ideal item that should be included
in the test. From these discussions, let us agree to discard or revise all items that have negative
discrimination index for although they discriminate correctly between the upper and lower 25% of the
class, the content of the item itself may be highly dubious. As in the case of the index of difficulty, we
have the following rule of thumb:

Index Range       Interpretation                               Action
-1.0 to -.50      Can discriminate but item is questionable    Discard
-.55 to 0.45      Non-discriminating                           Revise
0.46 to 1.0       Discriminating item                          Include

Example: Consider a multiple choice type of test of which the following data were obtained:

Item 1        Options
              A     B*    C     D
Total         0     40    20    20
Upper 25%     0     15    5     0
Lower 25%     0     5     10    5

The correct response is B. Let us compute the difficulty index and index of discrimination:

Difficulty Index = no. of students getting correct response / total number of students

= 40/80 = 50%, within range of a "good item" (the counts in the Total row sum to 0 + 40 + 20 + 20 = 80 students)

The discrimination index can similarly be computed:

DU = no. of students in upper 25% with correct response / no. of students in the upper 25%
= 15/20 = .75 or 75%

DL = no. of students in lower 25% with correct response / no. of students in the lower 25%
= 5/20 = .25 or 25%

Discrimination Index = DU - DL = .75 - .25 = .50 or 50%

Thus, the item also has a "good discriminating power".

It is also instructive to note that distracter A is not an effective distracter since it was
never selected by the students. Distracters C and D appear to have good appeal as distracters.
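The computation in this example can be checked with a short Python sketch. The dictionaries simply restate the option counts from the data table, with B as the keyed option; the variable and function names are illustrative and not part of the original procedure.

```python
# Option counts from the worked example; "B" is the keyed (correct) option.
total = {"A": 0, "B": 40, "C": 20, "D": 20}
upper = {"A": 0, "B": 15, "C": 5, "D": 0}   # upper 25% of the class
lower = {"A": 0, "B": 5, "C": 10, "D": 5}   # lower 25% of the class
key = "B"


def proportion_correct(counts, key):
    """Share of a group that chose the keyed option."""
    group_size = sum(counts.values())
    return counts[key] / group_size


d_u = proportion_correct(upper, key)   # 15/20
d_l = proportion_correct(lower, key)   # 5/20
discrimination = d_u - d_l

# A distracter that nobody selected is not doing any work.
unused_distracters = [opt for opt, n in total.items() if opt != key and n == 0]

print(d_u, d_l, discrimination, unused_distracters)
```

The sketch reproduces DU = 0.75, DL = 0.25 and a discrimination index of 0.50, and it flags option A as the only distracter that was never chosen.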

Basic Item Analysis Statistics

The Michigan State University Measurement and Evaluation Department reports a number
of item statistics which aid in evaluating the effectiveness of an item. The first of these is the index of
difficulty which MSU defines as the proportion of the total group who got the item wrong. "Thus a
high index indicates a difficult item and a low index indicates an easy item. Some item analysts prefer
an index of difficulty which is the proportion of the total group who got an item right. This index may
be obtained by marking the PROPORTION RIGHT option on the item analysis header sheet.
Whichever index is selected is shown as the INDEX OF DIFFICULTY on the item analysis print-out.
For classroom achievement tests, most test constructors desire items with indices of difficulty no
lower than 20 nor higher than 80, with an average index of difficulty from 30 or 40 to a maximum of
60.

The INDEX OF DISCRIMINATION is the difference between the proportion of the upper
group who got an item right and the proportion of the lower group who got the item right. This index
is dependent upon the difficulty of an item. It may reach a maximum value of 100 for an item with an
index of difficulty of 50, that is, when 100% of the upper group and none of the lower group answer
the item correctly. For items of less than or greater than 50 difficulty, the index of discrimination has
a maximum value of less than 100. Interpreting the Index of Discrimination document contains a
more detailed discussion of the index of discrimination."

More Sophisticated Discrimination Index

Item discrimination refers to the ability of an item to differentiate among students on the
basis of how well they know the material being tested. Various hand calculation procedures have
traditionally been used to compare item responses to total test scores using high and low scoring
groups of students. Computerized analyses provide more accurate assessment of the discrimination
power of items because they take into account responses of all students rather than just high and low
scoring groups.

The item discrimination index provided by ScorePak® is a Pearson Product Moment
correlation between student responses to a particular item and total scores on all other items on the
test. This index is the equivalent of a point-biserial coefficient in this application. It provides an
estimate of the degree to which an individual item is measuring the same thing as the rest of the
items.
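A correlation of this kind can be sketched in plain Python. The code below is not ScorePak's implementation; it is a minimal illustration of correlating 0/1 item scores with the total score on all other items. The response matrix is hypothetical data made up for the example, and the function names are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def item_rest_correlation(responses, item):
    """Correlate one item (scored 0/1) with the total on all other items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


# Hypothetical responses: one row per student, one column per item (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

r = item_rest_correlation(responses, 0)
print(round(r, 2))
```

Excluding the item itself from the total avoids inflating the correlation, which matters most on short tests.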

Because the discrimination index reflects the degree to which an item and the test as a
whole are measuring a unitary ability or attribute, values of the coefficient will tend to be lower for
tests measuring a wide range of content areas than for more homogeneous tests. Item discrimination
indices must always be interpreted in the context of the type of test which is being analyzed. Items
with low discrimination indices are often ambiguously worded and should be examined. Items with
negative indices should be examined to determine why a negative value was obtained. For example, a
negative value may indicate that the item was miskeyed, so that students who knew the material
tended to choose an unkeyed, but correct, response option.

Tests with high internal consistency consist of items with mostly positive relationships
with total test score. In practice, values of the discrimination index will seldom exceed .50 because
of the differing shapes of item and total score distributions. ScorePak® classifies item discrimination as
"good" if the index is above .30; "fair" if it is between .10 and .30; and "poor" if it is below .10.

A good item is one that has good discriminating ability and a sufficient level of difficulty (neither
too difficult nor too easy). In the two tables presented for the levels of difficulty and discrimination, there is
a small area of intersection where the two indices coincide (between 0.56 and 0.67), which represents the
good items in a test.

At the end of the item analysis report, test items are listed according to their degrees of difficulty
(easy, medium, hard) and discrimination (good, fair, poor). These distributions provide a quick overview of
the test and can be used to identify items which are not performing well and which can perhaps be
improved or discarded.

SUMMARY

The item-analysis procedure for norm-referenced tests provides the following information:

1. The difficulty of the item
2. The discriminating power of the item
3. The effectiveness of each alternative

Benefits derived from Item Analysis

1. It provides useful information for class discussion of the test.
2. It provides data which help students improve their learning.
3. It provides insights and skills that lead to the preparation of better tests in the future.

Index of Difficulty

$$P = \frac{R}{T} \times 100$$

Where:
P - the percentage who answered the item correctly (index of difficulty)
R - the number who answered the item correctly
T - the total number who tried the item

Index of Item Discriminating Power

$$D = \frac{R_U - R_L}{\frac{1}{2}T}$$

Where:
R_U - the number in the upper group who answered the item correctly
R_L - the number in the lower group who answered the item correctly
T - the total number who tried the item

$$P = \frac{8}{20} \times 100 = 40\%$$

The smaller the percentage figure, the more difficult the item.

Estimate the item discriminating power using the formula below:

$$D = \frac{R_U - R_L}{\frac{1}{2}T} = \frac{6 - 2}{10} = 0.40$$
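The two summary formulas can be written as small Python functions. This is a sketch under one stated assumption: the number answering correctly, R, is taken to be the sum of the correct answers in the upper and lower groups, which matches the worked numbers (8 = 6 + 2 out of T = 20). The function names are hypothetical.

```python
def difficulty_percent(r_upper: int, r_lower: int, total_tried: int) -> float:
    """Index of difficulty P = R / T x 100, with R = R_U + R_L (assumption)."""
    if total_tried <= 0:
        raise ValueError("total_tried must be positive")
    return 100 * (r_upper + r_lower) / total_tried


def discriminating_power(r_upper: int, r_lower: int, total_tried: int) -> float:
    """Index of discriminating power D = (R_U - R_L) / (T / 2)."""
    if total_tried <= 0:
        raise ValueError("total_tried must be positive")
    return (r_upper - r_lower) / (total_tried / 2)


# Worked numbers from the text: R_U = 6, R_L = 2, T = 20.
p = difficulty_percent(6, 2, 20)
d = discriminating_power(6, 2, 20)
print(p, d)
```

Dividing by half of T in the second function reflects that each of the two groups contains half of the students who tried the item.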

The discriminating power of an item is reported as a decimal fraction;
maximum discriminating power is indicated by an index of 1.00.

Maximum discrimination is usually found at the 50 percent level of difficulty.

0.00 - 0.20 = Very difficult
0.21 - 0.80 = Moderately difficult
0.81 - 1.00 = Very easy

6.2. Validation

After performing the item analysis and revising the items which need revision, the next step is
to validate the instrument. The purpose of validation is to determine the characteristics of the whole test
itself, namely, the validity and reliability of the test. Validation is the process of collecting and analyzing
evidence to support the meaningfulness and usefulness of the test.

Validity. Validity is the extent to which a test measures what it purports to measure. It may also
be defined as the appropriateness, correctness, meaningfulness and usefulness of the specific decisions a
teacher makes based on the test results. These two definitions of validity differ in the sense that the first
definition refers to the test itself while the second refers to the decisions made by the teacher based on the
test. A test is valid when it is aligned to the learning outcome.

A teacher who conducts test validation might want to gather different kinds of evidence. There are
essentially three main types of evidence that may be collected: content-related evidence of validity,
criterion-related evidence of validity and construct-related evidence of validity. Content-related evidence
of validity refers to the content and format of the instrument. How appropriate is the content? How
comprehensive? Does it logically get at the intended variable? How adequately does a sample of items or
questions represent the content to be assessed?

Criterion-related evidence of validity refers to the relationship between scores obtained
using the instrument and the scores obtained using one or more other tests (often called the criterion). How
strong is this relationship? How well do such scores estimate present or predict future performance of a
certain type?

Construct-related evidence of validity refers to the nature of the psychological construct or characteristic being
measured by the test. How well does a measure of the construct explain differences in the behavior of the
individuals or their performance on a certain task?

The usual procedure for determining content validity may be described as follows: The teacher
writes out the objectives of the test based on the table of specifications and then gives these, together with the
test, to at least two (2) experts along with a description of the intended test takers. The experts look at the
objectives, read over the items in the test, and place a check mark in front of each question or item that they
feel does not measure one or more objectives. They also place a check mark in front of each objective not
assessed by any item in the test. The teacher then rewrites any item so checked and resubmits it to the experts,
and/or writes new items to cover those objectives not heretofore covered by the existing test. This
continues until the experts approve of all items and agree that all of the objectives are
sufficiently covered by the test.

In order to obtain evidence of criterion-related validity, the teacher usually compares scores on
the test in question with the scores on some other independent criterion test which presumably
already has high validity. For example, if a test is designed to measure the mathematics ability of students and it
correlates highly with a standardized mathematics achievement test (external criterion), then we say we have high
criterion-related evidence of validity. In particular, this type of criterion-related validity is called
concurrent validity. Another type of criterion-related validity is called predictive validity, wherein the test
scores in the instrument are correlated with scores on a later performance (criterion measure) of the students.
For example, the mathematics ability test constructed by the teacher may be correlated with the students' later
performance in a Division-wide mathematics achievement test.

Apart from the use of the correlation coefficient in measuring criterion-related validity, Gronlund
suggested using the so-called expectancy table. This table is easy to construct and consists of the test
(predictor) categories listed on the left-hand side and the criterion categories listed horizontally along the top
of the chart. For example, suppose that a mathematics achievement test is constructed and the scores are
categorized as high, average and low. The criterion measure used is the final average grades of the students
in high school: Very Good, Good, and Needs Improvement. The two-way table lists the number of
students falling under each of the possible pairs of (test, grade) as shown below:

                 Grade Point Average
Test score       Very Good    Good    Needs Improvement
High             20           10      5
Average          10           25      5
Low              1            10      14

The expectancy table shows that 20 students got high test scores and were
subsequently rated Very Good in terms of their final grades; 25 students got average scores and were subsequently
rated Good; and 14 students obtained low test scores and were later graded as Needs
Improvement. The evidence for this particular test tends to indicate that students getting high scores on it
would later be graded Very Good; students getting average scores would be rated Good; and students getting
low scores on the test would be graded as Needs Improvement.
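An expectancy table is often easier to read when each row is converted to percentages, so that a row shows how students at that test-score level were later distributed across the grade categories. The sketch below does this for the counts in the table above; the data structure and the function name are illustrative choices, not part of the original procedure.

```python
# Counts from the expectancy table: test-score level -> later grade category.
expectancy = {
    "high":    {"Very Good": 20, "Good": 10, "Needs Improvement": 5},
    "average": {"Very Good": 10, "Good": 25, "Needs Improvement": 5},
    "low":     {"Very Good": 1,  "Good": 10, "Needs Improvement": 14},
}


def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for level, counts in table.items():
        row_total = sum(counts.values())
        result[level] = {
            grade: round(100 * count / row_total, 1)
            for grade, count in counts.items()
        }
    return result


percentages = row_percentages(expectancy)
for level, row in percentages.items():
    print(level, row)
```

Read this way, the largest percentage in each row falls on the matching grade category, which is the pattern the paragraph above describes.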

We will not be able to discuss the measurement of construct-related validity in this book since
the methods used require sophisticated statistical techniques falling in the category of factor
analysis.

6.3. Reliability

Reliability refers to the consistency of the scores obtained: how consistent they are for each
individual from one administration of an instrument to another and from one set of items to another. We
already gave the formulas for computing the reliability of a test; for internal consistency, for instance, we
could use the split-half method or the Kuder-Richardson formulae (KR-20 or KR-21).
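As a reminder of how one of these internal-consistency estimates works, here is a minimal Python sketch of KR-20 for items scored 0 or 1. It uses the commonly stated form KR-20 = k/(k-1) x (1 - sum(p*q) / variance of total scores). The use of the population variance (dividing by n) is an assumption, since some texts divide by n - 1 instead, and the response matrix is hypothetical data made up for the example.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores.

    responses: one row per student, one column per item.
    Assumption: the variance of total scores is the population variance.
    """
    n = len(responses)
    k = len(responses[0])
    if k < 2:
        raise ValueError("KR-20 needs at least two items")

    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    variance = sum((t - mean_total) ** 2 for t in totals) / n
    if variance == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")

    # Sum of p*q over items, where p is the proportion correct and q = 1 - p.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / variance)


# Hypothetical responses: 5 students, 4 items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

reliability = kr20(responses)
print(round(reliability, 2))
```

A real classroom test would have many more students and items; the tiny matrix here only keeps the arithmetic easy to follow by hand.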

Reliability and validity are related concepts. If an instrument is unreliable, it cannot yield valid
outcomes. As reliability improves, validity may improve (or it may not). However, if an instrument is
shown scientifically to be valid, then it is almost certain that it is also reliable.

The following table is a standard followed almost universally in educational tests and
measurement.

Reliability      Interpretation
.90 and above    Excellent reliability; at the level of the best standardized tests.
.80-.90          Very good for a classroom test.
.70-.80          Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60-.70          Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50-.60          Suggests need for revision of the test, unless it is quite short (10 or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below     Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.

6.4. Exercises

A. Find the index of difficulty in each of the following situations:

1. N = 60, number of wrong answers: upper 25% = 2 lower 25% = 6

2. N = 80, number of wrong answers: upper 25% = 2 lower 25% = 9

3. N = 30, number of wrong answers: upper 25% = 1 lower 25% = 6

4. N = 50, number of wrong answers: upper 25% = 3 lower 25% = 8

5. N = 70, number of wrong answers: upper 25% = 4 lower 25% = 10

B. Which of the items in Exercise A are found to be most difficult?

C. Compute the discrimination index for each of the items in Exercise A.

D. Answer the following questions:

1. A teacher constructed a test which would measure the student’s ability to apply previous knowledge to
certain situations. In particular, the evidence that the student is able to apply previous knowledge is the ability to:
• Draw correct conclusions that are based on the information given;
• Identify one or more logical implications that follow from a given point of view;
• State whether two ideas are identical, just similar, unrelated or contradictory.

Write test items using the multiple choice type of test that would cover these concerns of the
teacher. Show your test to an expert and ask him or her to judge whether the items indeed cover these
concerns.

2. What is an expectancy table? Describe the process of constructing an expectancy table. When do we use
an expectancy table?

3. Enumerate the three types of validity evidence. Which of these types of validity is the most difficult to
measure? Why?

4. What is the relationship between validity and reliability? Can a test be reliable and yet not valid?
Illustrate.

5. Discuss the different measures of reliability. Justify the use of each measure in the context of
measuring reliability.
