04 Reliability and Validity
Abstract:
This module deals with defining and determining quality of test instruments and test items. Tests as an
instrument for evaluation need to be accurate, objective, practical and reliable. Further they should be able
to discriminate between good and bad performers and have a uniform difficulty level. This module explains
each of these terms and describes how they can be measured. The module specifically touches on six measures
of test quality: objectivity, practicability, reliability, validity, difficulty level and discrimination index. It
also talks about mathematical measures like mean, median, mode, standard deviation and correlation that
help in measuring test quality.
Objectives:
1. To enable the reader to define the quality of a test and measure it.
2. To understand the various measurements used in defining quality, like mean, median, mode, standard deviation and correlation.
Introduction:
A test needs to evaluate and measure the performance of a candidate, department or institution.
Measurement is purely quantitative; when an individual's judgment is added, it becomes evaluation.
A test should measure what it is intended to measure, with considerable accuracy, and at the same time it
should be able to discriminate between students of varied abilities.
Subjective judgment leads to inaccuracy and errors. Such errors constitute the standard error of measurement.
Hence, they need to be identified and eliminated.
There are several valid reasons for analyzing questions and tests that students have completed and that have
already been graded. Some of the reasons include the following:
Identify content that has not been adequately covered and should be re-taught,
Determine if any items need to be revised in the event they are to be used again or become part of
an item file or bank,
Identify items that may not have functioned as they were intended.
Validity and reliability are the overarching principles that govern test design. Validity is the extent to which
a test measures what it intends to measure. Reliability is the extent to which the test scores are consistent.
Reliability is a property of the test as a measuring instrument. Other measures like objectivity,
practicability, difficulty level and discrimination index are also some measures of test quality and have
been discussed in the subsequent sections.
Understanding Test Item and Test Quality
There are various forms of assessment techniques available to examiner. They range from assessing
students using a fixed-response multiple choice test or an open-response short answer, long answer or essay
type of exam. These exams serve a variety of purposes. The results may be used to assess a student's
strengths and weaknesses or to plan further instructional activity. They may be used for selection, placement
or for certification. They may be used as tools for appraisals. Regardless of the objective of assessment, all
assessments need to possess certain characteristics and need to have a certain degree of quality. A test is
said to be of good quality if it satisfies the following criteria (see note i):
1. Objectivity (justice): Objectivity is said to be ensured when the paper setter is given a design/
method to follow. Objectivity of the darts exercise would depend upon how well the task is
defined to the players. A test with good objectivity would define the number of attempts, the
distance from where to aim, etc.
For example, teachers at several levels of education assess students' overall learning by giving
them projects. Often, students are not told anything about the scope of the work. They are also
unaware of what distinguishes a good project from a bad project and how they would be graded.
It has often been observed that students' learning from a project is enhanced if the scope of the
project is clearly defined and the student is also told clearly about certain specific performance
characteristics arranged in levels, indicating the degree to which the standard has been met.
If a biology student is asked to maintain a journal on leaf collection, a test with good objectivity
for this project would look as follows:
Appearance/Neatness. Grade A: extremely neat, with cover page, leaves dried and neatly pasted. Grade D: untidy, no cover page and leaves not dried.
Organization. Grade A: well organized and categorized/catalogued. Grade B: organized and categorized/catalogued with some errors. Grade C: organized and categorized/catalogued with a lot of errors. Grade D: disorganized and no cataloguing.
Information and understanding. Grade D: such information is missing.
Objectivity needs to be maintained not only for the test but also for test items.
2. Practicability (usability): All test instruments should be easily usable and have simple and clear
instructions for administration of the instrument. For example, an online test may not be practical
in remote areas where internet connectivity is poor. A paper based test would probably be more
appropriate.
3. Reliability (dependability): A test instrument is said to be reliable if it produces the same result
every time. It is the consistency of measurement. A measure is considered reliable if a person's
score on the same test given twice is similar. The ability of a player to consistently hit around the
bull's eye is his measure of reliability.
There are several ways by which reliability is generally measured: Test-retest, alternate form, split
half, internal consistency (inter-item) and inter-rater.
a. Test/retest: This is the most conservative method to estimate reliability. In this method, the
scores from repeated tests of same participants, with the same test are compared. The test
instrument remains the same. A reliable test would produce very similar scores. Simply put,
the idea behind test/retest is that you should get the same score on test 1 as you do on test 2.
For example, IQ tests typically show high test-retest reliability.
The reliability of weighing scales in a physics experiment can be tested by recording the weight 3
to 4 times at intervals of a few minutes.
Test-retest reliability is a measure of stability.
b. Alternate form reliability: When participants are able to recall their previous responses, test-retest procedures fail. In such cases, alternate form reliability is used. As the name suggests, two
or more versions of the tests are constructed that are equivalent in content and difficulty.
For example, the marks in the pre-board test should be consistent with the marks in the board exam
if there is no change in the underlying conditions between the two.
Teachers also use this technique to create replacement exams for students who have for some
reason missed the main exam.
Alternate form reliability is a measure of equivalence.
c. Split half reliability: This method compares scores from different parts of the same test,
such as comparing the scores from even- vs. odd-numbered questions.
d. Internal consistency (inter-item) reliability: The extent to which the items within a test are
consistent with one another, i.e. measure the same underlying attribute.
e. Inter-rater reliability: Scorer reliability needs to be measured when observers use their
judgment for interpretation.
For example, when analyzing live or videotaped behavior and written answers to open-ended
essay type questions, different observers take measurements of the same responses. A high
degree of correlation between the scores given by different observers gives high inter-rater
reliability.
There are often more than two judges to judge the performance of gymnasts in a sporting
event.
There is also more than one teacher present during the viva-voce examination of a student.
A high correlation between the scores given by different judges to the gymnasts and teachers
to the students indicates a high inter-rater reliability.
4. Validity (accuracy): A test instrument should accurately measure what it is designed to test. It is
the strength of our conclusions. Most tests are designed to measure hypothetical constructs like
intelligence or learning, which the examiner needs to operationalize. A valid test will measure this
construct (learning) without being influenced by other factors (a student's motivation level). It
answers the examiner's question: "Was I right in giving the student this test/test item?" In the
example of playing darts, if the player is able to aim at the bull's eye correctly, his aim is valid.
So, he is valid in A and B in figure 1 (though he is less reliable in B). For example, of two test
items on the poem Daffodils, the first question tests the student's memory and not his/her
understanding of the poem.
Validity is also of different types:
a. Face Validity: The test looks to be a good one; it is what teachers and students think of the
test. Is it a reasonable way of assessing students? Is it too simple? Or is it too difficult?
Face validity is the consensus of experts (generally) that a measure represents a concept.
It is the least stringent type of validity.
b. Construct Validity: The extent to which a test actually measures the underlying construct
(e.g. learning, intelligence) it is intended to measure, rather than some other attribute.
c. Content Validity: Content validity is the property of a test such that the test items sample
the universe of items for which the test is designed. Content validity helps us understand
whether a sample of items truly represents the entire universe of items for a particular
topic.
For example, a teacher gives her students a list of 200 words and would like to know
whether they have learnt to spell them correctly. She may choose a sample of, say, 20
words for a small test. We would like to know how representative these 20 words are of
the entire list, so that we can generalize that a student who spells 80% of these 20 words
correctly would be able to spell 80% of the entire list correctly.
d. Criterion Validity: Criterion validity assesses whether a test reflects a set of abilities in a
current or a future setting, as measured by some other test. It is of two types: predictive
(future) and concurrent (present) validity.
Concurrent validity: the test gives similar results to existing tests that have
already been validated. For example, assume that the interview as a method has
already been validated as a good indicator of employee performance. A written
technical exam shall have high concurrent validity if it also gives similar results.
Predictive validity: the test predicts performance in a future setting. For example,
reading readiness test scores might be used to predict students' future achievement
in reading. In contrast, a test of dictionary skills used to estimate students' current
skill in the actual use of a dictionary illustrates concurrent validity.
Difference between Reliability & Validity: Assume that there are some individuals playing darts.
The success of their skill is based on how close to the bull's eye they can hit consistently.
Let there be four persons playing, Person A, B, C and D, and let their results be as given in figure 1.
It can then be said from the figure that Player A is both valid and reliable: Player A not just
achieves the desired result (valid) but also does it consistently (reliable).
5. Difficulty level: A question paper or any test instrument is generally administered to a group
which is of about the same age and in the same grade/standard. Thus, the test instrument must be
made to a difficulty level suitable to the group. Item difficulty is simply the percentage of students
taking the test who answered the item correctly. The larger the percentage getting an item right,
the easier the item: the higher the difficulty index, the easier the item is understood to be.
For example, of two questions about the same individual, one requiring recall and one requiring
recognition, it is relatively easier to recognize the individual in the second question than in the first.
Also for example, an English test item that is very difficult for an elementary student will be very
easy for a high school student.
The difficulty index tells us how difficult the item is, i.e. how many people got the item correct. It is
calculated as follows:

    Difficulty index = (Uc + Lc) / T

where Uc is the number of people in the upper group who answered the item correctly, Lc is the
number of people in the lower group who answered the item correctly, and T is the total number of
responses to the item.

For example, in a class, if out of the top 10 students 9 gave a correct response to the question
"Who is the president of India?" and out of the bottom 10 students only 4 gave a correct
response to the same, the difficulty level of the question would be:

    (9 + 4) / 20 = 0.65 = 65%

This means that 65% of the students could answer the question correctly.
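The difficulty-index formula can be wrapped in a small helper; this is a sketch, and the function name is ours, not the module's.

```python
# Difficulty index: p = (Uc + Lc) / T, the fraction of all
# test-takers who answered the item correctly.
def difficulty_index(upper_correct, lower_correct, total_responses):
    """Higher values mean an easier item."""
    return (upper_correct + lower_correct) / total_responses

# Worked example from the text: 9 of the top 10 and 4 of the
# bottom 10 students answered correctly, so T = 20.
p = difficulty_index(9, 4, 20)
print(p)  # 0.65, i.e. 65% answered correctly
```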
6. Discrimination Value: Even though it has been stated that a test instrument must be suited to a
homogeneous group, it should still be able to distinguish between the different ability levels of the
individuals being tested. The darts test should be able to discriminate between a novice,
an amateur and an expert.
A good item discriminates between those who do well on the test and those who do poorly. The
item discrimination index, D can be computed to determine the discriminating power of an item. If
a test is given to a large group of people, the discriminating power of an item can be measured by
comparing the number of people with high test scores who answered that item correctly with the
number of people with low scores who answered the same item correctly. If a particular item is
doing a good job of discriminating between those who score high and those who score low, more
people in the top-scoring group will have answered the item correctly. Discrimination index D is
given by:
    D = (Uc - Lc) / (T/2)

where Uc is the number of people in the upper group who answered the item correctly, Lc is the
number of people in the lower group who answered the item correctly, U and L are the number of
people in the upper and lower groups respectively, and T is the total number of responses to the item.

For example, if 15 out of 20 persons in the upper group answered a particular question correctly
and 5 out of 30 people in the lower group answered the same question correctly, then:

    D = (15 - 5) / ((20 + 30)/2) = 10/25 = 0.4
The higher the discrimination index, the better the item because such a value indicates that the
item discriminates in favor of the upper group, which should get more items correct.
An item that everyone gets correct or that everyone gets incorrect will have a discrimination index
equal to zero.
When more students in the lower group than in the upper group select the right answer to an item,
the item actually has a negative discrimination index.
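The discrimination-index formula can likewise be written as a small helper (a sketch; the function name is ours).

```python
# Discrimination index: D = (Uc - Lc) / (T / 2).
# Positive D favors high scorers; negative D favors low scorers;
# D near zero means the item does not discriminate.
def discrimination_index(upper_correct, lower_correct, total_responses):
    return (upper_correct - lower_correct) / (total_responses / 2)

# Worked example from the text: 15 of 20 upper-group and 5 of 30
# lower-group examinees answered correctly (T = 50).
d = discrimination_index(15, 5, 50)
print(d)  # 0.4
```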
While it is important to analyze the performance of individual test items (reliability, difficulty level,
discrimination value, etc.) it is also important to analyze the overall performance of the complete test or its
subsections. These criteria are measured using certain statistical measures, primarily the measures of
central tendency (mean, median and mode) and the standard deviation (a measure of dispersion). The
mean, median and mode show how the test scores cluster together, and the standard deviation shows
how widely the scores are spread out.
Mean (also called average): For a data set, the mean is the sum of the observations divided by the number
of observations.
    Mean = (x1 + x2 + … + xn) / n, i.e. (1/n) Σ xi

For example, the arithmetic mean of 34, 27, 45, 55, 22, 34 (six values) is (34+27+45+55+22+34)/6 =
217/6 ≈ 36.17.
Median is described as the number separating the higher half of a data set from the lower half.
For example, consider the dataset {1, 2, 2, 2, 3, 9}. The median is 2 in this case.
Mode is the value that occurs the most frequently in a data set.
For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.
Standard deviation of a data set is a measure of the spread of its values. It is a measure of dispersion that
takes every test score into account. Simply put, it is the average amount by which each student's score
deviates (differs) from the mean of the class. The standard deviation is usually denoted by the Greek
letter σ:

    σ = sqrt( (1/n) Σ (xi - x̄)² )

For example, the standard deviation of 34, 27, 45, 55, 22, 34 (six values) is about 11.01 by this formula,
or 12.06 if the sum of squared deviations is divided by n - 1 (the "sample" standard deviation commonly
used in practice).
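These four measures can be computed with Python's standard statistics module. A sketch, using the six scores from the example; note the two standard-deviation variants (dividing by n versus n - 1).

```python
# Mean, median, mode and standard deviation of a small score set.
import statistics

scores = [34, 27, 45, 55, 22, 34]

mean   = statistics.mean(scores)     # sum / n
median = statistics.median(scores)   # middle value(s) of the sorted data
mode   = statistics.mode(scores)     # most frequent value
sigma  = statistics.pstdev(scores)   # population SD: divides by n
sample = statistics.stdev(scores)    # sample SD: divides by n - 1

print(round(mean, 3), median, mode, round(sigma, 2), round(sample, 2))
# 36.167 34.0 34 11.01 12.06
```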
These measures of central tendency and dispersion show how appropriately a test has been designed for its
intended purpose. They help the examiner determine the level of difficulty required and how well
different levels of students can be differentiated. If the test results show skewness, that is, either a
clustering of marks towards the top or a clustering towards the bottom, the examiner may conclude that
the test designed is too easy or too difficult for the students.
Correlation: This concept lays the foundation for most concepts of test analysis. It tells the examiner the
extent to which two or more sets of results agree with each other. For example:
Case 1: Suppose the results of two tests for the same set of students show that the students ranked
identically on the two tests, that is, all ranks are the same for both tests. This shows a perfect positive
correlation, or a correlation of +1.
Case 2: Suppose instead that the ranks on the two tests are as different from each other as it is possible
to be: the student who was ranked 1 in the first test was ranked last in the second test, and vice versa.
This shows a perfect negative correlation, or a correlation of -1.
Case 3: Finally, suppose the results of two tests for the same set of students show no visible pattern
between the Test 1 ranks and the Test 4 ranks (a scatter plot of Test 1 rank against Test 4 rank would
show the points scattered at random). Hence it can be said that there is no correlation.
However, in most situations there will be some amount of association, and to measure this association,
whether positive or negative, the coefficient of correlation is used. The following table may be used as a
basis for interpreting the coefficient of correlation (see note ii):
    Correlation    Small          Medium         Large
    Negative       -0.3 to -0.1   -0.5 to -0.3   -1.0 to -0.5
    Positive        0.1 to 0.3     0.3 to 0.5     0.5 to 1.0
The coefficient of correlation r is computed as:

    r = (nΣXY - ΣXΣY) / sqrt( [nΣX² - (ΣX)²] [nΣY² - (ΣY)²] )
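The correlation formula can be written out directly in code. This is a sketch (the function name is ours); as a sanity check, perfectly agreeing ranks give r = +1 and fully reversed ranks give r = -1, matching Cases 1 and 2 above.

```python
# Pearson correlation coefficient:
# r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx**2) * (n*Syy - Sy**2))
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

ranks = [1, 2, 3, 4, 5]
print(pearson_r(ranks, ranks))        # 1.0  (identical rankings)
print(pearson_r(ranks, ranks[::-1]))  # -1.0 (reversed rankings)
```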
Points to remember:
A good test satisfies the criteria of objectivity, practicability, reliability, validity, difficulty level
and discriminatory power.
Objectivity is said to be ensured when the paper setter is given a design/ method to follow.
All test instruments should be easily usable and have simple and clear instructions for
administration of the instrument.
A test instrument is said to be reliable if it produces the same result every time.
The test instrument must be made to a difficulty level suitable to the group.
A test item should be able to distinguish between the different ability levels of different
individuals being tested.
Exercises
Q1. A vocabulary test was conducted with persons from various age groups. Determine for the testing
authority whether there was any relationship between age and the marks obtained.

x = age of person, y = marks obtained

      x      y       x²      y²         xy
      9     28.4     81      806.56     255.6
     15     29.3     225     858.49     439.5
     24     37.6     576     1413.76    902.4
     30     36.2     900     1310.44   1086.0
     38     36.5    1444     1332.25   1387.0
     46     35.3    2116     1246.09   1623.8
     53     36.2    2809     1310.44   1918.6
     60     44.1    3600     1944.81   2646.0
     64     44.8    4096     2007.04   2867.2
     76     47.2    5776     2227.84   3587.2
    ------------------------------------------
    415    375.6   21623    14457.72  16713.3

    r = (nΣxy - ΣxΣy) / sqrt( [nΣx² - (Σx)²] [nΣy² - (Σy)²] )
      = (10 × 16713.3 - 415 × 375.6) / sqrt( (10 × 21623 - 415²)(10 × 14457.72 - 375.6²) )
      = 11259 / sqrt(44005 × 3501.84)
      ≈ 0.91

There is thus a strong positive relationship between age and the marks obtained.
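The Q1 computation can be checked in code; this sketch applies the module's correlation formula directly to the age/marks data.

```python
# Correlation between age (x) and vocabulary-test marks (y) for Q1.
from math import sqrt

ages  = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
marks = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]

n = len(ages)
sx, sy = sum(ages), sum(marks)
sxx = sum(x * x for x in ages)
syy = sum(y * y for y in marks)
sxy = sum(x * y for x, y in zip(ages, marks))

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 2))  # 0.91: a strong positive relationship
```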
Q2. Ten students answered a 10-item test. In the table below, 1 indicates a correct response and 0 indicates
an incorrect response.

              Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10   Score
    Amit       1   0   1   1   1   1   1   1   1   1       9
    Prakash    1   1   0   1   1   1   1   1   1   1       9
    Rahul      1   1   1   1   1   1   1   1   1   1      10
    Gina       0   1   1   1   1   1   0   0   0   1       6
    Tom        1   1   1   1   0   0   0   1   1   1       7
    Ritu       1   1   0   0   0   1   1   1   1   1       7
    Kriti      1   1   1   1   1   0   0   0   1   1       7
    Prerna     0   1   0   1   1   1   0   1   0   1       6
    Bhim       0   1   1   0   0   0   0   0   0   1       3
    Arjun      1   1   0   0   0   0   0   0   0   1       3
    Correct    7   9   6   7   6   6   4   6   6  10

Mean = (9 + 9 + 10 + … + 3)/10 = 67/10 = 6.7
Median (the middle score when all scores are put in rank order) = 7
Mode (the score occurring most often) = 7
Range (low score to high score) = 3 to 10
Arranging the above table in descending order of total score:

              Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10   Total
    Rahul      1   1   1   1   1   1   1   1   1   1      10
    Amit       1   0   1   1   1   1   1   1   1   1       9
    Prakash    1   1   0   1   1   1   1   1   1   1       9
    Tom        1   1   1   1   0   0   0   1   1   1       7
    Ritu       1   1   0   0   0   1   1   1   1   1       7
    Kriti      1   1   1   1   1   0   0   0   1   1       7
    Gina       0   1   1   1   1   1   0   0   0   1       6
    Prerna     0   1   0   1   1   1   0   1   0   1       6
    Bhim       0   1   1   0   0   0   0   0   0   1       3
    Arjun      1   1   0   0   0   0   0   0   0   1       3
Let us consider students getting a score of 7 and above as the upper group and those getting below 7 as
the lower group, and use (Uc + Lc)/T to calculate item difficulty:

           Uc   Lc   Difficulty
    Q1      6    1      70%
    Q2      5    4      90%
    Q3      4    2      60%
    Q4      5    2      70%
    Q5      4    2      60%
    Q6      4    2      60%
    Q7      4    0      40%
    Q8      5    1      60%
    Q9      6    0      60%
    Q10     6    4     100%
Discrimination index, D = (Uc - Lc)/(T/2):

           Uc   Lc     D
    Q1      6    1   1.00
    Q2      5    4   0.20
    Q3      4    2   0.40
    Q4      5    2   0.60
    Q5      4    2   0.40
    Q6      4    2   0.40
    Q7      4    0   0.80
    Q8      5    1   0.80
    Q9      6    0   1.20
    Q10     6    4   0.40

Note that D can exceed 1 here, as for Q9, because the upper and lower groups are of unequal size (6 and
4 students); with equal-sized groups, D lies between -1 and +1.
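The Q2 item analysis can be reproduced in code. This is a sketch: the response matrix is reconstructed from the exercise tables, and the formulas are the module's own definitions, difficulty = (Uc + Lc)/T and discrimination D = (Uc - Lc)/(T/2).

```python
# Item analysis for Q2: difficulty and discrimination per question.
# Rows are students, columns are questions Q1..Q10 (1 = correct).
students = {
    "Amit":    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "Prakash": [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
    "Rahul":   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Gina":    [0, 1, 1, 1, 1, 1, 0, 0, 0, 1],
    "Tom":     [1, 1, 1, 1, 0, 0, 0, 1, 1, 1],
    "Ritu":    [1, 1, 0, 0, 0, 1, 1, 1, 1, 1],
    "Kriti":   [1, 1, 1, 1, 1, 0, 0, 0, 1, 1],
    "Prerna":  [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "Bhim":    [0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    "Arjun":   [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
}

# Upper group: total score 7 or more; lower group: below 7.
upper = [s for s, resp in students.items() if sum(resp) >= 7]
lower = [s for s, resp in students.items() if sum(resp) < 7]
T = len(students)

for q in range(10):
    uc = sum(students[s][q] for s in upper)  # upper-group correct
    lc = sum(students[s][q] for s in lower)  # lower-group correct
    difficulty = (uc + lc) / T
    discrimination = (uc - lc) / (T / 2)
    print(f"Q{q + 1}: Uc={uc} Lc={lc} "
          f"difficulty={difficulty:.2f} D={discrimination:.2f}")
```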
Q3. A BPO firm wants to re-examine its recruitment strategy for tele-callers. It has some past data on the
performance of existing employees in their jobs and the scores on three tests that they had taken at the
time of their recruitment. Examine these scores and suggest a future recruitment strategy for the firm.
(1 = successful tele-caller, 0 = not successful.)

    Successful    English        Vocabulary   Performance in
    tele-caller   grammar test   test         verbal ability test
        1             9              3               8
        1            10              3               7
        1             9              4               8
        0             4              5               4
        1             9              0               9
        0             5              9               4
        1             9              2               7
        0             8              9               3
        0             2              6               5
        1             7              3              10
        0             7              2               5
        0             6              0               2
        0             4              0               6
        1             8             10               8
        1             6             10               8
        1             8              0               7
        0             5              4               4
        1            10              7               9
        0             5              0               3
        1             8              0              10
        0             6             10               5
        0             5              5               4
        1             8              6               9
        0             3              4               5
        1             7             10               9
Answer:
Correlation between the construct "successful tele-caller" and the test scores would measure the construct
validity of the tests. A high correlation would indicate the appropriateness of the test. The correlation can
be obtained using the formula:

    r = (nΣXY - ΣXΣY) / sqrt( [nΣX² - (ΣX)²] [nΣY² - (ΣY)²] )

    English grammar test     0.770359
    Vocabulary test         -0.00542
    Verbal ability test      0.897702

The results show that the verbal ability test is the most valid test in measuring the performance of a
tele-caller, followed by the English grammar test. The vocabulary test has no correlation with job
performance and can therefore be discontinued.
Tips for further study:
There are statistical measures for estimating and interpreting reliability and validity, such as Cronbach's
alpha and the kappa coefficient. These can be further studied in the book Statistics for the Social Sciences
by Victoria L. Mantzopoulos, published by Prentice Hall, Englewood Cliffs, NJ (1995).
Colleges like IcfaiTech College of Engineering make use of the principles of standard deviation, mean and
range to assess the reliability of test scores between different teachers teaching the same subject. Some
colleges, like IBS, Hyderabad, also use such measures extensively.
Notes:
i. Developing the perfect test is the unattainable goal for anyone in an evaluative position. Even when
guidelines for constructing fair and systematic tests are followed, a plethora of factors may enter into a
student's perception of the test items. Looking at an item's difficulty and discrimination will assist the test
developer in determining what is wrong with individual items. Item and test analysis provide empirical data
about how individual items and whole tests are performing in real test situations. Test designers need to
meet certain requirements concerning validity, objectivity and reliability for the items and for the test
itself; they also have to follow some logical procedures.
ii. Even though guidelines for interpreting the coefficient of correlation have been given, all such
criteria are in some ways arbitrary and should not be observed too strictly. This is because the interpretation
of a correlation coefficient depends on the context and purposes. A correlation of 0.9 may be very low if
one is verifying a physical law using high-quality instruments, but may be regarded as very high in the
social sciences, where there may be a greater contribution from complicating factors.