




Volume: 6
Pages: 499-504
Document ID: 2022PEMJ455
DOI: 10.5281/zenodo.7486865
Manuscript Accepted: 2022-25-12
Psych Educ,2022, 6: 499-504, Document ID: PEMJ455, doi:10.5281/zenodo.7486865, ISSN 2822-4353
Research Article

Discrimination and Difficulty Indices of a Senior High School Entrance Examination Using Classical Test Theory
Jeffrey Imer C. Salim*
For affiliations and correspondence, see the last page.

Measurement of the psychological capacities of a person is done worldwide through the use of
achievement testing. It is therefore important that institutions that use achievement tests create
correct, relevant, and reliable test constructs in order to obtain beneficial results. This study
evaluated the Discrimination and Difficulty Indices of the Annual Senior High School
Entrance Examination, which consists of 75 English, 30 Science, 40 Mathematics, and 25 Aptitude
multiple-choice questions, of the Senior High School Department of Mindanao State University -
Tawi-Tawi College of Technology and Oceanography using Classical Test Theory. A descriptive
quantitative design was employed, and raw data from the scored answer sheets of 200 examinees were
utilized. Stratified sampling was applied to the raw data. Then, a computer application, the Statistical
Package for the Social Sciences (SPSS), was employed to determine the discrimination and difficulty
indices. The study concluded that most of the multiple-choice items of the examination have
difficulty values less than 0.5, which means these items are difficult for the takers, and discrimination
values higher than 0.2, which means they can be considered good items. The results also implied that
the test constructs are highly reliable. The study recommends further enhancement of the examination.

Keywords: psychological testing, achievement tests, discrimination index, difficulty index, classical
test theory

Introduction

Measuring a person's mental capabilities dates back 2000 years to China, where the earliest written exam was recorded. It was a civil service examination given to people, regardless of social status, to qualify for a government position.

In 1886, psychological tests started to be developed and established. Brown and Thomson (1921) made an important contribution to the development of test theory and factor analysis through a treatise on psychophysics, correlation, and ability testing.

The problem of improving and quantifying psychological measurement is addressed by doing psychological testing.

The Senior High School Department of Mindanao State University – Tawi-Tawi College of Technology and Oceanography conducts an annual entrance examination for prospective students across the Tawi-Tawi Archipelago. This examination consists of 75 English, 30 Science, 40 Mathematics, and 25 Aptitude multiple-choice questions. It is crafted to assess the ability of the students to cope with the standards set forth by the university for its senior high school.

Literature Review

Testing Psychological Capabilities

Psychological testing is a process for selecting, administering, and interpreting scores on a test in an applied setting. A test with stellar psychometric properties might be used to make judgements that are not fair. Test fairness is an important social issue. The psychometric properties of tests, including information about test score biases, should always be one factor that informs the use of tests in applied settings. According to Price (2017), the primary goal of psychological measurement is to describe the psychological attributes of individuals and the differences among them. Measurement theory, in addition, is a branch of applied statistics that describes and evaluates the quality of measurements (which includes the response process that generates specific score patterns by persons), with the objective of improving usefulness and accuracy.

According to Gregory (2000), psychological measures or assessment measures involve quantification techniques. In other words, psychological measurement is a vibrant process, repeatedly changing, but retaining its original structures. Moreover, psychological measurement can be a complicated process with the following characteristics: scores or categories,


behavior samples, norms and standards, standardized procedures, and prediction of non-test behavior.

Essay, multiple-choice, and performance items are cognitive items that are used in academic achievement tests. These are often broadly categorized into objective items and performance assessments. The former are more structured and usually have only one correct answer. They are divided into two categories: selection-type (or recognition-type) items and supply-type items. Examples of the selection type include multiple-choice, true-or-false, and matching-type tests, wherein the respondent is required to distinguish the correct answer from among those provided. The latter, supply-type items, on the other hand, require the respondent to generate the right answer, as in sentence-completion or short-answer tests.

The most versatile of all item types are the multiple-choice items. It is often concluded that multiple-choice items can only measure rote recall of information, but when they are cleverly constructed, they are capable of tapping into higher-level cognitive processes such as analysis and synthesis of information. Items that require the respondent to detect similarities or differences, interpret graphs or tables, make comparisons, or mold previously learned material into a new context emphasize higher-level cognitive processes, and these are appropriate for a wide variety of subject matter. Another benefit of multiple-choice items is the fact that they can provide useful diagnostic information regarding the respondent’s misunderstandings. Foils, also called distractors or incorrect options, must be based on common misconceptions or errors (Bandalos, 2018).

Establishing the psychometric properties of the test items, which in turn promotes better outcome-based results from the test questionnaire, requires psychological testing.

To make better use of psychological testing, item analysis is done on each of the questions in the test questionnaire. Item analysis is an important phase in the development of a test. It will reveal if an item is too easy, too difficult, or scored incorrectly. Moreover, it will also show a difference between skilled and unskilled [examinees].

Item analysis is a term that refers to a wide range of strategies, both qualitative and quantitative, that are used to assess the quality of a pool of items. These are usually used during the scale development process to help choose the best set of items from a pool of potential candidates. The procedures in item analysis are basic to the scale development process and are a crucial antecedent to reliability and validity studies (Bandalos, 2018).

Different theories can be used to evaluate the different perspectives of a test and the items on it. Two of them are Classical Test Theory (CTT) and Item Response Theory (IRT). They are used in educational measurement to develop, evaluate, and study test items. These are based on different assumptions and also use different statistical approaches. Their concerns are not only to develop, evaluate, or determine the reliability and validity of a test but also to improve the quality of test items holistically (Awopeju, 2008).

In measurement theory, inconsistencies across test items, occasions, and raters are known as measurement errors, and a theory known as classical test theory is used to describe the effects of measurement error on test scores (Bandalos, 2018).

Classical Test Theory (CTT)

One of the world’s oldest measurement theories of behavioral or psychological measurement is Classical True Score Theory, often called Classical Test Theory (CTT). According to Gullicksen (1950), CTT is called “classical” because it is regarded to be the first operational use of mathematics to describe this relationship.

According to Teo (2013, as cited by Sallil, 2017), the primary feature of CTT is its adherence to learning theories that follow notions of classical and operant conditioning, for example, behaviorism, social learning theory, and motivation. In CTT, the domain, with its theoretical parameters, can be accurately sampled by the test items or exercises. It focuses on determining the degree to which the examinee has mastered the domain, which is the individual’s implied true score, inferred through responses to the test’s stimuli.

The foundation for the CTT model was laid down by Spearman (1907). He stated that any observed test score can be seen as the composite of two hypothetical components, which are a true score and a random error component.

According to De Champlain (2009, as cited in Sallil, 2017), the main advantage of CTT is the fact that it is based on relatively weak assumptions that are easy to meet with real data and modest sample sizes. In addition, CTT is easy to apply in many testing situations (Hambleton & Jones, 1993). CTT is also the most common paradigm for scale development and validation; it decomposes the observed score into True Score +


Error, and the probability of a given item response is a function of the person to whom the item is administered and the nature of the item. This model is simple to use and requires little mathematical knowledge on the part of the user. In most medical education settings, where the aim is to develop assessments that will be used locally, with little or no intention to generalize beyond that setting, Classical Test Theory is beneficial in assessing the difficulty and discrimination values of a certain test item and the reliability with which scores are measured by an examination. One disadvantage of CTT is that it does not provide sample-free estimates of population values, which means that item difficulty and discrimination, as well as reliability estimates, are dependent upon test scores from samples.

The basis of CTT is the assumption that an examinee has an observed score and a true score. The observed score of a test-taker is a combination of an estimate of the true score of that test-taker plus or minus some unobserved error. The true score reflects the knowledge of the examinee, but with consideration of contamination by different sources of error. Moreover, CTT utilizes item characteristics, namely item difficulty and item discrimination, whose values are dependent upon the distribution of examinee proficiency within the sample.

Awopeju (2008) stated that although the assumptions upon which classical test theory is based allow it to be applied to an assortment of test construction, these same assumptions appear to have shortcomings in the classical test theory model.

Classical Test Theory approaches are still used today; however, there is also a modern test theory known as Item Response Theory (IRT). CTT has clear shortcomings, which is the reason that modern test theory emerged. IRT was developed to address such issues brought about by CTT.

CTT has dominated the methods used in the application of test theories to assessments. In 1904, Charles Spearman figured out how to correct a correlation coefficient for measurement error and how to obtain the reliability index needed in making such a correction. This became Spearman’s model, which was expressed in the following form:

Therefore:

Figure 1. The distribution of observed scores around the true score

Figure 1 shows the distribution of observed scores around the true score. The error scores are seen as being random. If these error scores were not random, they would not cancel each other out under repeated testing, and the average of the repeated scores would not be equal to the true score. In CTT, the error scores are treated as random, and this results in a normal distribution of observed scores around the true score (Bandalos, 2018).

Statistical indices based on CTT rest on weak assumptions and are easier to compute, manipulate, and understand; thereby, CTT is easy to use (Hambleton & Jones, 1993).

Difficulty and Discrimination Indices

The Item Difficulty Index for a total number of examinees was calculated by Osarumwense and Oyedeji (2015) using the formula:
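The two relations named above, Spearman's model and the item difficulty index, can be sketched in conventional notation. The symbols X, T, E, R, and N are assumptions chosen for illustration and are not taken from the original. The difficulty index is given in its usual proportion-correct form, which may differ in detail from the formula of Osarumwense and Oyedeji (2015).

```latex
% Spearman's model: observed score = true score + random error
X = T + E

% Item difficulty index: proportion of examinees answering the item correctly
% R = number of examinees with the correct response, N = total number of examinees
P = \frac{R}{N}
```

Under this form, P ranges from 0 to 1, and lower values indicate more difficult items, in line with the reading that items with P below 0.5 are difficult for the takers.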


The percentage sample was used for the computation of the Difficulty Index. The scripts were arranged in descending order of the performance of the examinees, and the first 27% of the scripts, called the upper group U, and the last 27% of the scripts, called the lower group L, were taken:

For a better understanding of the values of the item difficulty index of CTT, the intervals with the corresponding interpretation in Table 1 will be used.

Table 1. Interpretation of the Difficulty Index (P)

The Discrimination Index, on the other hand, is computed using the difference between the percentage of students in the upper group (PU), i.e., the top 27% scorers, who obtained the correct response, and the percentage of those in the lower group (PL), i.e., the bottom 27% scorers, who obtained the correct response; thus:

D = PU - PL

For a better understanding of the values of the item discrimination index of CTT, the intervals with the corresponding interpretation in Table 2 will be used.

Table 2. Interpretation of the Discrimination Index (D)

Methodology

The research design used for this study is the descriptive quantitative design, which involves observing and describing the behavior of data (quantitative data) without influencing it in any way. Scored answer sheets of the Senior High School Entrance Examination of Mindanao State University – Tawi-Tawi College of Technology and Oceanography given in November 2018 were used as the data of this study, with the necessary approval for use by the MSU TCTO Admissions Office. To prevent bias, stratified sampling was applied. Respondents were grouped into different strata (per municipality) in order to have a proper distribution of the test takers. From the strata, random envelopes containing the answer sheets were picked until the desired number of respondents was reached.

Each correct and wrong answer was tallied using MS Excel: 1 for correct answers and 0 for wrong answers. Moreover, the names and total scores of the students were represented by numerical values. The study applied the Classical Test Theory (CTT) formulas using the Statistical Package for the Social Sciences (SPSS) to determine the difficulty and discrimination indices of the test. A statistician was consulted for the proper use of the program.
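The item analysis described in this section, tallying each answer as 1 or 0, ranking examinees by total score, and comparing the top and bottom 27%, can be sketched in Python. This is a minimal illustration and not the SPSS procedure used in the study. The function name item_indices, the toy data, and the choice to compute the difficulty index as the proportion of all examinees answering correctly are assumptions.

```python
from typing import List, Tuple


def item_indices(responses: List[List[int]],
                 fraction: float = 0.27) -> List[Tuple[float, float]]:
    """Return a (difficulty P, discrimination D) pair for each item.

    responses holds one row per examinee and one 0/1 entry per item
    (1 = correct, 0 = wrong).
    """
    n_examinees = len(responses)
    n_items = len(responses[0])

    # Arrange the scripts in descending order of total score.
    ranked = sorted(responses, key=sum, reverse=True)

    # Upper and lower groups: the first and last 27% of the ranked scripts.
    k = max(1, round(fraction * n_examinees))
    upper, lower = ranked[:k], ranked[-k:]

    indices = []
    for j in range(n_items):
        # Assumption: P is the proportion of all examinees answering correctly.
        p = sum(row[j] for row in responses) / n_examinees
        # D is the difference between the proportions correct in the two groups.
        p_upper = sum(row[j] for row in upper) / k
        p_lower = sum(row[j] for row in lower) / k
        indices.append((p, p_upper - p_lower))
    return indices


if __name__ == "__main__":
    # Hypothetical toy data: 10 examinees, 3 items.
    toy = [
        [1, 1, 1], [1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 0],
        [1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0],
    ]
    for number, (p, d) in enumerate(item_indices(toy), start=1):
        print(f"item {number}: P = {p:.2f}, D = {d:.2f}")
```

Ties in total score are broken by input order in this sketch, so group membership at the 27% boundary can depend on how the scripts are listed. The thresholds discussed in the text (P below 0.5, D above 0.2) could then be applied to each returned pair.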


Results

The following are the results generated using the Statistical Package for the Social Sciences (SPSS).

Figure 2. Difficulty Index of the High School Entrance Examination using CTT

Figure 3. Discrimination Index of the High School Entrance Examination using CTT

Table 3. The Reliability Test for Classical Test Theory by Subject

Discussion

The results showed that most of the items have difficulty values less than 0.5. This means that such items in the test are difficult for the takers. There were three items that were very difficult for the takers. These are item numbers 17, 29, and 108. On the contrary, item number 63 is considered to be the easiest.

Furthermore, results showed that items with zero discrimination values are very few. This implies that such items need to be improved or revised. Most of the items have discrimination values higher than 0.2, which can be considered good items.

Table 3 shows the reliability indices of 0.714, 0.739, and 0.691 in Aptitude, Mathematics, and Science, respectively, which are interpreted as reliable, while the test in Language has a reliability index of 0.925 and is interpreted as highly reliable.

Conclusion

The following conclusions are inferred based on the results of the study. The multiple-choice items of the entrance examination of the Senior High School Department of Mindanao State University - Tawi-Tawi College of Technology and Oceanography are reliable, per Classical Test Theory. Most items have a difficulty index less than 0.5, which means that these items are difficult for the takers. Furthermore, test items with zero discrimination values are very few. This means that there are few items to be revised and improved. Most items have discrimination values greater than 0.2, which are considered good items. Annual revision of the examination to reach its optimal reliability is highly recommended.

Acknowledgments

My sincerest gratitude to Mr. Ummar Sallil, MSc, Mr. Ladznar Laja, PhD, and Ms. Annabel Wellms, the Mathematics and Sciences Department of College of Arts and Sciences of Mindanao State University – Tawi-Tawi College of Technology and Oceanography, my family, and to Almighty Allah for helping me during the course of this study.

References

Awopeju, O. A. (2008). Comparative Analysis of Classical Test Theory and Item Response Theory-Based Item Parameter Estimates of Senior School Certificate Mathematics Examination. doi:

Bandalos, D. L. (2018). Measurement Theory and Application for the Social Sciences. The Guilford Press, New York London. pp. 63-69, 120, 157, 159, 404, 407, 420.

Crocker, L. & Algina, J. (1986). Introduction to Classical and Modern Test Theory. Harcourt Brace Jovanovich College Publishers: Fort Worth, pp. 527.

Gregory, R. J. (2000). Psychological Testing. 3rd Edition. Illinois: Allyn and Bacon, Inc.

Gullicksen, H. (1950). Theories of Mental Test Score. New York.

Haladyna, T., & Downing, S. (2004). Construct-irrelevant variance in high-stake testing. Educational Measurement: Issues and Practice, 23(1), 17-27.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of Classical Test


Theory and Item Response Theory and their Application to Test Development. &rep1&type=pdf

Hassan, S. & Hod, R. (2017). Use of Item Analysis to Improve the Quality of Single Best Answer Multiple Choice Question in Summative Assessment of Undergraduate Medical Students in Malaysia. Education in Medicine Journal, 9(3), 33-43.

Kelley, T. L. (1939). The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology, 30, 17-24.

Lord, F. M., & Novick, M. (1969). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Osarumwense, H. J., & Oyedeji, S. O. (2015). Empirical Comparison of Methods of Establishing Item Difficulty Index of Test Items Using Classical Test Theory (CTT).

Price, L. R. (2017). Theory into Practice. The Guilford Press, New York London. pp. 5

Sallil, U. (2017). Estimating Examinee’s Ability in Computerized Adaptive Testing and Non-Adaptive Testing using 3 parameters IRT

Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 18, 161-169.

Affiliations and Corresponding Information

Jeffrey Imer C. Salim
Mindanao State University
Tawi-Tawi College of Technology and Oceanography - Philippines
