Module III EDUC 105
MODULE III
1. construct a paper-and-pencil test in accordance with the guidelines in test
construction;
2. choose a test suited to the topics that were discussed within a grading
period/semester;
3. use Bloom's taxonomy as a guide in developing their test;
4. explain the meaning of item analysis, item validity, reliability, item
difficulty and discrimination index; and
5. determine the validity and reliability of given test items.
There are four lessons in the module. Read each lesson carefully, then
answer the exercises/activities to find out how much you have benefited from it.
Work on these exercises carefully and submit your output on time.
In case you encounter difficulty, discuss it with your teacher during a virtual
meeting.
Lesson 1
TRUE-FALSE TEST
Binomial-choice or alternate-response tests are tests that have only two options,
such as true or false, right or wrong, yes or no, good or better, check or cross out,
and so on. A student who knows nothing of the content of the examination would
have a 50% chance of getting the correct answer by sheer guesswork. Although
correction-for-guessing formulas exist, it is best that the teacher ensures that a
true-false item is able to discriminate properly between those who know and
those who are just guessing. A modified true-false test can offset the effect of
guessing by requiring students to explain their answer and to disregard a correct
answer if the explanation is incorrect. Here are some rules of thumb in
constructing true-false items.
Rule 5. Avoid lifting statements from the textbook word for word. Verbatim statements
merely reward memorizing the textbook word for word and, thus, acquisition of
higher-level thinking skills is not given due importance.
Rule 6. Avoid specific determiners or give-away qualifiers. Students quickly learn
that strongly worded statements are more likely to be false than true, for
example, statements with "never," "no," "all," or "always." Conversely,
moderately worded statements, those using "many," "often," "sometimes,"
"generally," "frequently," or "some," are more likely to be true than false.
Such qualifiers should therefore be avoided. e.g., "Executives usually suffer
from hyperacidity." The statement tends to be correct; the word "usually"
leads to the answer.
Rule 7. With true or false questions, avoid a grossly disproportionate number of
either true or false statements or even patterns in the occurrence of true and
false statements.
(Disproportionate)              (Patterned)
1. T    6. F                    1. T    6. F
2. F    7. F                    2. F    7. T
3. F    8. F          or        3. T    8. F
4. F    9. F                    4. F    9. T
5. F    10. F                   5. T    10. F
The multiple choice type of test offers the student more than two (2)
options per item to choose from. Each item in a multiple choice test consists of
two parts: (a) the stem and (b) the options. In the set of options, there is a
"correct" or "best" option while all the others are considered "distracters." The
distracters are chosen in such a way that they are attractive to those who do not
know the answer or who are guessing but, at the same time, have no appeal to
those who actually know the answer. It is this feature of multiple choice
tests that allows the teacher to test higher order thinking skills even if the
options are clearly stated. As in true-false items, there are certain rules of thumb
to be followed in constructing multiple choice tests.
GUIDELINES FOR CONSTRUCTING MULTIPLE CHOICE ITEMS
1) Do not use unfamiliar words, terms and phrases. The ability of the item to
discriminate or its level of difficulty should stem from the subject matter rather
than from the wording of the question.
Example: What would be the system reliability of a computer system whose slave
and peripherals are connected in parallel circuits and each one has a known time
to failure probability of 0.05?
A student completely unfamiliar with the terms "slave" and "peripherals"
may not be able to answer correctly even if he knew the subject matter of
reliability.
2) Do not use modifiers that are vague and whose meanings can differ from one
person to the next such as: much, often, usually, etc.
Example:
Much of the process of photosynthesis takes place in the:
a. Bark
b. Leaf
c. Stem
The qualifier "much" is vague and could have been replaced by a more
specific qualifier like "90% of the photosynthetic process" or some similar
phrase that would be more precise. Be quantitative.
Example:
(Poor) As President of the Republic of the Philippines, Corazon Cojuangco
Aquino would stand next to which President of the Philippine Republic
subsequent to the 1986 EDSA Revolution?
(Better) Who was the President of the Philippines after Corazon C. Aquino?
Example:
(Poor) Which of the following will not cause inflation in the
Philippine economy?
(Better) Which of the following will cause inflation in the
Philippine economy?
Example:
The short story "May Day's Eve" was written by which Filipino author?
a. Jose Garcia Villa
b. Nick Joaquin
c. Genoveva Edrosa Matute
d. Robert Frost
e. Edgar Allan Poe
Had the distracters all been Filipino authors, the value of the item would have
been greatly increased. In this particular instance, only the first three carry the
burden of the entire item since the last two can be essentially disregarded by the
students.
7) All multiple choice options should be grammatically consistent with the stem.
Example:
As compared to the autos of the 1960s, autos in the 1980s _________.
A. traveling slower             C. to use less fuel
B. bigger interiors             D. contain more safety measures
Options A, B and C are obviously wrong to the language-smart because,
when added to the stem, the sentence is grammatically wrong. D is the
only option which, when connected to the stem, retains the grammatical
accuracy of the sentence, and thus is obviously the correct answer.
Example:
If the three angles of two triangles are congruent, then the triangles are:
a. congruent whenever one of the sides of the triangles are
congruent
b. similar
c. equiangular and, therefore, must also be congruent
d. equilateral if they are equiangular
The correct choice, "b," may be obvious from its length and explicitness
alone. The other choices are long and tend to explain why they must be
the correct choices, forcing the students to think that they are, in fact,
not the correct answers!
Example:
a. Who will most strongly disagree with the progressivist who claims that
the child should be taught only that which interests him and, if he is
not interested, to wait until the child gets interested?
A. Essentialist C. Progressivist
B. Empiricist D. Rationalist
b. Which group will most strongly focus its teaching on the interest of the
child?
A. Progressivist C. Perennialist
B. Essentialist D. Reconstructionist
One may arrive at the correct answer to item "b" by looking at item "a,"
which gives the answer to "b" away.
10) Use the "None of the above" option only when the keyed answer is totally
correct. When choice of the "best" response is intended, "none of the above" is
not appropriate, since the implication has already been made that the correct
response may be partially inaccurate.
11) Note that use of "all of the above" may allow credit for partial knowledge. In a
multiple option item (allowing only one option choice), if a student only knew
that two (2) options were correct, he could then deduce the correctness of "all
of the above."
12) Better still, use "none of the above" and "all of the above" sparingly, or best,
not at all.
Here are some guidelines to observe in the formulation of good matching type of
test.
1. Match homogeneous, not heterogeneous, items. The items to match must be
homogeneous. If you want your students to match authors with their literary
works, one column will contain authors and the second column must contain
literary works. Don't insert, for instance, a nationality among the names of
authors; that will not make a good item since the odd entry is obviously wrong.
Example of homogeneous items: the items are all about Filipino heroes, nothing
more.
Match the items in Column A with the items in Column B.
4. To help the examinee find the answer more easily, arrange the options
alphabetically or chronologically, whichever is applicable.
5. Like any other test, the direction of the test must be given. The examinees must
know exactly what to do.
Another useful device for testing lower order thinking skills is the supply type of
test. Like the multiple choice test, the items in this kind of test consist of a
stem and a blank where the students write the correct answer.
Supply type tests depend heavily on the way the stems are constructed. These
tests allow for one answer only and, hence, often test only the students' recall
of knowledge.
It is, however, possible to construct supply type of tests that will test higher order
thinking as the following example shows:
Example: Write an appropriate synonym for each of the following. Each blank
corresponds to a letter:
Metamorphose: _ _ _ _ _ _
Flourish: _ _ _ _
The appropriate synonym for the first is CHANGE with six (6) letters while the
appropriate synonym for the second is GROW with four (4) letters. Notice that these
questions require not only mere recall of words but also understanding of these words.
Another example of a completion type of test that measures higher order thinking
skills is given below:
Example: Write G if the item on the left is greater than the item on the right;
L if the item on the left is less than the item on the right; E if the item on the left
equals the item on the right; and D if the relationship cannot be determined.
            A                              B
1. Square root of 9    ______________   a. -3
2. Square of 25        ______________   b. 615
3. 36 inches           ______________   c. 3 meters
4. 4 feet              ______________   d. 48 inches
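The keyed responses for this example can be double-checked with a short Python snippet. The `compare` helper is ours, not part of the module, and we assume the principal square root is intended in item 1:

```python
import math

def compare(left, right):
    """Return G, L, or E depending on how left compares with right."""
    if left > right:
        return "G"
    if left < right:
        return "L"
    return "E"

# Item 1: principal square root of 9 (= 3) vs. -3
print(compare(math.sqrt(9), -3))   # G
# Item 2: square of 25 (= 625) vs. 615
print(compare(25 ** 2, 615))       # G
# Item 3: 36 inches (= 0.9144 m) vs. 3 meters
print(compare(36 * 0.0254, 3))     # L
# Item 4: 4 feet (= 48 inches) vs. 48 inches
print(compare(4 * 12, 48))         # E
```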
1. Avoid over-mutilated sentences like the test item below. Give enough clues to the student.
The _____ produced by the _____ is used by the green _____ to change the _____ and
_____ into _____. This process is called _____.
2. Avoid open-ended items. There should be only one acceptable answer. The item
below is open-ended and hence not a good test item.
Ernest Hemingway wrote ________.
3. The blank should be at the end or near the end of the sentence. The question must
first be asked before an answer is expected. As in the matching type of test, the
stem (where the question is posed) must come first.
Essays
Essays, classified as non-objective tests, allow for the assessment of higher order
thinking skills. Such tests require students to organize their thoughts on a subject
matter in coherent sentences in order to inform an audience. In essay tests, students
are required to write one or more paragraphs on a specific topic.
Essay questions can be used to measure attainment of a variety of objectives.
1. Comparing
- Describe the similarities and differences between …
- Compare the following methods for ...
2. Relating cause-and-effect
- What are the major causes of ....
- What would be the most likely effects of ...
3. Justifying
- Which of the following alternatives would you favor and why?
- Explain why you agree or disagree with the following statement.
4. Summarizing
- State the points included in ...
TYPES OF ESSAYS
Restricted Essays
Restricted essays are also referred to as short focused responses. Examples are
asking students to "write an example," "list three reasons," or "compare and
contrast two techniques."
Rule 5: Grade all of the answers to one question before going on to the next
question. This procedure also helps offset the halo effect in grading. When all of the
answers on one paper are read together, the grader's impression of the paper as a whole
is apt to influence the grades he assigns to the individual answers.
Rule 6: Evaluate answers to essay questions without knowing the identity of the writer.
The best way to prevent our prior knowledge from influencing our judgment is to
evaluate each answer without knowing the identity of the writer. This can be done by
having the students write their names on the back of the paper or by using code
numbers in place of names.
Rule 7: Whenever possible, have two or more persons grade each answer.
The best way to check on the reliability of the scoring of essay answers is to obtain two
or more independent judgments. Although this may not be a feasible practice for
routine classroom testing, it might be done periodically with a fellow teacher (one who
is equally competent in the area). Obtaining two or more independent ratings becomes
especially vital where the results are to be used for important and irreversible
decisions, such as in the selection of students for further training or for special awards.
Some teachers use cumulative criteria, where each student begins with a score of
100. Points are then deducted every time the teacher encounters a mistake or when a
criterion is missed by the student in his essay.
Rule 8: Do not provide optional questions.
It is difficult to construct questions of equal difficulty, so the teacher cannot make a
valid comparison of students' achievement.
Rule 9: Provide information about the value/weight of the question and how it will be
scored.
Rule 10: Emphasize higher level thinking skills.
Lesson 2
ITEM ANALYSIS:
DIFFICULTY INDEX AND DISCRIMINATION INDEX
The teacher normally prepares a draft of the test. Such a draft is subjected to
item analysis and validation in order to ensure that the final version of the test would
be useful and functional. First, the teacher tries out the draft test on a group of students
with characteristics similar to those of the intended test takers (try-out phase). From the try-out
group, each item will be analyzed in terms of its ability to discriminate between those
who know and those who do not know and also its level of difficulty (item analysis
phase). The item analysis will provide information that will allow the teacher to decide
whether to revise or replace an item (item revision phase). Then, finally, the final draft
of the test is subjected to validation if the intent is to make use of the test as a standard
test for the particular unit or grading period.
Item difficulty = number of students with correct answer/ total number of students.
The item difficulty is usually expressed as a percentage.
Example: What is the item difficulty index of an item if 25 students are unable to
answer it correctly while 75 answered it correctly?
Here, the total number of students is 100, hence the item difficulty index is 75/100
or 75%.
Another example: 25 students answered the item correctly while 75 students did
not. The total number of students is 100, so the difficulty index is 25/100 or 0.25,
which is 25%.
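The two computations above can be sketched in a few lines of Python (a minimal illustration; the function name is ours):

```python
def difficulty_index(num_correct, num_students):
    """Item difficulty: proportion of students who answered the item correctly."""
    return num_correct / num_students

# 75 of 100 students answered correctly -> 0.75, i.e. 75%
print(difficulty_index(75, 100))   # 0.75
# 25 of 100 students answered correctly -> 0.25, i.e. 25%
print(difficulty_index(25, 100))   # 0.25
```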
One problem with this type of difficulty index is that it may not actually indicate
that the item is difficult (or easy). A student who does not know the subject matter
will naturally be unable to answer the item correctly even if the question is easy.
Difficult items tend to discriminate between those who know and those who do not
know the answer. Conversely, easy items cannot discriminate between these two
groups of students. We are therefore interested in deriving a measure that will tell
us whether an item can discriminate between these two groups of students. Such a
measure is called an index of discrimination.
An easy way to derive such a measure is to measure how difficult an item is with
respect to those in the upper 25% of the class and how difficult it is with respect to
those in the lower 25% of the class. If the upper 25% of the class found the item easy
yet the lower 25% found it difficult, then the item can discriminate properly
between these two groups.
Thus:
Index of discrimination = DU – DL (U - Upper group; L - Lower group)
Example: Obtain the index of discrimination of an item if the upper 25% of the class
had a difficulty index of 0.60 (i.e. 60% of the upper 25% got the correct answer)
while the lower 25% of the class had a difficulty index of 0.20. Here. DU = 0.60 while
DL = 0.20, thus index of discrimination = .60 - .20 = .40.
The discrimination index is the difference between the proportion of the top scorers
who got an item correct and the proportion of the lowest scorers who got the item
right. The discrimination index ranges between -1 and +1. The closer the
discrimination index is to +1, the more effectively the item can discriminate or
distinguish between the two groups of students. A negative discrimination index
means more students from the lower group got the item correct; such an item is not
good and so must be discarded.
Example:
Item 1        A     B     C     D
Total         0    20    40    20
Upper 25%     0     5    15     0
Lower 25%     0     5     5    10
Here, the keyed answer is option C.
DU = no. of students in the upper 25% with correct response/ no. of students in the
upper 25%
= 15/20 = .75 or 75%
DL = no. of students in the lower 25% with correct response/ no. of students in the
lower 25%
= 5/20 = .25 or 25%
Discrimination Index = DU – DL = .75 - .25 = .50 or 50%.
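The same computation can be written in Python (a small sketch; the function name is illustrative):

```python
def discrimination_index(upper_correct, upper_total, lower_correct, lower_total):
    """D = DU - DL: difficulty in the upper 25% minus difficulty in the lower 25%."""
    du = upper_correct / upper_total   # proportion of upper group answering correctly
    dl = lower_correct / lower_total   # proportion of lower group answering correctly
    return du - dl

# From the worked example: 15 of 20 upper-group students and
# 5 of 20 lower-group students chose the keyed option.
print(discrimination_index(15, 20, 5, 20))   # 0.5
```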
In the case of the index of difficulty, we have the following rule of thumb: an item
whose difficulty index is above .75 is considered easy, while one below .25 is
considered difficult.
Example:
QUESTION      A     B     C     D
Number 1     20     2     0     3
Number 2     10     5     9     1
**The correct answers were printed in color in the original; for Question 1, the
keyed answer is A, chosen by 20 students.
Compute the difficulty of the item by dividing the number of students who chose
the correct answer (20) by the total number of students (25). Using this formula,
the difficulty of Question #1 (referred to as p) is equal to 20/25 or .80. A "rule-of-
thumb" is that if the item difficulty is more than .75, it is an easy item; if the
difficulty is below .25, it is a difficult item.
Item Number                  1    2    3    4    5
No. of Correct Responses     2   10   20   30   15
No. of Students             50   30   30   30   40
Difficulty Index            __   __   __   __   __
Lesson 3
After performing the item analysis and revising the items which need revision,
the next step is to validate the instrument. The purpose of validation is to determine
the characteristics of the whole test itself, namely, the validity and reliability of the
test. Validation is the process of collecting and analyzing evidence to support the
meaningfulness and usefulness of the test.
Validity
Validity is the extent to which a test measures what it purports to measure; it also
refers to the appropriateness, correctness, meaningfulness and usefulness of the
specific decisions a teacher makes based on the test results. These two definitions of
validity differ in the sense that the first refers to the test itself while the
second refers to the decisions made by the teacher based on the test. A test is valid
when it is aligned with the learning outcome.
A teacher who conducts test validation might want to gather different kinds of
evidence.
There are essentially three main types of evidence that may be collected:
Content-related evidence of validity
Criterion-related evidence of validity
Construct-related evidence of validity
Content-related evidence of validity refers to the content and format of the
instrument. How appropriate is the content? How comprehensive? Does it logically get
at the intended variable? How adequately does the sample of items or questions
represent the content to be assessed? Criterion-related evidence of validity refers to
the relationship between scores obtained using the instrument and scores obtained
using one or more other tests (often called criterion). How strong is this relationship?
How well do such scores estimate present or predict future performance of a certain
type?
Construct-related evidence of validity refers to the nature of the psychological
construct or characteristic being measured by the test. How well does a measure of
the construct reflect the characteristic it is intended to capture?
Reliability
Reliability refers to the consistency of the scores obtained: how consistent they
are for each individual from one administration of an instrument to another and
from one set of items to another. We have already given formulas for computing
the reliability of a test; for internal consistency, for instance, we could use the
split-half method or the Kuder-Richardson formulae (KR-20 or KR-21).
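As a sketch (not taken from the module), KR-20 can be computed from a matrix of 0/1 item scores; here we use the population variance of the total scores, and the data are illustrative:

```python
from statistics import pvariance

def kr20(item_scores):
    """KR-20 internal-consistency reliability.

    item_scores: one row per student, each row a list of 0/1 item scores.
    """
    k = len(item_scores[0])                      # number of items
    n = len(item_scores)                         # number of students
    totals = [sum(row) for row in item_scores]   # each student's total score
    # Sum of p*q over items, where p is the proportion answering correctly
    pq_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_scores) / n
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / pvariance(totals))

# Four students, three items (made-up data)
scores = [[1, 1, 1],
          [1, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]
print(round(kr20(scores), 2))   # 0.75
```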
Reliability and validity are related concepts. If an instrument is unreliable, it
cannot yield valid results. As reliability improves, validity may improve (or it
may not). However, if an instrument is shown scientifically to be valid, then it is
almost certain that it is also reliable.
Predictive validity compares the question with an outcome assessed at a later
time. An example of predictive validity is a comparison of scores in the National
Achievement Test (NAT) with first semester grade point average (GPA) in college.
Do NAT scores predict college performance? Construct validity refers to the
ability of a test to measure what it is supposed to measure.
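Predictive validity is typically quantified with a correlation coefficient between the two sets of scores. A minimal Pearson-r sketch (the function name and the NAT/GPA figures are illustrative, not real data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two paired lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical NAT scores and first-semester GPAs for five students
nat = [80, 85, 90, 95, 100]
gpa = [2.0, 2.3, 2.5, 2.8, 3.0]
print(round(pearson_r(nat, gpa), 3))   # 0.998
```

A coefficient this close to +1 would indicate that NAT scores strongly predict first-semester GPA in this (hypothetical) sample.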
Reliability          Interpretation
.90 and above        Excellent reliability; at the level of the best standardized
                     tests
Lesson 4
The mean, mode and median are valid measures of central tendency, but under different
conditions one measure becomes more appropriate than the others. For example, if the scores
are extremely high and extremely low, the median is a better measure of central tendency,
since the mean is affected by extremely high and extremely low scores.
Median
The median is the middle score for a set of scores arranged from lowest to highest. The median
is less affected by extremely low and extremely high scores. How do we find the median?
65 55 89 56 35 14 56 55 87 45 92
To determine the median, first we have to rearrange the scores into order of magnitude (from
smallest to largest).
14 35 45 55 55 56 56 65 87 89 92
Our median is the score at the middle of the distribution. In this case, 56. It is the middle score.
There are 5 scores before it and 5 scores after it. This works fine when you have an odd number
of scores, but what happens when you have an even number of scores? What if you had 10 scores
like the scores below?
65 55 89 56 35 14 56 55 87 45
Arrange the data in order of magnitude (smallest to largest). Then take the middle
two scores (55 and 56) and compute their average. The median is 55.5. This
gives us a reliable picture of the central tendency of the scores; there are indeed
scores of 55 and 56 in the score distribution.
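Both cases above can be checked with Python's standard library, which sorts the scores internally just as the arrange-then-pick procedure does:

```python
from statistics import median

# The eleven scores (odd count) and ten scores (even count) from the examples
odd_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
even_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(median(odd_scores))    # 56: the single middle score
print(median(even_scores))   # 55.5: the average of the two middle scores
```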
Mode
The mode is the most frequent score in a data set. On a histogram or bar chart, it
represents the highest bar. If a score represents the number of times an option is
chosen in a multiple choice test, then the mode is the most popular option. Study the
score distribution given below:
14 35 45 55 55 56 56 65 87 89