
CHAPTER 3: DEVELOPMENT OF TOOLS FOR CLASSROOM-BASED ASSESSMENT

The painstaking process of classroom assessment dictates the deployment of varied


assessment tools to obtain a substantial view of the quality of learning in the classroom.
Classroom assessment tools are employed to collect information that can further be
interpreted, analyzed, and synthesized to facilitate developmental feedback to students
during and after their engagement with the instructional process. They provide useful and
objective measures of learning outcomes. The assessment information can speak of how
well the learners are doing in a class, how effectively the teachers deliver the instruction,
and what more can be done to ensure an effective instructional process. The more teachers
know about what and how students are learning, the better they can plan subsequent instruction.

Learning Outcomes:
At the end of the chapter, the students are able to:
1. identify the different types of tests;
2. implement the essential steps in the development process, which include:
a. re-examination of the target outcomes;
b. determination of the desired competencies to be measured;
c. preparation of a Table of Specifications (TOS); and
d. construction of valid and appropriate classroom assessment tests for
measuring learning outcomes;
3. calculate the validity and reliability of a prepared test;
4. identify blunders in a constructed test; and
5. illustrate the test development process.

Test Development Process


The process of test construction for classroom testing applies the same initial steps
in the construction of any instrument designed to measure a psychological construct. The
process of test development involves three key phases and 12 steps as illustrated in Figure
11.


Figure 11. Test Development Process


Planning phase. This is where the learning outcomes to be assessed and
competencies to be measured are specified (what to test). Based on the target outcomes
and competencies, the teacher decides on the type and method of assessment to be used
(how to test). A table of specifications is prepared to guide the item construction phase.

The important steps in planning for a test are:


1. Re-examine the instructional outcomes. Review the set instructional outcomes.
 Do they cover the various levels of learning taxonomy?
 Do they require the application of higher-order thinking skills (HOTS) or
evoke critical thinking?

2. Determine the competencies to be measured.


 What knowledge, skills, and values are students expected to learn or
master?

3. Decide on the type and method of assessment to use.


 What test can appropriately measure whether the set instructional outcomes are
achieved?
 Can the test cover the learning outcomes intended and essential to be
achieved?
 What test format is best to use?
 How many items can practically be given within the set period of time?

4. Prepare a Table of Specification (TOS)


A table of specifications is a test blueprint that details the content areas to be
covered in a test, the classification of test items according to test type/format,
and the item number or placement in the test to achieve a fair and balanced
sampling of the skills to be tested. There are several formats for the preparation of a
TOS.

Format 1:
Content                             Number of Items
1. Importance of Research           6
2. Types of Research                12
3. Qualities of a Good Researcher   8
4. The Research Process             14
Total                               40


Format 2:
Topics                              Cognitive Level   Type of Test           Item Number   Total Points
1. Importance of Research           Remembering       Enumeration            9-14          6
2. Types of Research                Evaluating        Constructed response   15-26         12
3. Qualities of a Good Researcher   Understanding     True or False          1-8           8
4. The Research Process             Creating          Creating a diagram     27-40         14
Total                                                                                      40

Format 3:
Specific Objectives                            No. of Class   No. of   Cognitive Level    Item
                                               Sessions       Items    (K-C / A / HOTS)   Distribution
1. List the importance of research             1½             6        K-C                9-14
2. Identify and justify the type of research   3              12       HOTS               15-26
   that will best address a given research
   question
3. Distinguish a statement that describes      2              8        K-C                1-8
   a good researcher
4. Create a diagram to illustrate the          3½             14       HOTS               27-40
   research process
Total                                          10             40
K – Knowledge; C – Comprehension; HOTS – Higher Order Thinking Skills


Format 4:
Content                             Class Sessions   Cognitive Level              Total Items   Item Dist
                                    (by hour)        (Rem/Und/App/An/Eval/Crea)
1. Importance of Research           1½               Rem                          6             9-14
2. Types of Research                3                Eval                         12            15-26
3. Qualities of a Good Researcher   2                Und                          8             1-8
4. The Research Process             3½               Crea                         14            27-40
Total                               10                                            40
Rem – Remembering; Und – Understanding; App – Application; An – Analysis; Eval – Evaluating; Crea – Creating

In deciding on the number of items per subtopic, the formula below is observed:

Number of items = (number of class sessions for the topic x desired total number of items)
                  ÷ total number of class sessions

Example: For the topic on the importance of research, the following are given:
Number of class sessions – 1½
Desired total number of items – 40
Total number of class sessions – 10

Number of items = (1½ x 40) ÷ 10 = 6 items
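
The same allocation can be worked out for all topics at once. Below is a minimal, illustrative Python sketch (not part of the original text); the topic names and session counts are taken from the TOS example above.

# Illustrative sketch: allocate test items to topics in proportion to the
# number of class sessions spent on each topic (values from the TOS example).
topics = {
    "Importance of Research": 1.5,
    "Types of Research": 3,
    "Qualities of a Good Researcher": 2,
    "The Research Process": 3.5,
}
desired_total_items = 40
total_sessions = sum(topics.values())  # 10 class sessions

for topic, sessions in topics.items():
    items = round(sessions / total_sessions * desired_total_items)
    print(f"{topic}: {items} items")   # 6, 12, 8, and 14 items respectively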


The length of time, the type of test, and the type of item used are also factors
considered in determining the number of items to be constructed in a test. Gabuyo (2012)
presents an estimated average time to answer each type of test (see Table 4).

Table 4. Average Time to Answer Each Type of Test


TEST TYPE                                             AVERAGE TIME TO ANSWER
True-false                                            30 seconds
Multiple-choice                                       60 seconds
Multiple-choice of higher-level learning objectives   90 seconds
Short answer                                          120 seconds
Completion                                            60 seconds
Matching                                              30 seconds per response
Short essay                                           10-15 minutes
Extended essay                                        30 minutes
Visual image                                          30 seconds
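
Table 4 can be used as a quick planning aid. The following is a minimal, hypothetical Python sketch (the planned item counts are invented purely for illustration) that estimates how long a test will take using the average times above.

# Illustrative sketch: estimate testing time from Table 4's average times.
avg_seconds = {                       # average time per item, from Table 4
    "true-false": 30,
    "multiple-choice": 60,
    "multiple-choice (HOTS)": 90,
    "short answer": 120,
    "completion": 60,
    "matching (per response)": 30,
}
planned_items = {"true-false": 10, "multiple-choice": 20, "short answer": 5}  # hypothetical plan

total_seconds = sum(avg_seconds[kind] * count for kind, count in planned_items.items())
print(f"Estimated testing time: {total_seconds / 60:.0f} minutes")  # 35 minutes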

Test Design and Construction Phase. This is where test items are designed and
constructed following the appropriate item format for the specified learning outcomes of
instruction. The test items are dependent upon the educational outcomes and
materials/topics to be tested. This phase includes the following steps:

1. Item Construction. According to Osterlind (1989), the perils of writing test items
without adequate forethought are great. Decisions about persons, programs,
projects, and materials are often made on the basis of test scores. If a test is made
up of items haphazardly written by untutored persons, the resulting decisions could
be erroneous. Such errors can sometimes have serious consequences for learners,
teachers, and the institution as a whole. Performances, programs, projects, and
materials could be misjudged. Obviously, such a disservice to examinees as well as to
the evaluation process should be avoided if at all possible.
To help classroom teachers improve the quality of test construction, Kubiszyn
and Borich (2007) suggested some general guidelines for writing test items, which consist of
the following:
 Begin writing items far enough in advance so that you will have time to
revise them.
 Match items to intended outcomes of an appropriate level of difficulty to
provide a valid measure of instructional objectives. Limit the question to
the skill being assessed.
 Be sure each item deals with an important aspect of the content area and
not with trivia.
 Be sure the problem posed is clear and unambiguous.
 Be sure that the item is independent of all other items. The answer to
one item should not be required as a condition in answering the next
item. A hint to one answer should not be embedded in another item.
 Be sure the item has one correct or clearly best answer on which experts would
agree.

 Prevent unintended clues to an answer in the statement or question.


 Grammatical inconsistencies such as a or an give clues to the correct
answer to those students who are not well prepared for the test.
 Avoid replication of the textbook in writing test items. Do not quote
directly from the textual materials.
 Avoid tricky questions in an achievement test. Do not waste time testing
how well the students can interpret your intentions.
 Try to write items that require higher-order thinking skills.

Types of Test
To create effective tests, the teacher needs to familiarize himself/herself with
the different types of tests and avoid any pitfalls in test construction through helpful
and definitive guidelines.

Types of Test

  Objective Test
    Selection Test: True-False, Multiple-Choice, Matching
    Supply Test: Short Answer, Completion, Essay (Restricted, Extended)
  Subjective Test
    Performance Test: Simulated Performance, Product-based

Figure 12. Types of Test

 Objective Test. This test consists of questions or items that require factual
answers. This test can be quickly and unambiguously scored by the teacher or
anyone who has the answer key. The response options are structured so that they
can easily be marked as correct or incorrect, thus minimizing subjectivity or
bias on the part of the scorer.

a. Selection Test. In this test type, the students select the best possible
answer/s from the choices that are already given and do not need to
recall facts or information from their memory.


a.1 True-false test contains items with only two fixed choices (binary
options). The students simply recognize a correct or an incorrect
statement that can stand for their knowledge and understanding of
facts or information shared.

TRUE-FALSE TEST

Advantages
 True-false items are easy to formulate.
 The score is more objective than the essay test.
 It covers a large range of content in a short span of time.
 The test is easy to formulate and quick to check and to score.
 It is easier to prepare compared to multiple-choice and matching-type tests.

Disadvantages
 There is a high probability of guessing.
 It is not well-suited for measuring complex mental processes.
 It often measures low-level thinking skills that are limited to the ability to
recognize, recall, and understand information.
Guidelines
 Keep the statement direct, brief, and concise but complete.
 Each statement should only focus on a single idea unless it has the
intention to show a cause-and-effect relationship.
 Use approximately the same number of true and false statements.
 Don’t copy statements directly taken from the textbook.
 Specify clearly in the direction where and how the students should
mark their answers.
 Arrange the true and false items in random order to minimize the
chance for students to detect a pattern of responses.
 BEWARE of using
 trivial and tricky questions;
 opinion-based statement, unless such a statement is attributed
to an author, expert, or a proponent;
 superlatives such as best, worst, largest, etc.;
 negative or double negatives. If these cannot be avoided, bold
or underline the negative words to call the attention of the
examinees; and
 clues to the correct choice through specific determiners such as
some, sometimes, and many that tend to appear in the true
statements; and never, always, all, none that tend to appear in
the statements that are false.


a.2 Matching type test provides two columns for learners to connect or match words,
phrases, or sentences. Column A at the left side contains the descriptions called
premises and column B on the right side contains the options for answers called
responses. The items in Column A are numbered while the items in Column B are
labeled with capital letters. The convention is for learners to match the given
response on the right with the premise on the left.

MATCHING TYPE TEST

Advantages
 It is easy to construct, check, and grade.
 It can cover a lot of content in the given set of tests.
 It provides accurate, efficient, and reliable test scores.
 It is best suited for measuring the student’s ability to do associations.
 The effect of guessing is less compared to true-false and multiple-choice tests.

Disadvantages
 It assesses only low levels of the cognitive domain such as simple recall or
memorization of information.
 Answering matching questions is time-consuming for students.
Guidelines
 The descriptions and options must be short and straightforward.
 Keep the descriptions and options homogenous or interconnected by
themes.
 Place all descriptions on the left side and mark them as Column A, and place the
options (expressed in shorter form) on the right side and mark them as
Column B.
 Make all descriptions and options appear on the same page.
 Allow more options than descriptions or indicate in the directions that
options may be used more than once to decrease the chance of guessing.
 Specify the basis for matching in the direction.
 Avoid too many correct answers.
 When using names, always include the complete name (first and surname)
to avoid ambiguities.
 Arrange the answer choices in a logical order (chronological or alphabetical
order) to help the examinee locate the correct answer quickly.
 Give a minimum of three items and a maximum of seven items for
elementary level and a maximum of seventeen items for secondary and
tertiary levels.


a.3 Multiple-choice test requires test-takers to choose the correct answer from the list
of options given. It includes three parts: the stem, the keyed option, and the
incorrect options or alternatives. The stem represents the problem or question
usually expressed in the completion form or question form. The key option is the
correct answer. The incorrect options or alternatives are also called the distractors
or foils.
MULTIPLE-CHOICE TEST

Advantages
 It can be scored and analyzed efficiently, quickly, and reliably.
 It measures learning outcomes at various levels, from knowledge to evaluation.
 It measures almost any educational objective.
 It measures broad samples of content within a short span of time.
 Its questions/items can further be analyzed in terms of validity and reliability.
 If an item analysis is applied, it can reveal the difficulty of an item and its ability
to discriminate between high-performing and low-performing students.

Disadvantages
 The development of good items in a test is time-consuming.
 Plausible distractors are hard to formulate.
 Test scores can be influenced by other factors such as the test-wiseness or
reading ability of the examinees.
 It is not effective in assessing the problem-solving skills of the students.
 It is not applicable when measuring the ability to organize and express ideas.
Guidelines
 Phrase each question concisely.
 Use simple, precise, and unambiguous wording.
 Avoid the use of trivial and tricky questions.
 Use three to five options to challenge critical thinking and discourage
guessing.
 Present diagram, drawing, or illustration when students are asked to
apply, analyze, or evaluate ideas.
 Use tables, figures, or charts when students are required to interpret
ideas.
 Use pictures, if possible, when students are required to apply concepts
and principles.


Guidelines
 The stem should:
o be written in question or completion form. If blanks are
provided in completion form, they are placed at the end and
NOT at the beginning or in the middle of the
sentence/statement.
o be clear and concise (does not use excessive/irrelevant
words).
o avoid using negative words such as not or except. If this
cannot be avoided, they are written in bold or capital letters
to call the attention of the examinee.
o be free from grammatical clues and errors.
 Options:
o are arranged in a logical order.
o are marked with capital letters.
o are listed vertically beneath the stem.
o provide only one correct or clearly best answer in each item.
o are kept independent and do not overlap with options in
other items.
o are homogeneous in content to raise the difficulty level of an
item.
o are of uniform or equal length as much as possible.
o avoid or sparingly use the phrases “all of the above” and
“none of the above.”
 Distractors:
o should be plausible and effective but not too attractive to be
mistaken by most students as the correct answer. Each
distractor should be chosen by at least 5% of the examinees
but not more than the key answer.
o should be equally familiar to the examinees.
o should not be constructed for the purpose of tricking the
examinees.

b. Supply Test. The supply test, otherwise known as the constructed-response test,
requires students to create and supply their own answers or perform a certain task
to show mastery of knowledge or skills rather than choosing an answer to a
question. It includes short-answer, completion-type, and essay-type items. These tests
can be categorized as either objective or subjective. They are in objective form
when definite answers are asked from the examinees and the scoring is stable or
not influenced by the judgment of the scorers. On the other hand, they are in the
subjective form when students are allowed to answer items in
the test in their own words or using their original ideas.


b.1 Short-answer test contains items that ask students to provide exact answers.
Rather than simply choosing from several options provided, the examinees
either provide clearly-defined answers or compose short phrases for answers.

SHORT ANSWER TEST

Advantages
 It takes a shorter time to answer than an essay test.
 The answers are generally definite, thus easier to check and to score.
 There is greater objectivity and reliability in scoring than essay tests.
 It has more extensive topic coverage compared to an essay test.
 The answers to test questions/items are not pre-selected but supplied by the
examinees.
 There is less chance of guessing.
 Its preparation and administration are easier than an essay test.

Disadvantages
 More emphasis is placed on rote learning.
 It cannot measure ability and attitude.
 It is weak in measuring language skill/expression.
 Objectivity and accuracy in scoring may be influenced by the examinee’s
handwriting and spelling skills.

Guidelines
 Clearly specify in the test direction how the question should be
answered.
 Frame questions/items using words that are easily understood by the
examinees.
 Restate or do not copy exact wordings from the text.
 Make sure that examinees provide factually correct answers.


b.2 Completion test or fill-in-the-blank test requires examinees to supply word/s,
phrase/s, symbol/s, or number/s to complete a statement.

COMPLETION TEST

Advantages
 It is easy to construct.
 It minimizes guessing.
 It has wider coverage in terms of content.

Disadvantages
 It is more difficult or tedious to score than other objective types of tests.
 It is typically not suitable for measuring complex learning outcomes.
Guidelines
 Only the keywords to be supplied by the examinees should be omitted.
 The item should require a single-word answer or brief answers.
 Use only one blank per item. Preferably, place it at the end of the
statement.
 Blanks provided should be equal in length. Their length should provide
sufficient space for the answer.
 Do not use indefinite statements that allow varying answers.
 Indicate the units (e.g., cm, ft, in) when items require numerical
answers.
 Avoid grammatical clues such as a or an.
 Do not copy exact sentences from textbooks.
 Subjective Test. This test allows the students to organize and present answers in their
own words or using their original ideas. This test can be influenced by the judgment
or opinion of the examinees and the scorers; nevertheless, it allows assessment of
aspects of students’ performance that are complex and qualitative. Questions raised
may elicit varied answers that can be expressed in several ways.

b.3 Essay test is a subjective type of test that requires examinees to structure a long
written response to answer a question. This test measures complex cognitive
skills or processes and is usually scored on an opinion basis. It may require the
examinees to give definitions, provide interpretations, make evaluations or
comparisons, contrast concepts, and demonstrate knowledge of the
relationships (Morrow, et al., 2016).

b.3.1 Restricted response essays set limits on the content and response
given by the students. Limitations in the form of the response are well-
specified in the given essay question/item.

Example: Point out the limitations of the objective type of test in 300
words. Your answer will be scored in terms of content
and organization (5 pts.), quality of writing (3 pts.), and
grammar usage and mechanics (2 pts.).


b.3.2. Extended response essays allow the students wide latitude of


expression in terms of the length and complexity of the response. It is
best suited to measure the examinee’s ability to organize, integrate,
synthesize, and evaluate ideas.

Example: Is a valid test reliable? Thoroughly discuss your answer.

Scoring Rubric:

Points   Descriptor
4        The essay demonstrates complete knowledge and understanding of the topic.
         It uses clear and precise language.
3        The essay demonstrates very good knowledge and understanding of the topic.
         It uses clear language with occasional lapses.
2        The essay demonstrates good knowledge and understanding of the topic.
         It uses clear and precise language for the most part.
1        The essay demonstrates little knowledge and understanding of the topic.
         The language is only partly clear and accurate.
0        The essay demonstrates no real knowledge and understanding of the topic.
         The language is not clear and is inaccurate.


ESSAY TEST

Advantages
 It is most useful in assessing higher-order thinking skills.
 It is best for developing logical thinking and critical reasoning.
 It takes less time and is easy to construct.
 It largely eliminates guessing.
 It can effectively reveal personality and measure opinions and attitudes.
 It gives examinees freedom to plan their answers and respond within broad limits.

Disadvantages
 It is difficult to check and score.
 It observes inconsistent and unreliable procedures for scoring.
 Test effectiveness is difficult to analyze and establish.
 Its reliability is often low because of the subjective scoring of the answers.
 It does not allow a larger sampling of content.
 It encourages bluffing. Scoring may be affected by good handwriting, neatness,
grammar, etc.
 It entails excessive use of time for answering.
 Scores may be affected by personal biases or previous impressions.
Guidelines
 Use rubrics for scoring an essay answer.
 Do not begin with who or what in writing your essay question.
 Use clear and unambiguous wording of the essay questions/items.
 Indicate the values or assigned points for every essay question/item.
 All examinees should be required to answer all and the same essay
questions for valid and objective scoring.
 Keep the students anonymous while checking the essay answers.
 Evaluate all answers to one question before going on to the next.
 Make sure that students have ample time to answer the essay test.

c. Performance Test. This assessment type requires students to perform a


task or activity or do an actual demonstration of essential and observable
skills and the creation of products. This may be in the form of simulated
performance or work samples.

c.1 Simulated performance requires examinees to carry out the basic


rudiments of skill in a realistic context rather than simply choosing an
answer from a ready-made list. Examples are recital, dramatic
enactment, role-playing, participating in debate, public speaking, and
entrepreneurial activity, etc.


c.2 Product-based assessment focuses on the final assessable output and not on the
actual performance of making the product. Examples are portfolios,
multimedia presentations, posters, ads, and bulletin boards.

PERFORMANCE TEST

Advantages
 They can measure complex learning outcomes in a natural setting.
 The students can apply the knowledge, skills, and values learned.
 They promote more active student engagement in an activity.
 They can help identify the students’ strengths and weaknesses.
 They provide a more realistic way of assessing performance.

Disadvantages
 Scoring procedures are generally subjective and unreliable.
 They demand a great amount of time for preparation, administration, and scoring.
 They can possibly be costly.
 They rely heavily on students’ creativity and drive.

Guidelines
 Focus on skill or product to be tested. They should relate to the pre-
determined learning outcomes.
 Provide clear directions on the task or product required. Clearly
communicate expectations.
 Minimize dependence on skills that are not relevant to the intended
purpose.
 Use rubrics to rate performance or product.

2. Test Assembling. After constructing the test items, arrange them. There are two
steps in assembling the test: (1) packaging the test; and (2) reproducing the test. Gabuyo
(2012) sets the following guidelines for assembling the test.
 Group all test items with similar format.
 Arrange test items from easy to difficult.
 Space the test items for easy reading.
 Keep items and options on the same page.
 Place the illustrations near the description.
 Check the answer key.
 Decide where to record the answer.

3. Writing directions. All test directions must be complete, explicit, and simply worded. The
type of answer that is elicited from the learners must be clearly specified. The number of
items to which the directions apply, how to record the answers, the basis on which answers
are to be selected, the criteria for scoring or the scoring system, and the time allocated for
each type of test (if so required) should also be indicated in the test instructions.

Example:

WEAK
Direction: Choose the best answer for each given question.

BETTER
Direction: Study the rubric below and identify the correct answer for the questions
that follow (marked as items 6, 7, 8, 9, and 10). Write the CAPITAL LETTER
corresponding to the correct answer on the space provided before each number.
(1 point each)

4. Checking on the assembled test items. Before reproducing the test, it is very important
to proofread the test items first for typographical and grammatical errors and make
the necessary corrections, if any. If possible, let others examine the test. This can save
time during the examination, as errors will no longer cause any distraction to the
students.
Table 5. Checklist for Checking the Test Items
Checklist Yes No
1. Are the test items appropriate to measure the set
learning outcome?
2. Does the test allow learners to demonstrate a range of
learning?
3. Are the directions clear, complete, and precise?
4. Does the test use simple and unambiguous wordings?
5. Does the test include varied test types?
6. Can the test be answered within the allotted time?
7. Are the items of the same format grouped together?
8. Are the test types arranged logically – from simple to
more complex types?
9. Are there no tricky and unnecessary clues in the test?
10. Does the test provide realistic and fair guidelines for
scoring?
11. Are there no spelling, punctuation, and grammatical
errors?
12. Does the test, as a whole, have student appeal?


Reviewing Phase. In this phase, items in the test are examined in terms of their
validity and usability. The test will then be administered to a sample group for reliability
testing. The initial results will be subjected to an item analysis to determine the
discrimination and difficulty indexes of the test before it will be rolled out to the actual
participants.
1. Validating the test. This is done by checking on the relevance, appropriateness,
and meaningfulness of the test to the purpose it claims to serve. This is an
imperative requirement for the accurate application and interpretation of the
test results.

There are different types of validity:


a. Content validity establishes the relevance and representativeness of the
assessment instrument relative to the behavior or targeted construct that it is
designed to measure. It answers the question: Is the test completely
representative of what it desires to measure? This form of validation
assesses the quality of items on a test. Experts are asked to review the test
items judgmentally in reference to the learning outcomes and/or
instructional objectives. The preparation of Table of Specification (TOS)
before test construction strengthens the content validity of a test.

b. Criterion-Related validity measures how well scores on one measure relate


to or predict scores on theoretically similar measures. It addresses the
question: Do the results from a test correspond to a different test of the
same thing?

Example: The scores obtained by the students in the teacher-made and


standardized reading tests indicate similar levels of performance.

This validation can further be classified into two – concurrent and predictive
validity.

b.1 Concurrent validity. Concurrent means occurring or existing side by side.


In this form of validation, the criterion and predictor data are collected
simultaneously, hence, you can estimate individual performance on
different tests at approximately the same time. This test is best to use
when you want to diagnose the students’ current criterion status
(Gabuyo, 2012).

Example: A teacher gives his/her students a test designed to measure


language ability. The scores the students will obtain on a test can be
compared with the test scores in a recognized and duly validated test
tool already held or kept by the school. The scores in the teacher-made
test can be correlated with the scores in the validated test tool using a
statistical formula such as the Pearson Product Moment Coefficient of
Correlation to establish its validity.


b.2 Predictive validity. This approach of criterion validity utilizes the


student’s current test result to estimate his/her performance on some
criterion measure administered at some point in the future. This is
determined by doing correlational, regression, or path analyses in which
the coefficient between the test of interest and criterion assessment
serves as an index measure.

Example: SAT/ACT scores are used by higher education institutions to
predict a student’s potential to succeed in their chosen career in college.

c. Construct Validity. This type of validity determines how well a test measures
what it is designed to measure. It is the ability of an assessment tool to
measure a theoretical or unobservable variable quality that it claims to test.
Is the test constructed in a way that it successfully tests what it claims to
test? Does the test measure the concept or construct that it is intended to
measure?

Construct validity has two sub-types which are:

c.1 Convergent validity establishes that a test has a high level of correlation
with another test that measures the same construct.

Example: If the instruments that measure self-concept and self-worth yield


scores that are close enough or with a high level of correlation, the two
measurements converge. The result indicating a high level of
correlation between two tests underpins their validity.

c.2 Discriminant validity shows low or no correlation between two tests


measuring different constructs.

Example: If scores measuring self-worth and depression do not converge, the


instruments used are measuring different constructs. The low or
absence of correlation indicates the discriminant validity between two
tests on self-worth and depression.

Factors to the Validity of a Test

The following are the factors that can lower the validity of a test:
o Ambiguity
o Unclear directions
o Errors in test scoring
o Inappropriate length of the test
o Flaws in test administration
o Poor construction of test items
o Identifiable clues or patterns for answers
o Inappropriate level of difficulty of test items


o Incorrect arrangement of test types and test items


o Insufficient time to answer the entire test
o Difficult to understand or incomprehensible wordings

Guidelines to Improve the Validity of a Test

To improve the validity of a test, the following guidelines are set:


o Review your stated objectives/outcomes.
o Make sure that your assessment method matches your set
objectives/outcomes.
o Obtain feedback from other teachers concerning the assessment
method, procedure, and format.
o Involve students. Have them look over the prepared instrument and
identify difficulties.
o Give a reasonable length of the test.
o Ensure the proper administration of the test.

Validity Coefficient

The validity coefficient is a statistical index used to report evidence of
validity for intended interpretations of test scores, defined as the
magnitude of the correlation between test scores and a criterion variable
(Encyclopedia of Measurement and Statistics, 2017). In most cases, it is the
computed value of rxy using the Pearson r formula. It is reported as a
number between 0 and 1.00 that indicates the magnitude of the relationship.
As a general rule, the higher the validity coefficient the more beneficial it is to
use the test. According to the US Department of Labor Employment and
Training Administration (1999), validity coefficients of r =.21 to r =.35 are
typical for a single test. Validities for selection systems that use multiple tests
will probably be higher because you are using different tools to
measure/predict different aspects of performance, where a single test is
more likely to measure or predict fewer aspects of total performance.
Validity coefficients can be interpreted as follows:

Table 6. Validity Coefficient Value


Validity coefficient value Interpretation
above .35 Very Beneficial
.21 - .35 Likely to be Useful
.11 - .20 Depends on Circumstances
below .11 Unlikely to be Useful

Example: Teacher Johnny wanted to know if the 30-item test he prepared to


measure his students’ mathematical ability is valid. He administered the
test to his 15 high school students and compared the results with
another test that is already recognized or acknowledged for its validity


and used it as a criterion. Is the test developed by teacher Johnny valid?


The following table shows the results of the two tests.

Determine the Validity Coefficient using the Pearson r:

Students   Scores in Math Test (x)   Scores in Criterion Test (y)   xy   x²   y²
1 14 12 168 196 144
2 20 27 540 400 729
3 25 25 625 625 625
4 16 17 272 256 289
5 30 30 900 900 900
6 23 25 575 529 625
7 10 16 160 100 256
8 28 29 812 784 841
9 20 19 380 400 361
10 18 23 414 324 529
11 9 7 63 81 49
12 27 29 783 729 841
13 29 26 754 841 676
14 24 25 600 576 625
15 5 15 75 25 225
∑x=298 ∑y=325 ∑xy=7121 ∑x2=6766 ∑y2=7715
rxy = [N∑xy − (∑x)(∑y)] ÷ √{[N∑x² − (∑x)²][N∑y² − (∑y)²]}

rxy = [15(7121) − (298)(325)] ÷ √{[15(6766) − (298)²][15(7715) − (325)²]}

rxy = (106815 − 96850) ÷ √[(12686)(10100)]

rxy = 9965 ÷ 11319.39

rxy = 0.88


Interpretation: The correlation coefficient is 0.88 which indicates that the


validity of the test is high and therefore, very beneficial to serve its
intended purpose.

Coefficient of Determination

Another way of interpreting the findings is to consider the squared


correlation coefficient (rxy)2 otherwise known as the coefficient of
determination. This is a statistical measurement that examines and indicates
how much variation in the criterion can be accounted for by the teacher-
made test (predictor).

Example: Using the preceding value of rxy = 0.88, the coefficient of
determination is 0.7744. This means that 77.44% of the variance in
students’ scores on the criterion can be attributed to the test, while
22.56% (100.00% − 77.44%) cannot be attributed to the test.
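
For readers who prefer to verify the computation by script rather than by hand, here is a minimal Python sketch (not part of the original text) that reproduces Teacher Johnny's validity coefficient and coefficient of determination from the raw scores tabulated above.

# Illustrative sketch: validity coefficient (Pearson r) and coefficient of
# determination for the 15 pairs of scores in the example above.
from math import sqrt

x = [14, 20, 25, 16, 30, 23, 10, 28, 20, 18, 9, 27, 29, 24, 5]   # teacher-made math test
y = [12, 27, 25, 17, 30, 25, 16, 29, 19, 23, 7, 29, 26, 25, 15]  # criterion test

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2, sum_y2 = sum(a * a for a in x), sum(b * b for b in y)

r = (n * sum_xy - sum_x * sum_y) / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 2))      # 0.88 -> "very beneficial" per Table 6
print(round(r * r, 2))  # 0.78 -> coefficient of determination (0.7744 if r is rounded to 0.88 first)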

2. Pilot testing. This step in the process means conducting a “test rehearsal” with a
try-out or sample group to test the validity of a test. A selected group of
learners tries answering a test and their obtained scores in the test provide
useful feedback prior to the deployment of the test to the target group of
examinees. The data gathered from this experimental procedure helps the
teacher to perform a preliminary analysis of the feasibility, practicality, and
usability of the techniques, methods, and tools used in a test. Early detection of
probable problems and difficulties can take place before the actual test
administration. This will eventually reveal aspects of the test and the conduct of
it that need to be refined or improved.

The following are the important considerations in the pilot testing of a test:
o Create a plan. Identify the smaller sample to be tested. Determine the
time duration, cost, correspondences, and the persons to collaborate
with in the conduct of a test.
o Prepare for a pilot test. Set the time and venue for the conduct of a pilot
test. Check on the condition of the test-takers. Try to eliminate aspects
in the environment that will threaten the validity of the test, such as the
lighting, ventilation, and orderliness of the test/examination room.
o Deploy the test. Regulate the test-takers and the condition of the venue.
Make sure that the test can obtain truthful and objective information.
Keep an eye on any form of distraction or interruption during the conduct
of the test. Address the specific needs of the test-takers. Provide clear
and complete answers in case they ask questions or clarifications about
the test.
o Assess and evaluate the pilot test. Reflect on the pilot-testing activity
that took place. Identify flaws in the process and devise a plan so they
can be avoided during the actual test. Organize and collate scores for
further analysis.


3. Item Analysis.
After the conduct of pilot-testing, the teachers score and look into the quality
of each item in the test. This procedure helps the teachers identify good items
by doing an item analysis. Item analysis is a process that examines student
responses to individual test items (questions) in order to assess the quality of
those items and the test as a whole (University of Washington-Office of
Educational Assessment, 2020). This is certainly vital in improving items in later
tests. Also, the teachers can clearly spot and eliminate ambiguous or misleading
items, design appropriate remediation, and construct a test bank for future use.

The most common method employed for item-analysis is the Upper-Lower


(U-L) index method (Stocklein cited in Gutierrez, 2020). This analysis provides
teachers with three types of information which include (a) difficulty index, (b)
discrimination index, and (c) distractor or option-response analysis.

The difficulty index is determined in terms of the proportion of students in


the upper 27% and lower 27% group who answered a test item correctly.

The steps (Stocklein cited in Gutierrez, 2020) are as follows:


1. Score the test papers and arrange the total scores from the highest to
lowest.
2. Sort the top and bottom 27% of the papers.
3. Tally the correct answers to each item by each student/test taker
in the upper 27% group.
4. Repeat Step 3 but this time, consider the lower 27%.
5. Get the percentage of the upper group that obtained the correct
answer and call this U.
6. Repeat Step 5 but this time consider the lower group and call this L.
7. Get the average percentage of U and L.
8. Get the difference between U and L percentages.

Formula:

Difficulty Index = (% of the upper group who got the item right
                    + % of the lower group who got the item right) ÷ 2

Table 7. Index Range for Level of Difficulty

Range of Difficulty Index   Description
0.00 – 0.20                 Very Difficult
0.21 – 0.40                 Difficult
0.41 – 0.60                 Moderately Difficult
0.61 – 0.80                 Easy
0.81 – 1.00                 Very Easy

To interpret the obtained difficulty index, use the table above. A good or retained item
must have acceptable indexes of both difficulty and discrimination. The acceptable index
of difficulty ranges from 0.41 to 0.60 while the acceptable index of discrimination ranges
from +0.20 to +1.00.


The discrimination index is the power of the test item to discriminate between those
who scored high (in the upper 27%) and those who scored low (in the lower 27%) on
the total test.

Formula:

Discrimination Index = % U – % L

Table 8. Index Range for Level of Discrimination

Index Range      Discrimination Level
0.19 and below   Poor item; should be eliminated or revised
0.20 – 0.29      Marginal item; needs some revision
0.30 – 0.39      Reasonably good item but possibly needs improvement
0.40 and above   Very good item

For interpretation and decision:

If an item obtained an acceptable level of difficulty (index ranges from 0.41 to 0.60)
and discrimination (index is 0.40 and above), it is considered a good item and must be
retained. If an item is unacceptable in either the difficulty or the discrimination index, it is
considered fair and must be revised. Finally, if an item is unacceptable in both
indices, it is considered a poor item and, therefore, must be rejected.

Table 9. Guide to Making Decisions for the Retention, Revision, and Rejection of Test Items

Difficulty Index   Discrimination Index
(0.41-0.60)        (0.40 and above)       Remarks     Decision
acceptable         acceptable             good item   retain
acceptable         not acceptable         fair        revise
not acceptable     acceptable             fair        revise
not acceptable     not acceptable         poor        reject


SAMPLE RESULTS OF ITEM ANALYSIS
(number of students tested – 50)

Item 1: Upper 27% correct = 12 of 14 (0.86); Lower 27% correct = 3 of 14 (0.21);
        Difficulty Index = 0.54 (moderately difficult); Discrimination Index = 0.65 (very good); Decision: retain
Item 2: Upper 27% correct = 14 of 14 (1.00); Lower 27% correct = 7 of 14 (0.50);
        Difficulty Index = 0.75 (easy); Discrimination Index = 0.50 (very good); Decision: revise
Item 3: Upper 27% correct = 7 of 14 (0.50); Lower 27% correct = 10 of 14 (0.71);
        Difficulty Index = 0.61 (easy); Discrimination Index = -0.21 (poor); Decision: reject
Item 4: Upper 27% correct = 12 of 14 (0.86); Lower 27% correct = 6 of 14 (0.43);
        Difficulty Index = 0.65 (easy); Discrimination Index = 0.43 (very good); Decision: revise
Item 5: Upper 27% correct = 10 of 14 (0.71); Lower 27% correct = 4 of 14 (0.29);
        Difficulty Index = 0.50 (moderately difficult); Discrimination Index = 0.42 (very good); Decision: retain

Upper 27% = 50 x 0.27 = 13.5 or 14

Lower 27% = 50 x 0.27 = 13.5 or 14

For item 1, Difficulty Index:

Under Upper 27%: U = 12 ÷ 14 = 0.86

Under Lower 27%: L = 3 ÷ 14 = 0.21

Difficulty Index = (0.86 + 0.21) ÷ 2 = 0.54

For item 1, Discrimination Index:

Discrimination Index = %Upper − %Lower = 0.86 − 0.21 = 0.65
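
The same two indices can be computed programmatically. Below is a minimal, illustrative Python sketch (not part of the original text) using the item 1 figures above.

# Illustrative sketch: difficulty and discrimination indices for item 1.
upper_correct, lower_correct = 12, 3   # correct answers in the upper and lower 27% groups
group_size = 14                        # 27% of 50 examinees, rounded up

u = upper_correct / group_size         # proportion correct in the upper group
l = lower_correct / group_size         # proportion correct in the lower group
difficulty = (u + l) / 2               # difficulty index
discrimination = u - l                 # discrimination index
print(round(difficulty, 2), round(discrimination, 2))  # 0.54 0.64 (0.65 in the text, which rounds u and l first)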

The distractor analysis examines how well the incorrect choices contribute to the
quality of an item in a multiple-choice test. It addresses the performance of the
incorrect response options called the distractors. A distractor should be plausible
enough that it can be chosen by examinees who are not sufficiently knowledgeable in
the content area. On the other hand, it must not be so attractive that it is chosen by a
greater proportion of the examinees than the keyed option (right answer). The
proportion of examinees that chose the keyed option must be more or less equivalent
to the p-value or difficulty index.


Example: Let us assume that 100 students took the test. If A is the key (right
answer) and the item difficulty is 0.70, then 70 students answered correctly.
What about the remaining 30 students and the effectiveness of the three
distractors?

For all three cases below, option A is the key and the difficulty index is 0.70
(70 of the 100 examinees chose A).

Case 1: A = 70, B = 30, C = 0, D = 0
        If the remaining 30 students chose B, then options C and D are useless in
        their role as distractors.

Case 2: A = 70, B = 15, C = 15, D = 0
        If the remaining students selected options B and C, then option D is a
        useless distractor.

Case 3: A = 70, B = 10, C = 10, D = 10
        This is an ideal situation because each of the three distractors was
        selected by 10 students. Options B, C, and D now appear as plausible
        distractors.
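
In practice, a distractor analysis amounts to tallying how many examinees chose each option. The following is a minimal, hypothetical Python sketch (the response list is fabricated to match the ideal Case 3 distribution above, and the 5% threshold follows the guideline given earlier in this chapter).

# Illustrative sketch: tally option choices for one multiple-choice item.
from collections import Counter

key = "A"
responses = ["A"] * 70 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10   # 100 examinees

counts = Counter(responses)
print(counts[key] / len(responses))          # 0.7 -> matches the difficulty index
for option in "BCD":
    share = counts[option] / len(responses)
    verdict = "plausible" if share >= 0.05 else "useless distractor"
    print(option, verdict)                   # all three distractors are plausible here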

4. Reliability Testing

Effective assessments are dependable and consistent, yielding reliable evidence
(Suskie, 2018). Reliability refers to the consistency of a measure, which can be
affected by the clarity of the assessment tool and the capability of those who use
it. Reliable assessment tools generate repeatable and consistent results over
time (test-retest), across items (internal consistency), and across different raters
or evaluators (inter-rater reliability). If the test tool or instrument is unreliable, it
cannot produce a valid outcome.

a. Test-retest reliability indicates the repeatability of test scores when


administered twice to the same group of examinees with a time interval in
between (e.g. two-week time interval). The two sets of scores are correlated
using the Pearson Product Moment Coefficient of Correlation (r) that
establishes the measure of reliability. The reliability coefficient is expressed
in a range of scores from 0.00 to 1.00 which are denoted as follows: (see
Table 10)


Table 10. Level of Reliability Coefficient


(Navarro & Santos, 2012)

Reliability Coefficient Interpretation


0.90 and above  Excellent reliability
 The test is at the level of best-standardized
tests
0.81-0.90  Very good for a classroom test
0.71-0.80  Good for a classroom test
 There are probably a few items that need to
be improved
0.61-0.70  Somewhat low
 The test needs to be supplemented by other
measures (e.g. more tests) to determine
grades
 There are probably some items which could
be improved.
0.51-0.60  Suggests need for revision of the test, unless it is
quite short (ten or fewer items)
 The test definitely needs to be supplemented
by other measures (e.g. more tests for
grading)
0.50 and below  Questionable reliability
 The test will not contribute heavily to the
course grade
 It needs revision

Example: Professor Oz constructed a reading test and subjected it to a test-


retest method using 15 students to ensure its reliability. The table shows the
scores obtained by the examinees during the first and second administration
of the test observing a 15-day interval in between. Is the test reliable?

The formula for the Pearson r:

rxy = [N∑xy − (∑x)(∑y)] ÷ √{[N∑x² − (∑x)²][N∑y² − (∑y)²]}


Students   Test Scores, 1st Run (x)   Re-test Scores, 2nd Run (y)   xy   x²   y²
1 17 22 374 289 484
2 49 28 1372 2401 784
3 26 12 312 676 144
4 32 30 960 1024 900
5 12 40 480 144 1600
6 27 31 837 729 961
7 18 17 306 324 289
8 38 27 1026 1444 729
9 33 40 1320 1089 1600
10 24 30 720 576 900
11 9 18 162 81 324
12 34 35 1190 1156 1225
13 33 41 1353 1089 1681
14 46 38 1748 2116 1444
15 29 39 1131 841 1521
∑x=427 ∑y=448 ∑xy=13291 ∑x2=13979 ∑y2=14586

rxy = [N∑xy − (∑x)(∑y)] ÷ √{[N∑x² − (∑x)²][N∑y² − (∑y)²]}

rxy = [15(13291) − (427)(448)] ÷ √{[15(13979) − (427)²][15(14586) − (448)²]}

rxy = (199365 − 191296) ÷ √[(27356)(18086)]

rxy = 8069 ÷ 22243.22

rxy = 0.36

The obtained rxy = 0.36, which is lower than 0.50, indicates that the test has a
questionable level of reliability. It will not contribute to successfully meeting
the desired course outcomes and therefore needs to be thoroughly reviewed
and revised.
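
The same coefficient can be obtained directly from the two score columns. Here is a minimal Python sketch (not part of the original text) using the standard library; statistics.correlation requires Python 3.10 or later.

# Illustrative sketch: test-retest reliability for Professor Oz's data.
from statistics import correlation  # Pearson r; available in Python 3.10+

first_run = [17, 49, 26, 32, 12, 27, 18, 38, 33, 24, 9, 34, 33, 46, 29]
second_run = [22, 28, 12, 30, 40, 31, 17, 27, 40, 30, 18, 35, 41, 38, 39]

print(round(correlation(first_run, second_run), 2))  # 0.36 -> questionable reliability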


b. Internal consistency gauges how well the items in an instrument can produce
consistent or similar results on multiple items measuring the same construct.
If people’s responses to the different items are not correlated with each
other, then it would no longer make sense to claim that test items are all
measuring the same underlying construct.

Another approach to measuring internal consistency is the split-half


correlation. In this method, all items that measure the same thing are
randomly split into two such as the first and second halves of the items or the
even- and odd-numbered items. The score is computed for each set of items,
and the relationship between the two sets of scores is examined. The
Pearson r can be applied to find the correlation coefficient between the two
halves.
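
A minimal Python sketch of the split-half idea follows (not part of the original text; the 0/1 responses are hypothetical, the odd- and even-numbered items form the two halves, and Pearson r is taken from the standard library, Python 3.10+).

# Illustrative sketch: split-half reliability on toy 0/1 item responses.
from statistics import correlation  # Python 3.10+

responses = [          # hypothetical data: one row per examinee, eight items each
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
]

odd_half = [sum(row[0::2]) for row in responses]   # scores on items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in responses]  # scores on items 2, 4, 6, 8
print(round(correlation(odd_half, even_half), 2))  # about 0.57 for this toy data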

The Kuder-Richardson Formula 20, or KR-20 also checks the internal


consistency of a test instrument with binary or dichotomous choices such as
true or false, right or wrong. It is similar to performing the split-half method.
A correct answer is scored 1 while 0 is assigned for an incorrect answer.

Formula:

ρKR20 = [k ÷ (k − 1)] × [1 − (Σpjqj ÷ σ²)]

Where:
k = number of test questions/items
pj = proportion of examinees passing the item
qj = proportion of examinees failing the item
σ² = variance of the total scores of all the people taking the test

Example: A true-false test with 10 questions is administered to 17 students. The


results are listed in the table that follows. Determine the reliability of the
questionnaire using Kuder and Richardson Formula 20.


Kuder and Richardson Formula 20


Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Total
1 1 1 1 1 1 1 1 1 1 1 10
2 1 1 1 1 1 1 1 1 0 0 8
3 1 1 1 1 1 1 1 1 0 0 8
4 1 1 1 1 1 1 1 1 1 0 9
5 1 1 1 1 1 1 1 1 1 0 9
6 0 0 1 1 1 1 1 1 0 0 6
7 0 0 0 0 0 0 1 0 0 1 2
8 0 1 1 1 1 1 1 1 1 0 8
9 0 1 1 1 1 1 1 1 1 0 8
10 0 1 0 0 1 0 0 0 0 0 2
11 0 1 1 1 1 1 1 1 1 0 8
12 0 0 0 0 0 1 1 1 1 1 5
13 1 1 1 1 1 0 1 1 1 1 9
14 1 1 1 1 1 1 0 1 1 1 9
15 1 0 0 0 1 1 0 1 1 0 5
16 1 1 1 0 0 0 1 1 1 1 7
17 1 1 1 0 0 0 1 0 1 1 6
Total 10 13 13 11 13 12 14 14 12 7 119

p 0.588 0.765 0.765 0.647 0.765 0.706 0.824 0.824 0.706 0.412
q 0.412 0.235 0.235 0.353 0.235 0.294 0.176 0.176 0.294 0.588
pq 0.242 0.18 0.18 0.228 0.18 0.208 0.145 0.145 0.208 0.242 1.958

k (number of items)                 10
σ² (variance of total scores)       5.294
Σpq                                 1.958

KR20 = (10 ÷ 9) × (1 − 1.958 ÷ 5.294) ≈ 0.70

Cell   Entity   Formula
B26    k        =COUNTA(B2:K2)   (assuming the item columns are B to K; k counts the 10 items, not the 17 examinees)
L24    ∑pq      =SUM(B24:K24)
B27    σ²       =VARP(L3:L19)   (population variance of the total scores)
B28    KR20     =(B26/(B26-1))*(1-(L24/B27))

The yielded ρKR20 value of about 0.70 means that the reliability of the instrument is
somewhat low but is already within the acceptable range (0.60 or higher).
The test needs, however, to be supplemented by other measures to serve as the
basis of grades, and some items probably need to be improved.
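
The KR-20 computation can also be checked without a spreadsheet. Below is a minimal Python sketch (not part of the original text) that applies the formula to the 17 x 10 response matrix above, taking k as the number of items and using the population variance of the total scores.

# Illustrative sketch: KR-20 for the true-false data above.
scores = [  # 1 = correct, 0 = incorrect; one row per examinee (from the table)
    [1,1,1,1,1,1,1,1,1,1], [1,1,1,1,1,1,1,1,0,0], [1,1,1,1,1,1,1,1,0,0],
    [1,1,1,1,1,1,1,1,1,0], [1,1,1,1,1,1,1,1,1,0], [0,0,1,1,1,1,1,1,0,0],
    [0,0,0,0,0,0,1,0,0,1], [0,1,1,1,1,1,1,1,1,0], [0,1,1,1,1,1,1,1,1,0],
    [0,1,0,0,1,0,0,0,0,0], [0,1,1,1,1,1,1,1,1,0], [0,0,0,0,0,1,1,1,1,1],
    [1,1,1,1,1,0,1,1,1,1], [1,1,1,1,1,1,0,1,1,1], [1,0,0,0,1,1,0,1,1,0],
    [1,1,1,0,0,0,1,1,1,1], [1,1,1,0,0,0,1,0,1,1],
]

k = len(scores[0])                     # 10 items
n = len(scores)                        # 17 examinees
totals = [sum(row) for row in scores]
mean_total = sum(totals) / n
variance = sum((t - mean_total) ** 2 for t in totals) / n                        # 5.294
sum_pq = sum((c / n) * (1 - c / n) for c in (sum(col) for col in zip(*scores)))  # 1.958

kr20 = (k / (k - 1)) * (1 - sum_pq / variance)
print(round(kr20, 2))                  # about 0.70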

Cronbach’s alpha is the most common measure of internal consistency when there
are multiple options given to answer an item in the test. An instrument that uses
the Likert scale to do assessments in the affective domain can apply this formula to
ensure its ability to draw out reliable answers from the respondents. The formula
for Cronbach’s coefficient alpha is:

α = [k ÷ (k − 1)] × [1 − (Σσᵢ² ÷ σₜ²)]

Where:
k = the number of items in a measure
σᵢ² = the variance of each individual item
σₜ² = the variance of the total scores on the index

This computation is more easily and efficiently done using Microsoft Excel.

Example: The researcher used the Likert scale for the respondents to rate their
attitude toward Mathematics as a subject. The respondents showed their
agreement or disagreement on the different items raised in the questionnaire. The
responses of the participants were coded from the lowest (strongly disagree) with
an assigned score of 1 to the highest (strongly agree) with a designated score of 5.
The table reflects the answers of the 19 respondents on the 5-item test.

Cronbach’s Alpha using Microsoft Excel


Items
Respondents Q1 Q2 Q3 Q4 Q5
1 4 4 4 4 4
2 4 4 4 4 4
3 3 3 3 3 3
4 4 5 4 5 5
5 2 2 2 2 2
6 4 4 4 5 5
7 5 2 3 4 5
8 4 4 5 4 5
9 4 3 2 5 3
10 4 4 4 5 5
11 5 2 3 4 5
12 2 3 3 4 5
13 3 3 3 3 3
14 5 4 4 4 5
15 4 4 4 4 5
16 2 3 1 4 3
17 4 4 4 5 5
18 5 5 5 5 5
19 5 5 5 5 5


Step 1. On the top menu, click Data > Data Analysis > Anova: Two-Factor Without
Replication > OK.

Step 2. The Anova: Two-Factor Without Replication dialogue box will appear. Select the
Input Range (highlight the scores), choose New Worksheet Ply, and click OK.


Step 3. The generated output will appear in a new sheet.

Step 4. Apply the formula below, where MS_rows is the mean square for the rows (respondents) and MS_error is the residual (error) mean square from the generated ANOVA output:

α = 1 − (MS_error / MS_rows)
  = 1 − (0.3889 / 3.2515)
  = 1 − 0.1196
  = 0.8804

The obtained alpha of 0.88 indicates that the instrument has a very high level of reliability in assessing the students’ attitude toward Math as a subject.
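
The same coefficient can also be computed directly from the item variances, without generating the ANOVA table. Below is a minimal Python sketch offered as an alternative to the Excel route described above; the function name cronbach_alpha and the data layout are illustrative assumptions, with the 19 × 5 matrix of Likert ratings stored as a list of lists, one row per respondent.

```python
# Minimal Cronbach's alpha sketch: rows are respondents, columns are Likert-scored items.
def cronbach_alpha(ratings):
    k = len(ratings[0])                     # number of items (5 in the example above)

    def variance(values):                   # sample variance, as Excel's VAR
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_variances = [variance([row[j] for row in ratings]) for j in range(k)]
    total_variance = variance([sum(row) for row in ratings])

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

Applied to the 19 responses above, this yields approximately 0.88, agreeing with the ANOVA-based result. Using the sample variance for both the item and total variances gives the same alpha as using the population variance, since the 1/(n − 1) factors cancel in the ratio.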
c. Inter-rater reliability measures the extent to which two or more raters or
examiners agree in their judgments or scoring of a test. This method addresses the
issue of consistency in the implementation of a grading system. The basic measure
for inter-rater reliability is a percent agreement between two or more raters.


For example, two raters observed five learners who executed the basic steps
in a folk dance. The table shows how each demonstration of skill was rated.
Students Rater 1 Rater 2
1 2 3
2 4 4
3 1 4
4 5 5
5 3 3

The two raters agreed on 3 out of 5 scores. Percent agreement is determined by (a) counting the number of ratings in agreement, which is 3; (b) counting the total number of ratings, which is 5; (c) dividing the number of ratings in agreement by the total number of ratings (3/5); and (d) converting the result (0.60) to a percentage (60%). The 60% agreement is at an acceptable level of reliability.

If there are multiple raters, add columns for each pairing of raters and for the resulting agreement, as in the example below.

Example:

Students Rater 1 Rater 2 Rater 3 R1&R2 R1&R3 R2&R3 Agreement
1 4 3 4 0 1 0 1/3
2 1 2 3 0 0 0 0/3
3 2 4 4 0 0 1 1/3
4 5 5 5 1 1 1 3/3
5 3 3 4 1 0 0 1/3

Calculate the mean of the fractions in the agreement column. The obtained result (40%, the mean of 1/3, 0/3, 1/3, 3/3, and 1/3) indicates a low level of inter-rater reliability. Revision of the test and the adoption of additional measures are called for to provide a valid assessment of students’
performance.
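
Where several raters or many students are involved, the pairwise agreement can be tallied programmatically instead of by hand. Here is a minimal Python sketch, offered only as an illustration of the mean-of-pairwise-agreements logic described above; the function name percent_agreement and the data layout are assumptions, with each row holding one student’s ratings from all raters.

```python
from itertools import combinations

# Minimal percent-agreement sketch: for each student, count how many rater pairs
# gave exactly the same rating, then average the per-student fractions.
def percent_agreement(ratings):
    fractions = []
    for row in ratings:                               # one student's ratings from all raters
        pairs = list(combinations(row, 2))            # every possible pair of raters
        agree = sum(1 for a, b in pairs if a == b)
        fractions.append(agree / len(pairs))
    return sum(fractions) / len(fractions)

# The three-rater example above: prints approximately 0.40 (i.e., 40% agreement).
print(percent_agreement([[4, 3, 4], [1, 2, 3], [2, 4, 4], [5, 5, 5], [3, 3, 4]]))
```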


Activity A. Preparing a Table of Specification. Revisit the answers you indicated in Activity I B in Module 2. Finalize your choice of topic related to your field of discipline and your target learning outcomes. Based on the finalized topic and the formulated learning outcomes, prepare a Table of Specification using format 4. Write your output on a separate sheet observing the following layout:
a. Orientation – portrait
b. Margin – 1” at the top, bottom, right, and left
c. Size – A4 (8.3 by 11.7 in)
d. Spacing – single

Activity B. Test Design and Construction. Based on the chosen topic and the
specified learning outcomes, construct a test (unit or periodic test). Use a separate
sheet with the layout indicated in activity A. (1 point each)

Exercises.

I. Identification. Specify on the blank before each number the type of objective test being described in each of the items below.

_______________ 1. This test asks examinees to perform a task or activity.


_______________ 2. A test consists of incomplete statements to be filled in by the
examinees.
_______________ 3. A test that presents a question to be answered by the
examinees in a word or phrase.
________________4. An objective test that requires examinees to choose only one
correct answer from the three or more options provided.
_______________ 5. This test gives a situation or a question that can be addressed
by having students construct a rather long response up to
several paragraphs.


II. Short Answer. Study the data in the table below and answer the questions that follow. Write your answer in the space provided. (1 point each)

Using the results of a 5-item test given to 60 students, Teacher Trixie wanted to determine which items need to be retained, revised, or rejected. She encoded the data and determined the number of students in the upper 27% and lower 27% groups who answered each of the 5 items correctly.

Item Analysis Results

Item Number Upper 27% Lower 27%


1 16 16
2 10 3
3 4 0
4 15 10
5 12 4

1. Which item/s has/have an acceptable level of difficulty?


__________________________

2. Which item does/do NOT have an acceptable discrimination index?


__________________________

3. Which item/s need/s to be retained?


__________________________

4. Which item/s need/s to be revised?


__________________________

5. Which item/s need/s to be rejected?


__________________________


III. Identifying errors. Each item below contains blunders/errors/violations based on the guidelines of test construction. Identify and reflect them in the box provided at the right. (3 pts. each)

SENTENCE COMPLETION
Fill in the blanks with the correct word to complete the
statement.
1. A test is a ___________ used to establish the
quality, __________, or reliability of ________,

TRUE OR FALSE
Tell whether the statement is true or false.
2. Scoring an essay test is always difficult.

MULTIPLE CHOICE
Choose the best answer.
3. Which is the best way of dealing with discipline
problem in the classroom?
A. Always give test
B. Impose punishment
C. Talking with the child in private
D. Must involve the parents

ESSAY
Construct an essay to answer the question.
4. List the 7-step path to making “ethical decisions.”
List them in their correct progressive order.


MATCHING TYPE
Match column A with B. Letter only.

A B
___1. Multiple choice A. Most difficult to score
___2. True-False B. Students can simply make guesses
___3. Short Answer C. measures greatest variety of
learning outcomes
D. Least useful for educational
diagnosis


IV. Critiquing. In 2 to 3 sentences, state your comment on the validity of the assessment practices of the teachers reflected in the following scenarios. (5 points each)

Scoring Rubric
Level Score Description
Exemplary 5 points  The comment/remark is accurate with all
main points included.
 There is a clear and logical presentation of
ideas.
 Sentences are well-structured and free
from grammatical and/or syntactic errors.
Very Good 4 points  The comment/remark is accurate but
there are minor problems in logic and
construction.
 Few grammatical/syntactic errors are
found.
Good 3 points  One or two major points are missing in the
comment/remark.
 Few grammatical/syntactic errors are
found.
Needs 2 points  The answer does not convey a full
Improvement understanding of the lesson.
 The quality of writing is inferior.
Unsatisfactory 1 point  The answer is inaccurate or deviates from
what is asked.
 Sentences are disorganized and contain
major grammatical/syntactic errors.

Scenario 1. Teacher Luna constructed a 50-item summative test in Filipino for Grade Six pupils. She prepared a TOS according to the pre-determined topics outlined in the course program. Because she was given a special assignment, she missed delivering almost half of the topics she was supposed to cover. To make up for her absences, she distributed handouts so students could proceed despite her failure to hold regular classes. Will the test provide a valid result?

Please write your answer here.


Scenario 2. Teacher May set this learning outcome for her students: at the end of the lesson, the students are able to demonstrate proficiency in writing and verbal skills. Considering the time and effort she would have to exert in checking the papers, she finally opted to give a short-answer test in which students would still be required to construct short sentences. Does her assessment method match her target outcome? Will she be able to measure what she is supposed to measure?

Please write your answer here.

Scenario 3. Sir Ben gave a 120-item multiple-choice test in Math for his college students to answer in one hour. Due to lack of time, more than 50% of the students were not able to finish. The students appealed that the remaining unanswered items not be counted and that their scores be based only on the number of items they were able to answer. Sir Ben finally conceded to the students’ request. Is his decision proper? Will it not invalidate the test?

Please write your answer here.


V. Problem-Solving. Perform what is asked.

Teacher Marsh prepared a 30-item test to measure the students’ level of comprehension. She wanted to find out if the instrument she used could elicit stable results over a period of time, so she administered the test twice to the same group of learners with a gap of 15 days in between. Using the data in the table, calculate the reliability coefficient of the instrument and state your interpretation.
You will be graded based on the following:
Process 5 points (steps)
Answer 3 points (result of computation)
Interpretation 2 points (description based on reliability index)
Total 10 points

Students 1st Run 2nd Run
1 22 21
2 13 19
3 24 24
4 25 19
5 16 18
6 3 12
7 23 26
8 21 25
9 22 25
10 15 20
11 16 19
12 19 18
13 21 22
14 3 14
15 4 9
16 2 12
17 16 23
18 8 13
19 26 24
20 24 30
21 30 30
22 16 14

Please show your process here.

Answer: _________________
Interpretation: _____________________________
_____________________________


VI. Illustrating the Concept. Present a creative and inventive illustration that
depicts the test development process. Label every phase and the
corresponding activities done by the teacher/assessor for a clearer
representation of the process. (10 pts.)

Criteria for scoring:

Content – maximum of 5 points
o Accurate and reflects complete understanding of the topic
Presentation – maximum of 3 points
o Neat, organized, logical, creative, and original
Mechanics – maximum of 2 points
o Free of spelling, grammar, and punctuation errors

Please write your answer here.


References:

Encyclopedia of Measurement and Statistics. (2017). Validity coefficient. Retrieved September 4, 2020, from http://methods.sagepub.com/Reference//encyclopedia-of-measurement-and-statistics/n470.xml

Gabuyo, Y. (2012). Assessment of Learning 1. Manila: Rex Book Store.

Gutierrez, D. (2007). Assessment of Learning Outcomes (Cognitive Domain). Malabon, Metro Manila: Kerusso Publishing House.

Kubiszyn, T., & Borich, G. (2007). Educational Testing and Measurement: Classroom Application and Practice (8th ed.). Wiley Jossey-Bass.

Morrow, J., et al. (2016). Measurement and Evaluation in Human Performance (5th ed.). USA: Thomson-Shore, Inc.

Navarro, R., & Santos, R. (2012). Assessment of Learning Outcomes (2nd ed.). Manila, Philippines: Lorimar Publishing, Inc.

Osterlind, S. J. (1989). What is constructing test items? In Constructing Test Items. Evaluation in Education and Human Services, vol. 25. Springer, Dordrecht. Retrieved August 21, 2020, from https://link.springer.com/chapter/10.1007%2F978-94-009-1071-3_1

Suskie, L. (2018). Assessing Student Learning: A Common Sense Guide. USA: Jossey-Bass.

University of Washington, Office of Educational Assessment. (2020). Understanding item analysis. Retrieved September 5, 2020, from https://www.washington.edu/assessment/scanning-scoring/scoring/reports/item-analysis/

US Department of Labor, Employment and Training Administration. (1999). Understanding test quality: Concepts of reliability and validity. Retrieved September 4, 2020, from https://hr-guide.com/Testing_and_Assessment/Reliability_and_Validity.htm

