

THE ADELPHI COLLEGE INC.

New Street East, Lingayen, Pangasinan


College of Education
First Semester 2022-2023

Ramos, Wilma O.
BEED / Educ-9 Assessment of Learning 1
CHAPTER VI – ITEM ANALYSIS AND VALIDATION
Validity
Reliability
INTRODUCTION

The goal of validation is to ascertain the properties of the entire test itself, namely, the validity and reliability of the test. This is done after performing the item analysis and revising the items that require revision. Validation is the process of gathering and examining evidence to support the relevance and usefulness of the test.

These are two concepts every educator must understand:
a.) Validity
b.) Reliability

Validity
Validity is the extent to which a test measures what it purports to measure; alternatively, it refers to the
appropriateness, correctness, meaningfulness and usefulness of the specific decisions a teacher makes based on the
test results. These two definitions differ in that the first refers to the test itself, while the second refers to the
decisions the teacher makes based on the test.
A teacher who conducts test validation might want to gather different kinds of evidence. There are
essentially three main types of evidence that may be collected: content-related evidence of validity, criterion-
related evidence of validity and construct-related evidence of validity. Content-related evidence of validity refers to
the content and format of the instrument. How appropriate is the content? How comprehensive? Does it logically
get at the intended variable? How adequately does the sample of items or questions represent the content to be
assessed?
Criterion-related evidence of validity refers to the relationship between scores obtained using the
instrument and scores obtained using one or more other tests (often called the criterion). How strong is this
relationship? How well do these scores estimate present performance or predict future performance of a certain kind?
Construct-related evidence of validity refers to the psychological construct or characteristic being
measured by the test. How well does a measure of the construct explain differences in the behavior of individuals
or their performance on a particular task?
The usual procedure for determining content validity may be described as follows: The teacher writes out
the objectives of the test based on the table of specifications and then gives these together with the test to at least
two (2) experts along with a description of the intended test takers. The experts look at the objectives, read over the
items in the test and place a check mark in front of each question or item that they feel does not measure one or
more objectives. They also place a check mark in front of each objective not assessed by any item in the test. The
teacher then rewrites any item so checked and resubmits it to the experts, and/or writes new items to cover those
objectives not heretofore covered by the existing test. This continues until the experts approve of all items and also
until the experts agree that all of the objectives are sufficiently covered by the test.
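
For teachers who keep the table of specifications in electronic form, the bookkeeping in this review loop can be automated. The following Python snippet is a minimal sketch, not part of the source text: the item IDs, objective names, and item-to-objective mapping are all hypothetical, standing in for the experts' judgments.

```python
# Hypothetical sketch of the coverage check behind content validation.
# Item IDs, objective names, and the item-to-objective mapping are invented.

item_objectives = {
    "item_1": ["add fractions"],
    "item_2": ["add fractions", "simplify fractions"],
    "item_3": [],  # experts judged this item to measure no objective
}
all_objectives = ["add fractions", "simplify fractions", "compare fractions"]

covered = {obj for objs in item_objectives.values() for obj in objs}

# Items the experts check-marked (measure no objective) -> rewrite these.
flagged_items = [item for item, objs in item_objectives.items() if not objs]
# Objectives check-marked (assessed by no item) -> write new items for these.
uncovered = [obj for obj in all_objectives if obj not in covered]

print("Items to rewrite:", flagged_items)   # ['item_3']
print("Objectives to cover:", uncovered)    # ['compare fractions']
```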
In order to obtain evidence of criterion-related validity, the teacher usually compares scores on the test in
question with the scores on some other independent criterion test which presumably already has high validity. For
example, if a test is designed to measure mathematics ability of students and it correlates highly with a
standardized mathematics achievement test (external criterion), then we say we have high criterion-related
evidence of validity. In particular, this type of criterion-related validity is called its concurrent validity. Another
type of criterion-related validity is called predictive validity wherein the test scores in the instrument are correlated
with scores on a later performance (criterion measure) of the students.
Example: The mathematics ability test constructed by the teacher may be correlated with the students' later
performance in a Division-wide mathematics achievement test.
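
As a rough illustration (not from the source text), the coefficient can be computed directly from paired score lists. The sketch below uses Python's statistics.correlation, available from Python 3.10 onward; the scores themselves are made up.

```python
# Sketch: criterion-related (concurrent) validity as a Pearson correlation
# between a teacher-made test and an external criterion. Scores are hypothetical.
from statistics import correlation  # Python 3.10+

teacher_test = [78, 85, 62, 90, 71, 88, 67, 95, 74, 81]
criterion    = [75, 88, 60, 92, 70, 85, 65, 97, 72, 80]  # standardized test

r = correlation(teacher_test, criterion)
print(f"criterion-related validity coefficient: r = {r:.2f}")
```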
Apart from the use of the correlation coefficient in measuring criterion-related validity, Gronlund suggested
using the so-called expectancy table. This table is easy to construct and consists of the test (predictor) categories
listed on the left-hand side and the criterion categories listed horizontally along the top of the chart. For example,
suppose that a mathematics achievement test is constructed and the scores are categorized as high, average, and
low. The criterion measure used is the final average grade of the students in high school: Very Good, Good, and
Needs Improvement.

                            Grade Point Average
Test Score     Very Good       Good      Needs Improvement
High               20            10               5
Average            10            25               5
Low                 1            10              14

The expectancy table shows that there were 20 students who got high test scores and were subsequently
rated Very Good in terms of their final grades; 25 students got average scores and were subsequently rated Good in
their finals; and finally, 14 students obtained low test scores and were later graded as Needs Improvement. The
evidence for this particular test tends to indicate that students getting high scores on it would later be rated Very
Good, students getting average scores would later be rated Good, and students getting low scores would later be
graded as Needs Improvement.
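
Since an expectancy table is simply a cross-tabulation of paired categories, it can be built mechanically. The sketch below is illustrative only: the (test category, grade category) pairs are fabricated so that the counts match the example above.

```python
# Sketch: build an expectancy table by cross-tabulating test-score categories
# against criterion categories. Pairs are fabricated to match the example counts.
from collections import Counter

pairs = (
    [("High", "Very Good")] * 20 + [("High", "Good")] * 10
    + [("High", "Needs Improvement")] * 5
    + [("Average", "Very Good")] * 10 + [("Average", "Good")] * 25
    + [("Average", "Needs Improvement")] * 5
    + [("Low", "Very Good")] * 1 + [("Low", "Good")] * 10
    + [("Low", "Needs Improvement")] * 14
)

counts = Counter(pairs)
rows = ["High", "Average", "Low"]
cols = ["Very Good", "Good", "Needs Improvement"]

print(f"{'Test Score':<12}" + "".join(f"{c:>20}" for c in cols))
for row in rows:
    print(f"{row:<12}" + "".join(f"{counts[(row, c)]:>20}" for c in cols))
```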
We will not be able to discuss the measurement of construct-related validity in this book, since the methods
to be used require sophisticated statistical techniques falling under the category of factor analysis.

Reliability
Reliability refers to the consistency of the scores obtained: how consistent they are for each individual
from one administration of an instrument to another and from one set of items to another. We have already given
formulae for computing the reliability of a test; for internal consistency, for instance, we could use the split-half
method or the Kuder-Richardson formulae (KR-20 or KR-21).
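
As an illustration of one of these formulae, the following sketch computes KR-20 for right/wrong (0/1) items, using KR-20 = (k / (k - 1)) × (1 - Σ p_i q_i / σ²), where k is the number of items, p_i is the proportion of students answering item i correctly, q_i = 1 - p_i, and σ² is the variance of the students' total scores. The response matrix is invented for the example.

```python
# Sketch: KR-20 internal-consistency reliability for dichotomous (0/1) items.
# KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance_of_totals)
# The response matrix (rows = students, columns = items) is hypothetical.
from statistics import pvariance

responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
]
n, k = len(responses), len(responses[0])

totals = [sum(row) for row in responses]    # each student's total score
var_total = pvariance(totals)               # variance of the total scores

sum_pq = 0.0
for i in range(k):
    p = sum(row[i] for row in responses) / n  # proportion correct on item i
    sum_pq += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)
print(f"KR-20 reliability = {kr20:.2f}")    # about 0.53 for this tiny example
```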
Reliability and validity are related concepts. If an instrument is unreliable, it cannot yield valid outcomes.
As reliability improves, validity may improve (or it may not). However, if an instrument is shown scientifically to
be valid, then it is almost certain that it is also reliable.

The following table gives a standard interpretation of reliability coefficients, followed almost universally in educational tests and measurement.

Reliability       Interpretation
.90 and above     Excellent reliability; at the level of the best standardized tests.
.80 - .90         Very good for a classroom test.
.70 - .80         Somewhat low. This test needs to be supplemented by other measures
                  (e.g., more tests) to determine grades. There are probably some items
                  which could be improved.
.60 - .70         Suggests need for revision of the test, unless it is quite short (ten
                  or fewer items). The test definitely needs to be supplemented by other
                  measures (e.g., more tests) for grading.
.50 or below      Questionable reliability. This test should not contribute heavily to
                  the course grade, and it needs revision.
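
If the coefficient is computed by program, the bands in this table can be encoded as a small lookup. The helper below is a hypothetical sketch; in particular, treating each boundary value (e.g., exactly .90) as belonging to the higher band is an assumption the table leaves open.

```python
# Sketch: map a reliability coefficient to the interpretation bands above.
# Treating each boundary value as belonging to the higher band is an assumption.
def interpret_reliability(r: float) -> str:
    if r >= 0.90:
        return "Excellent; at the level of the best standardized tests"
    if r >= 0.80:
        return "Very good for a classroom test"
    if r >= 0.70:
        return "Somewhat low; supplement with other measures for grading"
    if r >= 0.60:
        return "Needs revision, unless the test is quite short"
    return "Questionable; should not contribute heavily to the course grade"

print(interpret_reliability(0.53))  # e.g., the KR-20 value computed earlier
```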

REFERENCES:
• Navarro, Rosita L., Ph.D. and Santos, Rosita G., Ph.D. Assessment of Learning Outcomes (Second Edition).
