Validity and Reliability


VALIDITY AND RELIABILITY
CHAPTER 4
OVERVIEW

It is not unusual for teachers to receive complaints or comments from students regarding tests and other assessments. For one, there may be an issue concerning the coverage of the test. Students may have been tested on areas that were not part of the content domain. The test items may also be too complex or inconsistent with the performance verbs in the learning outcomes.
Validity alone does not ensure high quality assessment. Reliability of test results should also be checked. Questions on reliability surface if there are inconsistencies in the results when tests are administered over different time periods, samples of questions or groups. Both validity and reliability are considered when gathering information or evidence about student achievement. This chapter discusses the distinctions between the two.
VALIDITY
Validity is a term derived from the Latin word validus, meaning strong. In the context of assessment, an assessment is deemed valid if it measures what it is supposed to measure. In contrast to what some teachers believe, validity is not a property of the test. It pertains to the accuracy of the inferences teachers make about students based on the information gathered from assessment (McMillan, 2007; Fives & DiDonato-Barnes, 2013). This implies that the conclusions teachers come up with in their evaluation of student performance are valid if there is strong and sound evidence of the extent of students’ learning. Decisions also include those about instruction and classroom climate (Russell & Airasian, 2012).

An assessment is valid if it measures a student’s actual knowledge and performance with respect to the intended outcomes, and not something else. It is representative of the area of learning or content of the curricular aim being assessed (McMillan, 2007; Popham, 2011). For instance, an assessment purportedly for measuring arithmetic skills of Grade 4 pupils is invalid if used for Grade 1 pupils because of issues in content (test content evidence) and level of performance (response process evidence). A test that measures recall of mathematical formulae is invalid if it is supposed to assess problem-solving. This is an example of a validity problem, particularly concerning content-related evidence. There are two other sources of information that can be used to establish validity: criterion-related evidence and construct-related evidence. Each of these shall be presented here.
A. CONTENT-RELATED EVIDENCE
Content-related evidence for validity pertains to the extent to which the test covers the entire domain of
content. If a summative test covers a unit with four topics, then the assessment should contain items from
each topic. This is done through adequate sampling of content. A student’s performance in the test may be
used as an indicator of his/her content knowledge. For instance, if a grade 4 pupil was
able to correctly answer 80% of the items in a science test about matter, the teacher may infer that the pupil
knows 80% of the content area.

In the previous chapter, we talked about the appropriateness of assessment methods to learning outcomes. A test that appears to adequately measure the learning outcomes and content is said to possess FACE VALIDITY. As the name suggests, it looks at the superficial face value of the instrument. It is based on the subjective opinion of the one reviewing it. Hence, it is considered non-systematic or non-scientific. A test that was prepared to assess the ability of pupils to construct simple sentences with correct subject-verb agreement has face validity if the test looks like an adequate measure of that cognitive skill.
Another consideration related to content validity is INSTRUCTIONAL VALIDITY - the extent to which an assessment is systematically sensitive to the nature of instruction offered. This is closely related to instructional sensitivity, which Popham (2006, p.1) defined as the “degree to which students’ performances on a test accurately reflect the quality of instruction to promote students’ mastery of what is being assessed.” Yoon & Resnick (1998) asserted that an instructionally valid test is one that registers differences in the amount and kind of instruction to which students have been exposed. They also described the degree of overlap between the content tested and the content taught as opportunity to learn, which has an impact on test scores. Let’s consider the Grade 10 curriculum in Araling Panlipunan (social studies). In the first grading period, the class will cover three economic issues: unemployment, globalization and sustainable development. Suppose only two were discussed in class but the assessment covered all three issues. Although these were all identified in the curriculum guide and may even be found in a textbook, the question remains as to whether the topics were all taught or not. Inclusion of items that were not taken up in class reduces validity because students had no opportunity to learn the knowledge or skill being assessed.
To improve the validity of assessment, it is recommended that the teacher construct a two-dimensional grid called a TABLE OF SPECIFICATIONS (ToS). The ToS is prepared before developing the test. It is a test blueprint that identifies the content area and describes the learning outcomes at each level of the cognitive domain (Notar et al., 2004). It is a tool used in conjunction with lesson and unit planning to help teachers make genuine connections between planning, instruction, and assessment (Fives & DiDonato-Barnes, 2013). It assures teachers that they are testing students’ learning across a wide range of content and readings as well as cognitive processes requiring higher order thinking. Table 4.1 (see p. 54) is an example of an adapted ToS using the learning competencies found in the Math curriculum guide. It is a two-way table with learning objectives or content matter on the vertical axis and the intellectual processes on the horizontal axis.
Carey (as cited by Notar et al., 2004) specified six elements in ToS development: (1) balance among the goals selected for the examination; (2) balance among the levels of learning; (3) the test format; (4) the total number of items; (5) the number of items for each goal and level of learning; and (6) the enabling skills to be selected from each goal framework.
The first three elements were discussed in the previous chapter. As to the total number of items, that would depend on the duration of the test, which is contingent on the academic level and attention span of the students. A six-year-old Grade 1 pupil is not expected to accomplish a one-hour test; pupils that young do not have the tolerance to sit through an examination that long. The number of items is also determined by the purpose of the test or its proposed uses. Is it a power test or a speed test? Power tests are intended to measure the range of the student’s capacity in a particular area, as opposed to a speed test, which is characterized by time pressure. Hopkins (1998) argued that classroom tests should be power tests and that students should be given ample time to answer the test items. He reasoned that there is little correlation anyway between a student’s ability and his/her working rate on a test.
Meanwhile, determining the number of items for each topic in the ToS depends on the instructional time. This
means that topics that consumed longer instructional time with more teaching-learning activities should be given
more emphasis. This element is reflected in the third column of the ToS (Table 4.1). An end-of-period test covering
only half of the six-week period is not representative of the students’ overall learning achievement.
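To make this allocation concrete, the short Python sketch below distributes a hypothetical 50-item test across topics in proportion to instructional hours. The topics, hours, and test length are invented for illustration and are not taken from Table 4.1.

```python
# Sketch: allocate test items per topic in proportion to instructional time.
# Topics, hours, and total item count are hypothetical examples.

topics = {          # topic -> instructional hours spent
    "Fractions": 6,
    "Decimals": 4,
    "Measurement": 2,
}
total_items = 50    # intended length of the summative test

total_hours = sum(topics.values())
allocation = {
    topic: round(total_items * hours / total_hours)
    for topic, hours in topics.items()
}

for topic, n_items in allocation.items():
    print(f"{topic}: {n_items} items")
# Fractions: 25 items, Decimals: 17 items, Measurement: 8 items.
# Rounding may require a small manual adjustment so the counts sum to the total.
```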

The last element denotes the level of performance being achieved. The teacher should test at the highest level. Suppose that Grade 3 pupils are expected to construct declarative and interrogative sentences in their English class. The test items should allow them to do just that and not just examine their skill of distinguishing declarative from interrogative sentences. Nonetheless, it is assumed that if students can perform a complex skill, then they can also accomplish the lower levels.
Validity suffers if the test is too short to sufficiently measure behavior and cover the content. Adding more items to the test may increase its validity. However, an excessively long test may be taxing to the students. Given this trade-off, teachers must construct tests that students can finish within a reasonable time. It would also be helpful if teachers provide students with tips on how to manage their time.
B. CRITERION-RELATED EVIDENCE
Criterion-related evidence for validity refers to the degree to which test scores agree with an external criterion. As such, it is related to external validity. It examines the relationship between an assessment and another measure of the same trait (McMillan, 2007). There are three types of criteria (Nitko & Brookhart, 2011): (1) achievement test scores; (2) ratings, grades and other numerical judgments made by the teacher; and (3) career data. An achievement test measures the extent to which learners have acquired knowledge about a particular topic or area, or certain skills, at a specific point in time as a result of planned instruction or training. For example, a summative test on “earth and space” given to Grade 10 students in science at the end of the first quarter can serve as a criterion. A readiness test in earth science can be compared to the results of the periodical test through correlation. Established personality inventories can be used as criteria to validate newly developed personality measures. Results of vocational interest surveys can be compared to career data to determine validity.

Criterion-related evidence is of two types: concurrent validity and predictive validity. CONCURRENT VALIDITY
provides an estimate of a student’s current performance in relation to a previously validated or established measure. For
instance, a school has developed a new intelligence quotient (IQ) test. Results from this test are statistically correlated to
the results from a standard IQ test. If the statistical analysis reveals a strong correlation between the two sets of scores,
then there is high criterion validity. It is important to mention that data from the two measures are obtained at about the
same time.
PREDICTIVE VALIDITY pertains to the power or usefulness of test scores to predict future performance. For instance,
can scores in the entrance/admission test (predictor) be used to predict college success (criterion)? If there is a
significantly high correlation between entrance examination scores and first year grade point averages (GPA) assessed
a year later, then there is predictive validity. The entrance examination was a good measurement tool for student
selection. Similarly, an aptitude test given to high school students is predictive of how well they will do in college. The
criterion is the GPA. Educators may be interested in exploring other criteria like job proficiency ratings, annual income,
or twenty-first-century skills like citizenship.
In testing correlations between two data sets for both concurrent and predictive validity, the PEARSON CORRELATION COEFFICIENT (r) or SPEARMAN’S RANK ORDER CORRELATION may be used. The square of the correlation coefficient (r^2) is called the coefficient of determination. In the previous example where the entrance examination is the predictor and the college GPA is the criterion, a coefficient of determination r^2 = 0.25 means 25% of students’ college grades may be explained by their scores in the entrance examination. It also implies that there are other factors that contribute to the criterion variable. Teachers may then look into other variables like study habits and motivation.
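As a rough illustration of these computations, the sketch below uses invented entrance examination scores and first-year GPAs to obtain the Pearson r and the coefficient of determination; the data are hypothetical and only show the mechanics.

```python
import numpy as np

# Hypothetical paired data: entrance exam scores (predictor) and first-year GPAs (criterion).
exam_scores = np.array([78, 85, 92, 66, 74, 88, 95, 70])
first_year_gpa = np.array([2.4, 2.9, 3.4, 2.1, 2.5, 3.1, 3.6, 2.3])

r = np.corrcoef(exam_scores, first_year_gpa)[0, 1]   # Pearson correlation coefficient
r_squared = r ** 2                                   # coefficient of determination

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
# An r^2 of 0.25 would mean 25% of the variance in GPA is explained by exam scores.
```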
C. CONSTRUCT-RELATED EVIDENCE
A construct is an individual characteristic that explains some aspect of behavior (Miller, Linn & Gronlund, 2009). Construct-related evidence of validity is an assessment of the quality of the instrument used. It measures the extent to which the assessment is a meaningful measure of an unobservable trait or characteristic (McMillan, 2007). There are three types of construct-related evidence: theoretical, logical and statistical (McMillan, 2007).
A good construct has a theoretical basis. This means that the construct must be operationally defined or explained explicitly to differentiate it from other constructs. Confusion in the meaning of the construct will render assessment results and inferences dubious. Motivation, for instance, is a latent variable. In measuring students’ motivation to learn, it is only right to ask the question, “What makes students want to learn in school?” Does motivation here pertain to the interest or satisfaction shown by students in doing a particular set of tasks, or the reasons why they are actively engaged in the tasks? Conley, Karabenick & Arbor (2006), in their study, measured students’ motivation to learn through a survey that included measures of self-efficacy for learning, task value and students’ personal achievement goals, focusing on the domain of Mathematics. To get substantive evidence of construct validity, internal relations among the items in the instrument were sought. Through a statistical process called factor analysis, they found that data on task value and achievement goals were consistent with the theory of the construct.
In 1955, Lee Cronbach and Paul Meehl insisted that to provide evidence of construct validity, one has to develop a nomological network. It is basically a network of laws that includes the theoretical framework of the construct, an empirical framework on how it is going to be measured, and specification of the linkages between the two frameworks (Trochim, 2006). Unfortunately, the nomological network does not provide a practical approach to assessing construct validity.
Construct validity can take the form of a differential groups study. For instance, a test on problem-solving strategies is given to two groups of students: those specializing in Math and those specializing in Social Studies. If the Math group performs better than the other group, then there is evidence of construct validity. Another form is an intervention study wherein a test is given to a group of students who are weak in problem-solving strategies. After being taught the construct, the same group of students is assessed again. If there is a significant increase in test scores, the results support the construct validity of the test. These constitute logical evidence of construct validity.
There are two methods of establishing construct validity: convergent and divergent validation. On the one hand, CONVERGENT validity occurs when measures of constructs that are related are in fact observed to be related. DIVERGENT (or discriminant) validity, on the other hand, occurs when constructs that are unrelated are in reality observed not to be related. Let’s consider a test administered to measure knowledge and skills in geometrical reasoning, which is far different from a reading construct. Hence, comparison of test results from these two constructs will show a lack of commonality. This is what we mean by discriminant validity. Convergent validity is obtained when test scores in geometrical reasoning turn out to be highly correlated with scores from another geometry test measuring the same construct. These construct-related evidences rely on statistical procedures. In 1959, Campbell and Fiske developed a statistical approach called the MULTITRAIT-MULTIMETHOD MATRIX (MTMM), a table of correlations arranged to facilitate the assessment of construct validity, integrating both convergent and divergent validity (Trochim, 2006). In contrast to the nomological network, the MTMM employs a methodological approach. However, it was still difficult to implement. McMillan (2007) recommends, for practical purposes, the use of clear definitions and logical analysis as construct-related evidence.
UNIFIED CONCEPT OF VALIDITY
In 1989, Messick proposed a unified concept of validity based on an expanded theory of construct validity which addresses score meaning and social values in test interpretation and test use. His concept of unified validity “integrates considerations of content, criteria, and consequences into a construct framework for the empirical testing of rational hypotheses about score meaning and theoretically relevant relationships” (Messick, 1995, p.741). He presented six distinct aspects of construct validity: content, substantive, structural, generalizability, external and consequential. The descriptions that follow are based on the capsule descriptions given by Messick (1995).
The content aspect is parallel to content-related evidence, which calls for content relevance and representativeness. The substantive aspect pertains to theoretical constructs and empirical evidence. The structural aspect assesses how well the scoring structure matches the construct domain.
The generalizability aspect examines how score properties and interpretations generalize to and across population groups, contexts and tasks. This is called EXTERNAL VALIDITY. Criterion-related evidence for validity is related to external validity as the criterion may be an externally defined gold standard. The external aspect includes convergent and discriminant evidence taken from multitrait-multimethod studies. Finally, the consequential aspect pertains to the intended and unintended effects of assessment on teaching and learning.
In view of consequential validity, Messick (1995) explained that the social consequences of testing may be positive if it leads to improved educational policies, or negative if beset with bias in scoring and interpretation or unethical use. The study made by Conley, Karabenick & Arbor (2006) on students’ motivation to learn (math) contained a discussion of the ways in which data had been reported and used as consequential evidence. They wanted to show that students are motivated in different ways and, as such, different teacher interventions are needed. Hence, they wrote that the intended consequence of the use of the scores in their study was to change the content of the school’s professional development and action plans. Below is an excerpt (Conley, Karabenick & Arbor, 2006, p.11):
Algebra 1A students started the year with the lowest scores on interest, mastery, and efficacy (e.g., they saw math as
less interested than other math students, they were less likely to focus on understanding, and were the least
confident in their math ability). However, they saw math as just as useful as other students in the school, and had
similar levels of focus on competition.

• Change – Algebra 1A students had a more adaptive pattern of change than other students at this school; the drops were generally smaller than for students in other courses. They saw math as less useful and were less focused on learning but slightly more confident in their math ability.
• Goals for next year – help students see how math is useful, and importantly, help focus students on learning and developing (rather than just demonstrating) ability.

As for alternative assessments, Sambell, McDowell & Brown (1997) wrote that such assessments (which include performance tasks) appear to have strong consequential validity because they incorporate meaningful, engaging and authentic tasks. Moreover, students are actively involved in the assessment process. In their study, student respondents perceived alternative assessment to be an accurate measure of student learning. They related alternative assessment to authentic tasks, and hence found them beneficial and rewarding in the long run. Their perceptions of conventional assessment were quite negative because the student respondents viewed testing as divorced from effective learning.
Positive consequences were explained well by McMillan (2007). These consequences pertain to how assessment directly impacts students and teachers. Positive consequences on students include the effects of assessment on how they study, how they are motivated and how they relate to their teachers. Teachers who often give tests on identification and enumeration encourage memorization. Teachers who provide problem-solving activities and performance assessments compel students to hone their skills. When students are informed of the standards and how they are to be assessed and graded, they assume a shared responsibility for their learning. When assessment is authentic and done with fairness, it can positively impact their motivation. The teacher-student relationship is also strengthened when teachers provide feedback to students concerning their performance.
Teachers are likewise affected by the nature of the assessment they give. If assessment calls for recall of facts, teachers tend to resort to the lecture method. However, if reasoning and skills are to be assessed, then teachers devise plenty of classroom learning experiences and situations that call for knowledge application and higher order thinking. Oral questioning and performance tasks become typical methods for acquiring evidence of student learning. The methods of assessment used by teachers mirror the kind of educators they are.
Traditional teachers lecture all day long and teach to the test; for them, the purpose of assessment is to report grades. However, twenty-first-century teachers design alternative assessments that target multiple intelligences. This is possible by utilizing more authentic assessment tools such as portfolios, projects and journals, among others. The purpose of assessment is then to reinforce learning and measure success.
VALIDITY OF ASSESSMENT METHODS
In the previous sections, validity of traditional assessments was discussed. What about the other assessment
methods? The same validity concepts apply.
Developing performance assessments involves three steps: define the purpose, choose the activity and develop criteria for scoring. The first step is about determining the essential skills students need to develop and the content worthy of understanding. To acquire validity evidence in terms of content, performance assessments should be reviewed by qualified content experts. In choosing the activity, Moskal (2003) laid down five recommendations. These are intrinsically associated with the validity of the assessment.
1. The selected performance should reflect a valued activity.
2. The completion of performance assessments should provide a valuable learning experience.
3. The statement of goals and objectives should be clearly aligned with the measurable outcomes of the performance activity.
4. The task should not examine extraneous or unintended variables.
5. Performance assessments should be fair and free from bias.
In scoring, a rubric or rating scale has to be created. Teachers must exercise caution because distracting factors like a student’s handwriting and the neatness of the product can affect the rating. Additionally, personal idiosyncrasies infringe on the objectivity of the teacher/rater, which lowers the validity of the performance assessment.
In controlled conditions, oral questioning has high validity. A checklist defining the outcomes to be covered and the standards/criteria to be achieved can further ensure the validity of the assessment. It is recommended, though, that for summative purposes there should be a structured or standardized list of questions.
For observations, operational and response definitions should accurately describe the behavior of interest. Observation is highly valid if evidence is properly recorded and interpreted. Additionally, validity is stronger if additional assessment strategies are used alongside observation, like interviews, surveys and quantitative methods such as tests. In qualitative research this is called TRIANGULATION – a technique to validate results through cross-verification from two or more sources.
Validity in self-assessment is described by Ross (2006) as the agreement between self-assessment ratings and teacher judgments or peer rankings. Studies on the validity of self-assessments have mixed results. In many cases, student self-assessment scores are higher compared to teacher ratings (Ross, 2006). This is especially true for younger learners because of cognitive biases and wishful thinking that lead to distorted judgments. Self-assessment ratings also tend to be inflated when these contribute to the students’ final grades. To increase validity, students should be informed of the domain in which the task is embedded. They should be taught how to objectively assess their work based on clearly defined criteria and to set aside any self-interest bias. Informing them that their self-assessment ratings shall be compared to those made by their teacher and peers may also induce them to make accurate assessments of their own performance.
No single type of instrument or method of data collection can assess the vast array of learning and development outcomes in a school program (Miller, Linn & Gronlund, 2009). For this reason, teachers should use a wide range of assessment tools to build a complete profile of students’ strengths, weaknesses and intellectual achievements. The teacher may opt to give multiple choice questions to assess knowledge, understanding and application of theories. However, selected-response tests do not provide students the opportunity to practice and demonstrate their writing skills. Hence, these should be balanced with other forms of assessment like essays. Additionally, direct methods of assessment should be coupled with indirect methods such as student surveys, interviews and focus groups. While direct assessment examines actual evidence of student outcomes, indirect assessment gathers perception data or feedback from students or other persons who may have relevant information about the quality of the learning process.
THREATS TO VALIDITY
Miller, Linn & Gronlund (2009) identified ten factors that affect the validity of assessment results. These factors are defects in the construction of assessment tasks that would render assessment inferences inaccurate. The first four factors apply to both traditional tests and performance assessments. The remaining factors concern brief constructed-response and selected-response items.
1. Unclear test directions
2. Complicated vocabulary and sentence structure
3. Ambiguous statements
4. Inadequate time limits
5. Inappropriate level of difficulty of test items
6. Poorly constructed test items
7. Inappropriate test items for outcomes being measured
8. Short test
9. Improper arrangement of items
10. Identifiable pattern of answers
McMillan (2007) laid down his suggestions for enhancing validity. These are as follows:

• Ask others to judge the clarity of what you are assessing.
• Check to see if different ways of assessing the same thing give the same result.
• Sample a sufficient number of examples of what is being assessed.
• Prepare a detailed table of specifications.
• Ask others to judge the match between the assessment items and the objectives of the assessment.
• Compare groups known to differ on what is being assessed.
• Compare scores taken before instruction to those taken after instruction.
• Compare predicted consequences to actual consequences.
• Compare scores on similar, but different, traits.
• Provide adequate time to complete the assessment.
• Ensure appropriate vocabulary, sentence structure and item difficulty.
• Ask easy questions first.
• Use different methods to assess the same thing.
• Use the assessment only for intended purposes.
RELIABILITY
Reliability talks about reproducibility and consistency in methods and criteria. An assessment is said to be reliable if it produces the same results when given to an examinee on two occasions. It is important then to stress that reliability pertains to the obtained assessment results and not to the test or any other instrument. Another point is that reliability is unlikely to turn out 100% because no two tests will consistently produce identical results. Even the same test administered to the same group of students after a day or two will show some differences. There are environmental factors like lighting and noise that affect reliability. Student error and the physical well-being of examinees also affect the consistency of assessment results.
For a test to be valid, it has to be reliable. Let us look at an analogous situation. Suppose a weighing scale is off by 6 pounds. You weighed a dumbbell for seven consecutive days. The scale revealed the same measurement each time, hence the results are reliable. However, the scale did not provide an accurate measure and therefore is not valid. From the foregoing, reliability is a necessary condition for validity but not a sufficient one. Similarly, a test can be found reliable, but that does not imply that the test measures what it purports to measure.
Reliability is expressed as a correlation coefficient. A high reliability coefficient denotes that if a similar test is readministered to the same group of students, test results from the first and second testing are comparable.
TYPES OF RELIABILITY
There are two types of reliability: internal and external reliability. Internal reliability assesses the consistency of
results across items within a test whereas external reliability gauges the extent to which a measure varies from one
use to another.

SOURCES OF RELIABILITY EVIDENCE


In terms of sources of reliability evidence, there are five classes: evidence based on stability, evidence based on equivalent forms, evidence based on internal consistency, evidence based on scorer or rater consistency, and evidence based on decision consistency.
A. Stability
Test-retest reliability correlates scores obtained from two administrations of the same test over a period of time. It is used to determine the stability of test results over time. It assumes that there is no considerable change in the construct between the first and second testing. Hence, timing is critical because characteristics may change if the time interval is too long. A short gap between the testing sessions is also not recommended because subjects may still recall their responses. Typically, test-retest reliability coefficients for standardized achievement and aptitude tests are between 0.80 and 0.90 when the interval between testing is 3 to 6 months (Nitko & Brookhart, 2011). Note that a reliability coefficient is an index of reliability.
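A minimal sketch of estimating test-retest reliability, assuming two sets of scores from the same (hypothetical) students taken a few months apart, might look like this:

```python
from scipy.stats import pearsonr

# Hypothetical scores of the same eight students on two administrations
# of the same test, a few months apart.
first_admin = [35, 42, 28, 45, 38, 30, 47, 33]
second_admin = [37, 40, 30, 44, 36, 31, 48, 35]

r, p_value = pearsonr(first_admin, second_admin)  # correlate the two administrations
print(f"Test-retest reliability estimate: r = {r:.2f}")
```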

B. Equivalence
Parallel-forms reliability ascertains the equivalence of forms. In this method, two different versions of an assessment tool are administered to the same group of individuals. The items are parallel, i.e., they probe the same construct, base knowledge or skill. The two sets of scores are then correlated in order to evaluate the consistency of results across alternate versions. Equivalent forms are ideal for make-up tests or action research studies that would utilize pre- and post-tests. An equivalent test is not just a matter of rearranging the items. New, different items measuring the same construct must be devised. This is where the difficulty lies. For a specific skills test like addition of signed numbers, it would be relatively easy. However, for complex or subjective constructs, it would require time and effort to prepare. Moreover, it is rather impractical to require students to answer two forms of a test.
C. Internal consistency
Internal consistency implies that a student who has mastered the content will get all or most of the items correctly, while a student who knows little or nothing about the subject matter will get all or most of the items wrong. To check for internal consistency, the split-half method is done by dividing the test into two halves, either by separating the first half and the second half of the test or by odd- and even-numbered items, and then correlating the results of the two halves. The Spearman-Brown formula is then applied to estimate the reliability of the whole test and not just each half of the test:
Whole-test reliability = (2 × reliability of half test) / (1 + reliability of half test)
For instance, assume that the correlation between the two half scores is 0.70. The Spearman-Brown reliability
estimate for the full length test is 0.82.
Split-half is effective for large questionnaires or tests with several items measuring the same construct. To improve the reliability of a test using this method, items with low correlations are either removed or modified.
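The split-half procedure and the Spearman-Brown correction can be sketched as follows; the item responses are invented, and the odd- and even-numbered items form the two halves:

```python
import numpy as np

# Hypothetical item scores: rows = students, columns = items (1 = correct, 0 = wrong).
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
])

odd_half = scores[:, 0::2].sum(axis=1)    # totals on items 1, 3, 5, 7
even_half = scores[:, 1::2].sum(axis=1)   # totals on items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # correlation between the two halves
spearman_brown = (2 * r_half) / (1 + r_half)      # whole-test reliability estimate

print(f"Half-test r = {r_half:.2f}, Spearman-Brown estimate = {spearman_brown:.2f}")
```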
There are two other ways to establish internal consistency: CRONBACH’S ALPHA and the KUDER-RICHARDSON (KR) 20/21 formulas. The Cronbach alpha is a better measure than split-half because it gives the average of all possible split-half reliabilities. It measures how well items in a scale (e.g., 1 = strongly disagree to 5 = strongly agree) correlate with one another (Salvucci et al., 1997). The Kuder-Richardson 20/21 formula is applicable to dichotomous items (0/1). Items of the test are scored 1 if marked correctly and zero otherwise.
For internal consistency, reliability measures are rated as follows: less than 0.50, the reliability is low; between 0.50 and 0.80, reliability is moderate; and greater than 0.80, the reliability is high (Salvucci et al., 1997).
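For illustration, the sketch below computes Cronbach’s alpha from a small, hypothetical matrix of Likert-scale responses using the standard formula alpha = [k/(k-1)] × [1 − (sum of item variances / variance of total scores)]:

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a students-by-items score matrix."""
    k = item_scores.shape[1]                               # number of items
    item_variances = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)   # variance of students' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 5-point Likert responses (rows = students, columns = items).
responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 2],
    [4, 4, 3, 4],
])
print(f"Cronbach alpha = {cronbach_alpha(responses):.2f}")
```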
D. Scorer or rater consistency
People do not necessarily rate in a similar way. They may have disagreements as to how responses or materials truly reflect or demonstrate knowledge of the construct or skill being assessed. Moreover, certain characteristics of the raters contribute to errors like bias, the halo effect, mood, and fatigue, among others. Bias is partiality or a display of prejudice in favor of or against a student or group. Hence, a teacher playing favorites in class may be swayed to give high marks to students based on his/her preferences or liking, and not on the students’ performance. The halo effect is a cognitive bias that allows first impressions to color one’s judgment of another person’s specific traits. For instance, when a teacher finds a student personable or well-behaved, this may lead the teacher to assume that the student is also diligent and participative before he/she can actually observe or gauge this objectively. The halo effect can be traced back to Edward Thorndike’s 1920 study entitled “A Constant Error in Psychological Ratings.” The problem with the halo effect is that even teachers who are aware of it may not have any idea that it is already occurring in their judgments.
Just as adding test items can improve the reliability of a standardized test, having multiple raters can increase reliability. Inter-rater reliability is the degree to which different raters, observers or judges agree in their assessment decisions. To mitigate rating errors, the careful selection and training of judges and the use of applicable statistical techniques are suggested.
Inter-rater reliability is useful when grading essays, writing samples, performance assessments and portfolios. However, certain conditions must be met: there should be a good variation of products to be judged, the scoring criteria should be clear, and the raters should be knowledgeable or trained in how to use the observation instrument (McMillan, 2007).
To estimate inter-rater reliability, Spearman’s rho or Cohen’s kappa may be used to calculate the correlation coefficient between or among the ratings. The first is used for ordinal data while the other is used for nominal and discrete data.
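As a hedged sketch, the code below estimates both statistics for two hypothetical raters, using scipy for Spearman’s rho and a direct computation of Cohen’s kappa from observed and chance agreement:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical essay ratings from two raters.
rater_a = [4, 3, 5, 2, 4, 3, 5, 1]
rater_b = [4, 3, 4, 2, 5, 3, 5, 2]

# Spearman's rho for ordinal ratings.
rho, _ = spearmanr(rater_a, rater_b)

# Cohen's kappa for categorical agreement: (p_o - p_e) / (1 - p_e).
a, b = np.array(rater_a), np.array(rater_b)
categories = np.union1d(a, b)
p_o = np.mean(a == b)                                              # observed agreement
p_e = sum(np.mean(a == c) * np.mean(b == c) for c in categories)   # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)

print(f"Spearman rho = {rho:.2f}, Cohen's kappa = {kappa:.2f}")
```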
E. Decision consistency
Decision consistency describes how consistent the classification decisions are rather than how consistent the scores are (Nitko & Brookhart, 2011). Decision consistency is seen in situations where teachers decide who will receive a passing or failing mark, or who is considered to possess mastery or not.
Let us consider the levels of proficiency adopted in the Philippines at the onset of the K to 12 program. In reporting the grades of K to 12 learners at the end of each quarter, the performance of students is described based on the following levels of proficiency: beginning (74% and below); developing (75%-79%); approaching proficiency (80%-84%); proficient (85%-89%); and advanced (90% and above). Suppose two students receive marks of 80 and 84; then they are both regarded as having ‘approaching proficiency’ in the subject. Despite the numerical difference in their grades, and even though 84 is just a percentage point shy of 85, the teacher, on the basis of the foregoing classification, can infer that both students have not reached the proficient or advanced levels. If decisions are consistent, then these students would still be classified in the same level regardless of the type of assessment method used.
McMillan (2007) gives a similar explanation. Matching of classifications is done by comparing scores from two administrations of the same test. Suppose students were judged as beginning, proficient or advanced. In a class of 20 students, results in the first testing showed that 5 are at the beginning level, 10 at the proficient level and 5 are advanced. In the second testing involving the same students, 3 of the 5 students who were previously evaluated as beginners are still at the same level, all 10 proficient students remain at that level, and 4 of the 5 advanced students retained their level of proficiency. Hence, there are 17 matches, which is equivalent to 85% consistency.
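The sketch below ties the proficiency bands described earlier to a simple match count between two administrations; the grades are hypothetical and only illustrate how decision consistency can be computed.

```python
def proficiency_level(grade: float) -> str:
    """Classify a grade using the K to 12 proficiency bands described above."""
    if grade >= 90:
        return "advanced"
    if grade >= 85:
        return "proficient"
    if grade >= 80:
        return "approaching proficiency"
    if grade >= 75:
        return "developing"
    return "beginning"

# Hypothetical grades of the same students on two administrations.
first = [72, 81, 86, 91, 78, 84, 88, 93, 76, 83]
second = [74, 80, 87, 92, 80, 83, 85, 90, 77, 86]

# Count how many students land in the same proficiency level both times.
matches = sum(
    proficiency_level(a) == proficiency_level(b) for a, b in zip(first, second)
)
print(f"Decision consistency: {matches}/{len(first)} = {matches / len(first):.0%}")
```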
MEASUREMENT ERRORS
An observation is composed of the true value plus measurement error. Measurement errors can be caused by examinee-specific factors like fatigue, boredom, lack of motivation, momentary lapses of memory and carelessness. Consider a student who lacked sleep for one reason or another. He/she may perform poorly in an examination. His/her physical condition during the assessment does not truly reflect what he/she knows and can do. The same can be said for another student who may have attained a high score in the same examination and yet made most of his/her responses through guessing. Test-specific factors are also causes of measurement errors. Teachers who provide poor or insufficient directions may cause students to answer test items differently and inaccurately. Ambiguous questions would only elicit vague and varied responses. Trick questions purposely intended to deceive test takers will leave students perplexed rather than enlightened. Aside from examinee- and test-specific factors, error can arise due to scoring factors. Inconsistent grading systems, carelessness and computational errors lead to imprecise or erroneous student evaluations.
Classical test theory gives the formula X = T + E, where X is the observation (a measured score like a student’s performance in a test), T is the true value, and E is some measurement error. The error component includes random error and systematic error. Random errors are those that affect reliability while systematic errors impact validity. To illustrate, suppose you measure your mass three times using a scale and get slightly different values; then there are random errors. However, if the scale consistently gives measurement values that are higher by 2 grams, then the errors are systematic.
Random errors, also called noise, produce random fluctuations in measurement scores. These are errors associated with situations like an outside racket during the conduct of a test, subjective scoring, poor instructions, ambiguous questions and guessing. Random error does not alter the average performance of a group but it affects the variability around the average. The standard error of measurement (SEM) is an index of the expected variation of the observed scores due to measurement error. It estimates how a student’s test scores would tend to be distributed around his/her true score had the student been tested repeatedly using the same instrument. Better reliability means a lower SEM. The SEM is the standard deviation of the measurement errors associated with test scores. From the test scores of a set of students, the mean, standard deviation (Sx) and test score reliability (rxx) are obtained. SEM is computed using the formula SEM = Sx√(1 − rxx). A student’s true score may then be estimated by relating the SEM to the normal curve.
Because it is impractical to administer tests repeatedly and impossible to hold certain variables constant (no new learning or recall of questions), the observed score (X) can be regarded as the mean. The SEM is subtracted from and added to it to get a range, called the CONFIDENCE INTERVAL or score band, where the true score lies. The score band (X ± 2SEM) gives a reasonable limit for estimating the true score. For instance, scores in an achievement test revealed a standard deviation of 6.33 and a Cronbach alpha reliability estimate of 0.90. The calculated SEM is 2. Suppose a student receives a score of 32; how do we make an interpretation about his/her true score? We can say with 68% confidence that the student’s true score lies within one SEM of 32, i.e., between 30 and 34 (X ± 1SEM); with 95% confidence that his/her true score falls between 28 and 36 (X ± 2SEM); and with 99% confidence that the true score is in the 26 to 38 range (X ± 3SEM).
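Using the figures from the example above (Sx = 6.33, rxx = 0.90, observed score of 32), the SEM and the score bands can be reproduced with a short sketch:

```python
import math

sd = 6.33           # standard deviation of the test scores (Sx)
reliability = 0.90  # Cronbach alpha reliability estimate (rxx)
observed = 32       # a student's observed score (X)

sem = sd * math.sqrt(1 - reliability)   # standard error of measurement, roughly 2

for k, confidence in [(1, "68%"), (2, "95%"), (3, "99%")]:
    low, high = observed - k * sem, observed + k * sem
    print(f"{confidence} score band: {low:.0f} to {high:.0f}")
# Prints roughly 30 to 34, 28 to 36, and 26 to 38, matching the worked example.
```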
Systematic errors are referred to as bias. They tend to shift all measurements in a systematic way, consequently displacing the mean. While random errors can be reduced by averaging multiple measurements, systematic errors can be reduced by identifying and removing the errors at their source.
Reliability indicates the extent to which scores are free from measurement errors. As pointed out, lengthening a test or increasing the number of items in a set can increase reliability. Teacher-made summative tests and standardized tests are more reliable compared to informal observations conducted over a limited period of time. To increase reliability, an ample amount of observation is needed to detect patterns of behavior.

RELIABILITY OF ASSESSMENT METHODS


Between a well-constructed objective test and a performance assessment, the former has better reliability (Miller, Linn & Gronlund, 2009; Harris, 1997). Performance assessment is said to have low reliability because of judgmental scoring. Inconsistent scores may be obtained depending on the raters. This may be due to inadequate training of raters or inadequate specification of the scoring rubrics (Harris, 1997). Additionally, in a performance assessment, there is limited sampling of course content. But as Harris (1997) explained, constraining the domain coverage or structuring the responses may raise the reliability of performance assessments. Reliable scoring of performance can also be supported with exemplars and/or rater training (Jonsson & Svingby, 2007).
As for oral questioning, suggestions for improving the reliability of written tests may also be extended to oral examinations, such as increasing the number of questions, the response time and the number of examiners, and using a rubric or marking guide that contains the criteria and standards.
To reliably measure student behavior, observation instruments must be comprehensive enough to adequately sample occurrences and non-occurrences of behavior, but still manageable to conduct. Direct observation data, according to Hintze (2005), can be enhanced through inter-observer agreement and intra-observer reliability. The first pertains to the consistency of observation data collected by multiple teachers or observers, while the other refers to the consistency of data collected on a behavior multiple times by a single teacher or observer.

Self-assessments, according to Ross (2006), have high consistency across tasks, across items and over short time periods. This is especially true when self-assessments are done by students who have been trained in how to evaluate their work. Greater variations are expected with younger children.
Below are ways to improve the reliability of assessment results (Nitko & Brookhart, 2011).

• Lengthen the assessment procedure by providing more time, more questions and more observations whenever practical.
• Broaden the scope of the procedure by assessing all the significant aspects of the target learning performance.
• Improve objectivity by using a systematic and more formal procedure for scoring student performance. A scoring scheme or rubric would prove useful.
• Use multiple markers by employing inter-rater reliability.
• Combine results from several assessments, especially when making crucial educational decisions.
• Provide sufficient time to students in completing the assessment procedure.
• Teach students how to perform their best by providing practice and training and by motivating them.
• Match the assessment difficulty to the students’ ability levels by providing tasks that are neither too easy nor too difficult, and tailoring the assessment to each student’s ability level when possible.
• Differentiate among students by selecting assessment tasks that distinguish or discriminate the best from the least able students.

Note that some of the suggestions on improving reliability overlap with those concerning validity. This is because reliability is a precursor to validity. However, it is important to note that high reliability does not ensure a high degree of validity.
