
CHAPTER 4 Validity and Reliability

OVERVIEW
It is not unusual for teachers to receive complaints or comments from students regarding tests and other
assessments. For one, there may be an issue concerning the coverage of the test. Students may have been tested
on areas that were not part of the content domain, or they may not have been given the opportunity to study or learn
from the material. The emphasis of the test may also be too complex, inconsistent with the performance verbs in the
learning outcomes.
Validity alone does not ensure high quality assessment. Reliability of test results should also be checked.
Questions on reliability surface if there are inconsistencies in the results when tests are administered over different
time periods, samples of questions, or groups.
Both validity and reliability are considered when gathering information or evidence about student
achievement. This chapter discusses the distinctions between the two.

Intended Learning Outcome (ILO)


At the end of Chapter 4, students are expected to:
• Cite evidences of validity and reliability in teacher-made tests.

ENGAGE

VALIDITY
Validity is a term derived from the Latin word validus, meaning strong. In the context of assessment, an assessment is deemed valid if it measures what it is supposed to measure. In contrast to what some teachers believe, validity is not a property of a test. It pertains to the accuracy of the inferences teachers make about students based on the information gathered from an assessment (McMillan, 2007; Fives & Didonato-Barnes, 2013). This implies that the conclusions teachers come up with in their evaluation of student performance are valid if there are strong and sound evidences of the extent of students' learning. Decisions also include those about instruction and classroom climate (Russell & Airasian, 2012).
An assessment is valid if it measures a student's actual knowledge and performance with respect to the intended outcomes, and not something else. It is representative of the area of learning or content of the curricular aim being assessed (McMillan, 2007; Popham, 2011). An assessment purportedly for measuring arithmetic skills of Grade 4 pupils is invalid if used for Grade 1 pupils because of issues with content (test content evidence) and level of performance (response process evidence). A test that measures recall of mathematical formulae is invalid if it is supposed to assess problem-solving. This is an example of a validity problem, particularly concerning content-related evidence.
There are two other sources of information that can be used to establish validity: criterion-related evidence and
construct-related evidence. Each of these shall be presented here.
A. Content-Related Evidence
Content-related evidence for validity pertains to the extent to which the test covers the entire domain of
content. If a summative test covers a unit with four topics, then the assessment should contain items from each topic.
This is done through adequate sampling of content. A student’s performance in the test may be used as an indicator
of his/her content knowledge. For instance, if a Grade 4 pupil was able to correctly answer 80% of the items in a
Science test about matter, the teacher may infer that the pupil knows 80% of the content area.
In the previous chapter, we talked about appropriateness of assessment methods to learning outcomes. A
test that appears to adequately measure the learning outcomes and content is said to possess face validity. As the
name suggests, it looks at the superficial face value of the instrument. It is based on the subjective opinion of the one
reviewing it. Hence, it is considered non-systematic or non-scientific. A test that was prepared to assess the ability of

pupils to construct simple sentences with correct subject-verb agreement has face validity if the test looks like an
adequate measure of the cognitive skill.
Another consideration related to content validity is instructional validity – the extent to which an assessment
is systematically sensitive to the nature of instruction offered. This is closely related to instructional sensitivity which
Popham (2006, p. 1) defined as the “degree to which students’ performance on a test accurately reflect the quality of
instruction to promote student’s mastery of what is being assessed.” Yoon & Resnick (1998) asserted that an
instructionally valid test is one that registers differences in the amount and kind of instruction to which students have
been exposed. They also described the degree of overlap between the content tested and the content taught as
opportunity to learn, which has an impact on test scores. Let's consider the Grade 10 curriculum in Araling Panlipunan (Social Studies). In the first grading period, the class will cover three economic issues: unemployment, globalization and sustainable development. Suppose only two were discussed in class but the assessment covered all three issues. Although these were all identified in the curriculum guide and may even be found in a textbook, the question remains as to whether the topics were all taught or not. Inclusion of items that were not taken up in class reduces validity because students had no opportunity to learn the knowledge or skill being assessed.
To improve the validity of assessments, it is recommended that the teacher construct a two-dimensional grid called a Table of Specifications (ToS). The ToS is prepared before developing the test. It is a test blueprint that identifies the content areas and describes the learning outcomes at each level of the cognitive domain (Notar, et al., 2004). It is a tool used in conjunction with lesson and unit planning to help teachers make genuine connections between planning, instruction, and assessment (Fives & Didonato-Barnes, 2013). It assures teachers that they are testing students' learning across a wide range of content and readings as well as cognitive processes requiring higher-order thinking. Table 4.1 is an example of an adapted ToS using the learning competencies found in the Math curriculum guide. It is a two-way table with learning objectives or content matter on one axis and the intellectual process on the other.
Table 4.1 Sample Table of Specifications (Notar, et al., 2004)

Course Title: Math


Grade level: V
Periods test is being used: 2
Date of test: August 8, 2014
Subject matter digest: Number and Number Sense
Type of test: Power, Speed, Partially speeded (Circle one)
Test time: 45 minutes
Test value: 100 points
Base number of test questions: 75
Constraints: Test time

Learning Objective No. 1 – Level: Apply; Instructional time: 95 minutes (16%); Q/P: 11/16; Item type: Matching;
Questions (points) by Revised Bloom's Taxonomy level: 6 (10), 5 (2); Total Q/P: 11/16

Learning Objective No. 2 – Level: Understand; Instructional time: 55 minutes (9%); Q/P: 7/10; Item type: MC;
Questions (points) by Revised Bloom's Taxonomy level: 5 (2); Total Q/P: 5/10

Learning Objective No. 10 – Level: Evaluate; Instructional time: 40 minutes (7%); Q/P: 5/7; Item type: Essay;
Questions (points) by Revised Bloom's Taxonomy level: 1 (7); Total Q/P: 1/7

Total – Instructional time: 600 minutes (100%); Q/P: 75/100; by Revised Bloom's Taxonomy level: Remember 11/12,
Understand 23/31, Apply 16/34, Analyze 4/10, Evaluate 3/6, Create 1/7; Total Q/P: 58/100

MC – Multiple Choice; Q – Questions; P – Points

Carey (as cited by Notar, et al., 2004) specified six elements in ToS development: (1) balance among the goals selected for the examination; (2) balance among the levels of learning; (3) the test format; (4) the total number of items; (5) the number of items for each goal and level of learning; and (6) the enabling skills to be selected from each goal framework. The first three elements were discussed in the first three chapters. As to the number of items, that would depend on the duration of the test, which is contingent on the academic levels and attention span of the students. A six-year-old Grade 1 pupil is not expected to accomplish a one-hour test. Pupils that young do not have the tolerance to sit through an examination that long. The number of items is also determined by the purpose of the test or its proposed uses. Is it a power test or a speed test? Power tests are intended to measure the range of a student's capacity in a particular area, as opposed to a speed test, which is characterized by time pressure. Hopkins (1998) argued that classroom tests should be power tests and that students should be given ample time to answer the test items. He reasoned that there is little correlation anyway between a student's ability and his/her working rate on a test.
Meanwhile, determining the number of items for each topic in the ToS depends on the instructional time. This means that topics that consumed longer instructional time with more teaching-learning activities should be given more emphasis. This element is reflected in the third column of the ToS (Table 4.1). An end-of-period test covering only half of the six-week period is not representative of the students' overall learning achievement.
The last element denotes the level of performance being achieved. The teacher should test at the highest level. Suppose that Grade 3 pupils are expected to construct declarative and interrogative sentences in their English class. The test items should allow them to do just that and not just examine their skill in distinguishing declarative from interrogative sentences. Nonetheless, it is assumed that if a student can perform a complex skill, then he/she can also accomplish the lower-level ones.
Test items demanding higher-order thinking skills obviously require more time to answer, whereas simple recall items entail the least amount of time. Nitko & Brookhart (2011) give the average response time for each type of assessment task, as seen in Table 4.2.

Type of Test Questions                                  Time Required to Answer
Alternate Response (True-False)                         20-30 seconds
Modified True or False                                  30-45 seconds (Notar, et al., 2004)
Sentence Completion (one-word fill-in)                  40-60 seconds
Multiple Choice with four responses (lower level)       40-60 seconds
Multiple Choice (higher level)                          70-90 seconds
Matching Type (5 items, 6 choices)                      2-4 minutes
Short answer                                            2-4 minutes
Multiple Choice (with calculations)                     2-5 minutes
Word problems (simple arithmetic)                       5-10 minutes
Short essays                                            15-20 minutes
Data analysis/graphing                                  15-25 minutes
Drawing models/labelling                                20-30 minutes
Extended essays                                         35-50 minutes

Validity suffers if the test is too short to sufficiently measure behavior and cover the content. Adding more
items to the test may increase validity. However, an excessively long test may be taxing to the students. Regardless
of the trade-off, teachers must construct tests that students can finish within a reasonable time. It would be helpful if
teachers also provide students with tips on how to manage their time.
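One practical way to check that a planned test can be finished within the period is to total the estimated response times from Table 4.2. The sketch below, in Python, uses midpoints of the published time ranges; the planned item mix and the allotted time are hypothetical.

# Rough check of whether a planned test fits the class period, using midpoints of the
# response-time estimates in Table 4.2. The item mix and allotted time are hypothetical.
minutes_per_item = {
    "true_false": 25 / 60,              # 20-30 seconds
    "multiple_choice_lower": 50 / 60,   # 40-60 seconds
    "multiple_choice_higher": 80 / 60,  # 70-90 seconds
    "short_answer": 3,                  # 2-4 minutes
    "short_essay": 17.5,                # 15-20 minutes
}
planned_items = {"true_false": 10, "multiple_choice_lower": 20,
                 "multiple_choice_higher": 10, "short_answer": 3, "short_essay": 1}
allotted_minutes = 60

estimated = sum(minutes_per_item[kind] * count for kind, count in planned_items.items())
print(f"Estimated completion time: {estimated:.0f} minutes (allotted: {allotted_minutes})")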

B. Criterion-Related Evidence
Criterion-related evidence for validity refers to the degree to which test scores agree with an external criterion.
As such, it is related to external validity. It examines the relationship between an assessment and another measure
of the trait (McMillan, 2007). There are three types of criteria (Nitko & Brookhart, 2011): (1) achievement test scores; (2) ratings, grades and other numerical judgements made by the teacher; and (3) career data. An achievement test measures the extent to which learners have acquired knowledge about a particular topic or area, or mastered certain skills at a specific point in time as a result of planned instruction or training. A summative test on “earth and space”
given to Grade 10 students in science at the end of the first quarter can serve as a criterion. A readiness test in earth
science can be compared to the results of the periodical test through correlation. Established personality inventories
can be used as criteria to validate newly developed personality measures. Results of vocational interest surveys can
be compared to career data to determine validity.

Criterion-related evidence is of two types: concurrent validity and predictive validity. Concurrent validity
provides an estimate of a student’s current performance in relation to a previously validated or established measure.
For instance, a school has developed a new intelligence quotient (IQ) test. Results from this test are statistically correlated with the results from a standard IQ test. If statistical analysis reveals a strong correlation between the two
sets of scores, then there is high criterion validity. It is important to mention that data from the two measures are
obtained at about the same time.

Predictive validity pertains to the power or usefulness of test scores to predict future performance. For
instance, can scores in the entrance/admission test (predictor) be used to predict college success (criterion)? If there
is a significantly high correlation between entrance examination scores and first-year grade point average (GPA) assessed a year later, then there is predictive validity. The entrance examination was a good measurement tool for student selection.

Similarly, an aptitude test given to high school students is predictive of how well they will do in college. The criterion
is GPA. Educators may be interested in exploring other criteria like job proficiency ratings, annual income, or twenty-
first century skills like citizenship.
In testing correlations between two data sets for both concurrent and predictive validity, the Pearson correlation coefficient (r) or Spearman's rank-order correlation may be used. The square of the correlation coefficient (r²) is called the coefficient of determination. In the previous example where the entrance examination is the predictor and the college GPA is the criterion, a coefficient of determination r² = 0.25 means that 25% of the variation in students' college grades may be explained by their scores in the examination. It also implies that there are other factors that contribute to the criterion variable. Teachers may then look into other variables like study habits and motivation.
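A minimal sketch of how this correlation evidence might be computed follows, assuming a small set of fabricated entrance-exam scores and first-year GPAs (the numbers are for illustration only, not real data).

# Sketch of predictive-validity evidence: correlate entrance-exam scores (predictor)
# with first-year GPA (criterion), then report the coefficient of determination.
from statistics import mean

entrance = [82, 75, 90, 68, 85, 78, 92, 70]          # hypothetical admission test scores
gpa      = [2.8, 2.5, 3.4, 2.2, 3.0, 2.6, 3.6, 2.4]  # hypothetical first-year GPAs

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

r = pearson_r(entrance, gpa)
print(f"r = {r:.2f}, coefficient of determination r^2 = {r * r:.2f}")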

C. Construct-Related Evidence
A construct is an individual characteristic that explains some aspect of behavior (Miller, Linn &
Gronlund,2009). Construct-related evidence of validity is assessment of the quality of the instrument used. It
measures the extent to which the assessment is a meaningful measure of an unobservable trait or characteristic
(McMillan,2007). There are three types of construct-related evidence: theoretical, logical and statistical
(McMillan,2007).
A good construct has theoretical basis. This means that the construct must be operationally defined or
explained explicitly to differentiate it from other constructs. Confusion in the meaning of the construct will render
assessment results and inferences dubious. Motivation, for instance, is a latent variable. In measuring students' motivation to learn, it is only right to ask the question, “What makes students want to learn in school?” Does motivation here pertain to the interest or satisfaction shown by students in doing a particular set of tasks or the reason why they
are actively engaged in the tasks? Conley, Karabenick & Arbor (2006), in their study, measured students’ motivation
to learn through a survey that included measures of self-efficacy for learning, task value, and students’ personal
achievement goals, focusing on the domain of Mathematics. To get substantive evidence of construct validity, internal relations among the items in the instrument were sought. Through a statistical process called factor analysis, they found that data on task value and achievement goals were consistent with the theory of the construct.
In 1955, Lee Cronbach and Paul Meehl insisted that to provide evidence of construct validity, one has to
develop a nomological network. It is basically a network of laws that includes the theoretical framework of the
construct, an empirical framework of how it is going to be measured, and specifications or linkages between the two
frameworks (Trochim, 2006). Unfortunately, the nomological network does not provide a practical approach to
assessing construct validity.
Construct validity can take the form of a differential-groups study. For instance, a test on problem-solving strategies is given to two groups of students – those specializing in Math and those specializing in Social Studies. If
the Math group performs better than the other group, then there is evidence of construct validity. Another form is an
intervention study wherein a test is given to a group of students who are weak in problem-solving strategies. After
being taught the construct, the same group of students are again assessed. If there is significant increase in test
scores, the results support the construct validity of the test. These are logical evidences of construct validity.
There are two methods of establishing construct validity: convergent and divergent validation. On the one
hand, convergent validity occurs when measures of constructs that are related are in fact observed to be related.
Divergent (or discriminant) validity, on the other hand, occurs when constructs that are unrelated are in reality observed not to be related. Let's consider a test administered to measure knowledge and skills in geometrical reasoning, which is different from a reading construct. Hence, comparison of test results from these two constructs will show a lack of commonality. This is what is meant by discriminant validity. Convergent validity is obtained when test scores in geometrical reasoning turn out to be highly correlated with scores from another geometry test measuring the same construct. These construct-related evidences rely on statistical procedures. In 1959, Campbell and Fiske developed a statistical approach called the Multitrait-Multimethod Matrix (MTMM) – a table of correlations arranged to facilitate the assessment of construct validity, integrating convergent and divergent validity (Trochim, 2006). In contrast to the nomological network, the MTMM employs a methodological approach. However, it was still difficult to implement.

McMillan (2007) recommends, for practical purposes, the use of clear definitions and logical analysis as construct-related evidence.
Unified Concept of Validity
In 1989, Messick proposed a unified concept of validity based on an expanded theory of construct validity
which addresses score meaning and social values and test interpretation and test use. His concept of unified validity
“integrates considerations of content, criteria and consequences into a construct framework for the empirical testing
of rational hypotheses about score meaning and theoretically relevant relationships” (Messick, 1995, p.741). He presented six distinct aspects of construct validity: content, substantive, structural, generalizability, external and consequential. The descriptions that follow are based on the capsule descriptions given by Messick (1995). The content aspects are parallel to content-related evidence, which calls for content relevance and representativeness. The substantive aspects pertain to the theoretical constructs and empirical evidences. The structural aspects assess how well the scoring structure matches the construct domain. The generalizability aspects examine how score properties and interpretations generalize to and across population groups, contexts and tasks. This is called external validity. Criterion-related evidence for validity is related to external validity as the criterion may be an externally-
defined gold standard. The external aspects include convergent and discriminant evidences taken from Multitrait-
Multimethod studies. Finally, consequential aspects pertain to the intended and unintended effects of assessments
on teaching and learning.
In view of consequential validity, Messick (1995) explained that the social consequences of testing may be positive if it leads to improved educational policies, or negative if beset with bias in scoring and interpretation or unethical use. The study made by Conley, Karabenick & Arbor (2006) on students' motivation to learn (math) contained a discussion of the ways in which data had been reported and used, as consequential evidence. They wanted to show that students are motivated in different ways and, as such, different teacher interventions are needed. Hence, they wrote that an intended consequence of the use of the scores in their study was to change the content of the school's professional development and action plans. Below is an excerpt (Conley, Karabenick & Arbor, 2006, p.11):
Algebra 1A students started the year with the lowest scores in interest, mastery and efficacy (e.g., they saw
math as less interesting than other math students, they were less likely to focus on understanding, and were the least
confident in their math ability). However, they saw math as just as useful as other students in the school and had
similar levels of focus competition.
• Change – Algebra 1A students had a more adaptive pattern of change than other students at this school; the
drops were generally smaller than for students in other courses. They saw math as less useful and less
focused on learning but slightly more confident in their math ability.
• Goals for the next year – Help students see how math is useful, and more importantly, help focus students
on learning and developing (rather than just demonstrating) ability.
As for alternative assessments, Sambell, McDowell & Brown (1997) wrote that such assessments (which
include performance tasks) appear to have strong consequential validity because they incorporate meaningful,
engaging and authentic tasks. Moreover, students are actively involved in the assessment process. In their study,
student respondents perceived alternative assessment to be an accurate measure of student learning. They related
alternative assessment to authentic tasks, hence found them beneficial and rewarding in the long run. Their
perceptions of conventional assessment were quite negative because the student respondents viewed testing as
divorced from effective learning.
Positive consequences were explained well by McMillan (2007). These consequences pertain to how
assessment directly impact students and teachers. Positive consequences on students include the effects of
assessment on how they study, how they are motivated and how they relate to their teachers. Teachers who often
provide tests on identification and enumeration encourage memorization. Teachers who provide problem-solving
activities and performance assessments compel students to hone their skills. When students are informed of the standards and how they are to be assessed and graded, they assume a shared responsibility for their learning. When
assessment is authentic and done with fairness, assessment can positively impact their motivation. Teacher-student
relationship is also strengthened when teachers provide feedback to students concerning their performances.

Teachers are likewise affected by the nature of the assessment they give. If assessment calls for recall of facts, teachers tend to resort to the lecture method. However, if reasoning and skills are to be assessed, teachers devise plenty of classroom learning experiences and situations that would call for knowledge application and higher-order thinking skills. Oral questioning and performance tasks become typical methods for acquiring evidence of student learning. The methods of assessment used by teachers mirror the kind of educators they are. Traditional teachers lecture all day long and teach to the test; for them, the purpose of assessment is to report grades. However, twenty-first century teachers design alternative assessments that target multiple intelligences. This is possible by utilizing more authentic assessment tools such as portfolios, projects, and journals, among others. For them, the purpose of assessment is to reinforce learning and measure success.
Validity of Assessment Methods
In the previous sections, validity of traditional assessments was discussed. What about other assessment
methods? The same validity concepts apply.
Developing performance assessments involves three steps: define the purpose, choose the activity and
develop criteria for scoring. The first step is about determining the essential skills students need to develop and the
content worthy of understanding. To acquire validity evidence in terms of content, performance assessments should
be reviewed by qualified content experts. In choosing the activity, Moskal (2003) laid down five recommendations. These are intrinsically associated with the validity of assessment.
1. The selected performance should reflect a valued activity.
2. The completion of performance assessments should provide a valuable learning experience.
3. The statement of goals and objectives should be clearly aligned with the measurable outcomes of the
performance activity.
4. The task should not examine extraneous or unintended variables.
5. Performance assessments should be fair and free from bias.
In scoring, a rubric or rating scale has to be created. Teachers must exercise caution because distracting factors like a student's handwriting and the neatness of the product affect rating. Additionally, personal idiosyncrasies infringe on the objectivity of the teacher/rater, which lowers the validity of the performance assessment.
In controlled conditions, oral questioning has high validity. Nonetheless, a checklist defining the outcomes to be covered and the standards/criteria to be achieved can help ensure the validity of the assessment. It is recommended that, for summative purposes, there should be a structured or standard list of questions.
For observations, operational and response definitions should accurately describe the behavior of interest. It
is highly valid if evidence is properly recorded and interpreted. Additionally, validity is stronger if additional assessment
strategies are used with observation like interviews, surveys and quantitative methods like tests. In qualitative
research, this is called triangulation – a technique to validate results through cross-verification from two or more
sources.
Validity in self-assessment is described by Ross (2006) as the agreement of self-assessment ratings with teacher judgements or peer rankings. Studies on the validity of self-assessments have mixed results. In many cases, student self-assessment scores are higher compared to teacher ratings (Ross, 2006). This is especially true for younger learners because of cognitive biases and wishful thinking that lead to distorted judgements. Self-assessment ratings also tend to be inflated when these contribute to the students' final grades. To increase validity, students should be informed of the domain in which the task is embedded. They should be taught how to objectively assess their work based on clearly defined criteria and to set aside any self-interest bias. Informing students that their self-assessment ratings will be compared to those made by their teacher and peers may also induce them to make accurate assessments of their own performance.
No single type of instrument or method of data collection can assess the vast array of learning and development outcomes in a school program (Miller, Linn & Gronlund, 2009). For this reason, teachers should use a wide range of assessment tools to build a complete profile of students' strengths, weaknesses and intellectual achievements. The teacher may opt to give multiple-choice questions to assess knowledge, understanding and application of theories. However, selected-response tests do not provide students the opportunity to practice and

demonstrate their writing skills. Hence, these should be balanced with other forms of assessment like essays. Additionally, direct methods of assessment should be coupled with indirect methods such as student surveys, interviews and focus groups. While direct assessment examines actual evidences of student outcomes, indirect assessment gathers perceptions or feedback from students or other persons who may have relevant information about the quality of the learning process.
Threats to Validity
Miller, Linn & Gronlund (2009) identified ten factors that affect the validity of assessment results. These are factors in the construction of assessment tasks that would render assessment inferences inaccurate. The first four factors apply to traditional tests and performance assessments. The remaining factors concern brief constructed-response and selected-response items.
1. Unclear test directions
2. Complicated vocabulary and sentence structure
3. Ambiguous statements
4. Inadequate time limits
5. Inappropriate level of difficulty of test items
6. Poorly constructed test items
7. Inappropriate test items for outcomes being measured
8. Short test
9. Improper arrangement of items
10. Identifiable pattern of answers
McMillan (2007) laid down his suggestions for enhancing validity. These are as follows:
• Ask others to judge the clarity of what you are assessing.
• Check to see if different ways of assessing the same thing give the same result.
• Sample a sufficient number of examples of what is being assessed.
• Ask others to judge the match between the assessment items and the objective of the assessment.
• Compare groups known to differ on what is being assessed.
• Compare scores taken before to those taken after instruction.
• Compare predicted consequences to actual consequences.
• Compare scores on similar, but different traits.
• Provide adequate time to complete the assessment.
• Ensure appropriate vocabulary, sentence structure and item difficulty.
• Ask easy questions first.
• Use different methods to assess the same thing.
• Use only for intended purposes.

Reliability
Reliability talks about reproducibility and consistency in methods and criteria. An assessment is said to be
reliable if it produces the same results when given to an examinee on two occasions. It is important then to stress that reliability pertains to the obtained assessment result and not to the test or any other instrument. Another point is that reliability is unlikely to turn out 100% because no two tests will consistently produce identical results. Even the same test administered to the same group of students after a day or two will have some differences. There are environmental
factors like lighting and noise that affect reliability. Student error and physical well-being of examinees also affect
consistency of assessment results.
For a test to be valid, it has to be reliable. Let us look at an analogous situation. For instance, a weighing scale
is off by 6 pounds. You weighed a dumbbell for seven consecutive days. The scale revealed the same measurement,
hence the results are reliable. However, the scale did not provide an accurate measure and therefore is not valid.

From the foregoing, reliability is a necessary condition for validity but not a sufficient one. A test can be found reliable, but that does not imply that the test measures what it purports to measure.
Reliability is expressed as a correlation coefficient. A high reliability coefficient denotes that if a similar test is
readministered to the same group of students, test results from the first and second testing are comparable.
Types of Reliability
There are two types of reliability: Internal and External reliability. Internal reliability assesses the consistency
of results across items within a test whereas external reliability gauges the extent to which a measure varies from
one use to another.
Sources of Reliability Evidence
In terms of sources of reliability evidence, there are five classes: evidence based on stability, evidence based on equivalent forms, evidence based on internal consistency, evidence based on scorer or rater consistency, and evidence based on decision consistency.
A. Stability
The test-retest reliability correlates scores obtained from two administrations of the same test over a
period of time. It is used to determine the stability of test results over time. It assumes that there is no
considerable change in the construct between the first and second testing. Hence, timing is critical because
characteristics may change if the time interval is too long. A short gap between the testing sessions is also not recommended because subjects may still recall their responses. Typically, test-retest reliability
coefficients for standardized achievement and aptitude tests are between 0.80 and 0.90 when the interval
between testing is 3 to 6 months (Nitko & Brookhart, 2011). Note that a reliability coefficient is an index of
reliability.
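As an illustration of stability evidence, the short sketch below correlates scores from two hypothetical administrations of the same test; it assumes SciPy is available for the Pearson correlation.

# Sketch of test-retest (stability) evidence: correlate scores from a first and a second
# administration of the same test. The scores are hypothetical.
from scipy.stats import pearsonr

first_administration  = [35, 28, 42, 30, 38, 25, 40, 33]
second_administration = [37, 27, 44, 31, 36, 26, 41, 35]

r, _ = pearsonr(first_administration, second_administration)
print(f"Test-retest reliability coefficient: {r:.2f}")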

B. Equivalence
Parallel-forms reliability ascertains the equivalency of forms. In this method, two different forms of an assessment tool are administered to the same group of individuals. However, the items are parallel, i.e., they probe the same construct, base knowledge or skill. The two sets of scores are then correlated in order to evaluate the consistency of results across alternate versions. Equivalent forms are ideal for makeup tests and for action research that would utilize pre- and post-tests. An equivalent test is not just a matter of rearranging the items. New and different items must be thought of, but they must measure the same construct. This is where the difficulty lies. For a specific skills test like addition of signed numbers, it would be relatively easy. However, for complex or subjective constructs, it would require time and effort to prepare. Moreover, it is rather impractical to subject students to answer two forms of tests.

C. Internal Consistency
Internal consistency implies that a student who has mastery of the learning will get all or most of the items correctly, while a student who knows little or nothing about the subject matter will get all or most of the items wrong. To check for internal consistency, the split-half method can be used. The split-half method is done by dividing the test into two – separating the first half and the second half of the test, or splitting it by odd and even numbers – and then correlating the results of the two halves. The Spearman-Brown formula is applied. It is a statistical correction to estimate the reliability of the whole test and not of each half of the test.

Whole test reliability = (2 × reliability on ½ test) / (1 + reliability on ½ test)

For instance, assume that the correlation between the two half scores is 0.70. The Spearman-Brown reliability estimate for the full-length test is 0.82.
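The split-half procedure and the Spearman-Brown correction can be sketched as follows; the small response matrix is hypothetical, and SciPy is assumed to be available for the correlation.

# Sketch of split-half reliability with the Spearman-Brown correction.
# Each row is one student's item scores (1 = correct, 0 = wrong); the data are hypothetical.
from scipy.stats import pearsonr

scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
]

odd_half  = [sum(row[0::2]) for row in scores]   # totals on items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in scores]   # totals on items 2, 4, 6, 8

half_r, _ = pearsonr(odd_half, even_half)
whole_test_r = (2 * half_r) / (1 + half_r)       # Spearman-Brown formula
print(f"Half-test r = {half_r:.2f}, whole-test reliability = {whole_test_r:.2f}")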
Split-half is effective for large questionnaires or tests with several items measuring the same construct. To improve the reliability of the test employing this method, items with low correlations are either removed or modified.

There are two other ways to establish internal consistency: Cronbach alpha and Kuder-Richardson (KR) 20/21. The Cronbach alpha is a better measure than split-half because it gives the average of all split-half reliabilities. It measures how well items in a scale (e.g., 1 = strongly disagree to 5 = strongly agree) correlate with one another (Salvucci, et al., 1997). The Kuder-Richardson 20/21 formula is applicable to dichotomous items (0/1). Items of the test are scored 1 if marked correctly, otherwise zero.
For internal consistency, the range of reliability measures are rated as follows: less than 0.50 – the
reliability is low; between 0.50 and 0.80 – reliability is moderate; and greater than 0.80 – the reliability is high (Salvucci,
et al., 1997).
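For dichotomously scored items, KR-20 can be computed directly from the response matrix, as in the minimal sketch below (the responses are hypothetical).

# Minimal sketch of the Kuder-Richardson 20 (KR-20) coefficient for items scored 1/0.
# Rows are students, columns are items; the data are hypothetical.
from statistics import pvariance

responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
]

k = len(responses[0])                        # number of items
totals = [sum(row) for row in responses]     # each student's total score
var_total = pvariance(totals)                # variance of the total scores

pq_sum = 0.0
for i in range(k):
    p = sum(row[i] for row in responses) / len(responses)  # proportion correct on item i
    pq_sum += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - pq_sum / var_total)
print(f"KR-20 = {kr20:.2f}")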
D. Scorer or Rater Consistency
People do not necessarily rate in a similar way. They may have disagreement as to how responses or
materials truly reflect or demonstrate knowledge of the construct or skill being assessed. More so, certain
characteristics of the raters contribute error, like bias, the halo effect, mood, fatigue and others. Bias is partiality – playing favorites or displaying prejudice in favor of or against a student or group. Hence, a teacher playing favorites in class can be swayed to give high marks to students based on his/her preferences or liking, and not on the students'
performance. The halo effect is a cognitive bias, allowing first impressions to color one’s judgement of another
person’s specific traits. For instance, when a teacher finds a student personable or well-behaved, this may lead the
teacher to assume that the student is also diligent and participative before he/she can actually observe or gauge
objectively. The halo effect can be traced back to Edward Thorndike's 1920 study entitled “A Constant Error in Psychological Ratings”. The problem with the halo effect is that teachers who are unaware of it may not have any idea that it is already occurring in their judgements.

Just as several items can improve the reliability of a standardized test, having multiple raters can increase
reliability. Inter-rater reliability is the degree to which different raters, observers or judges agree in their assessment
decisions. To mitigate rating errors, a wise selection and training of good judges and use of applicable statistical
techniques are suggested.

Inter-rater reliability is useful when grading essays, writing samples, performance assessments and portfolios. However, certain conditions must be met: there should be good variation in the products to be judged, the scoring criteria must be clear, and the raters must be knowledgeable or trained in how to use the observation instrument (McMillan, 2007).

To estimate inter-rater reliability, Spearman's rho or Cohen's kappa may be used to calculate the correlation coefficient between or among the ratings. The first is used for ordinal data while the other is for nominal and discrete data.
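A brief sketch of both statistics is given below for two hypothetical raters; Spearman's rho is applied to their ordinal essay scores and Cohen's kappa to their categorical pass/fail decisions. SciPy and scikit-learn are assumed to be installed.

# Sketch of inter-rater reliability for two raters: Spearman's rho for ordinal scores,
# Cohen's kappa for categorical decisions. All ratings are hypothetical.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

rater_a_scores = [12, 15, 18, 20, 14, 17]   # essay scores from Rater A
rater_b_scores = [10, 14, 16, 19, 11, 13]   # essay scores from Rater B
rho, _ = spearmanr(rater_a_scores, rater_b_scores)
print(f"Spearman's rho = {rho:.2f}")

rater_a_marks = ["pass", "pass", "fail", "pass", "fail", "pass"]
rater_b_marks = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohen_kappa_score(rater_a_marks, rater_b_marks)
print(f"Cohen's kappa = {kappa:.2f}")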

E. Decision Consistency
Decision consistency describes how consistent the classification decisions are rather than how consistent the scores are (Nitko & Brookhart, 2011). Decision consistency is seen in situations where teachers decide who will receive a passing or failing mark, or who is considered to possess mastery or not.

Let us consider the levels of proficiency adopted in the Philippines at the onset of the K to 12 program. In reporting the grades of K to 12 learners at the end of each quarter, the performance of students is described based on the following levels of proficiency: Beginning (74% and below); Developing (75% - 79%); Approaching Proficiency (80% - 84%); Proficient (85% - 89%); and Advanced (90% and above). Suppose two students received marks of 80 and 84; then they are both regarded as having 'approaching proficiency' in the subject, despite the numerical difference in their grades and even though 84 is just a percentage point shy of 85. The teacher, on the basis of the foregoing classification, can infer that both students have not reached the proficient or advanced levels. If decisions are consistent, then these students would still be classified in the same level regardless of the type of assessment method used.
McMillan (2007) gives a similar explanation: matching of classifications is done by comparing scores from two administrations of the same test. Suppose students were judged as beginning, proficient, or advanced. In a class of 20 students, results in the first testing showed that 5 are at the beginning level, 10 at the proficient level, and 5 are advanced. In the second testing involving the same students, 3 of 5 students who were previously evaluated as

beginners are still in the same level, all 10 proficient students remain in that level, and 4 of 5 students were able to retain the advanced level of proficiency. Hence, there are 17 matches, which is equivalent to 85% consistency.
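The matching procedure can be expressed as a short sketch: count how many students receive the same classification on both administrations. The classifications below are hypothetical and constructed only to mirror the 17-of-20 example above.

# Sketch of decision consistency: proportion of students placed in the same proficiency
# level on two administrations. The classifications are hypothetical.
first  = ["beginning"] * 5 + ["proficient"] * 10 + ["advanced"] * 5
second = (["beginning"] * 3 + ["proficient"] * 2          # 3 of 5 beginners stay
          + ["proficient"] * 10                            # all 10 proficient stay
          + ["advanced"] * 4 + ["proficient"] * 1)         # 4 of 5 advanced stay

matches = sum(a == b for a, b in zip(first, second))
consistency = matches / len(first)
print(f"{matches} matches out of {len(first)} students = {consistency:.0%} decision consistency")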

Measurement Errors

An observation is composed of the true value plus measurement error. Measurement errors can be caused by examinee-specific factors like fatigue, boredom, lack of motivation, momentary lapses of memory and
carelessness. Consider a student who lacked sleep for one reason or another. He/she may perform poorly in an
examination. His/her physical condition during the assessment does not truly reflect what he/she knows and can do.
The same can be said for another student who may have attained a high score in the same examination and yet most
of his/her responses were made through guessing. Test-specific factors are also causes of measurement error.
Teachers who provide poor or insufficient directions may likely cause students to answer test items differently and
inaccurately. Ambiguous questions would only elicit vague and varied responses. Trick questions purposely intended
to deceive test takers will leave students perplexed rather than enlightened. Aside from examinee and test-specific
factors, errors can arise due to scoring factors. Inconsistent grading systems, carelessness and computational errors
lead to imprecise or erroneous student evaluations.

Classical test theory gives the formula X = T + E, where X is the observation – a measured score like a student's performance in a test, T is the true value, and E is some measurement error. The error component includes random error and systematic error, both of which impact validity. To illustrate, suppose you measure your mass three times using a scale and get slightly different values; then there are random errors. However, if the scale consistently gives measurement values that are higher by 2 grams, then the errors are systematic.

Random errors, called noise, produce random fluctuations in measurement scores. These are errors
associated with situations like outside noise during the conduct of a test, subjective scoring, poor instructions,
ambiguous questions and guessing. Random error does not alter average performance of a group but it affects the
variability around the average. The standard error of measurement (SEM) is an index of the expected variation of the
observed scores due to measurement error. It estimates how a student's test scores tend to be distributed around his/her true score had the student been tested repeatedly using the same instrument. Better reliability means a lower SEM. SEM pertains to the standard deviation of the measurement error associated with test scores. From the test scores of a set of students, the mean, standard deviation (Sₓ) and test score reliability (rₓₓ) are obtained. SEM is computed using the formula SEM = Sₓ√(1 − rₓₓ). A student's true score may then be estimated by relating the SEM to the normal curve. Because it is impractical to administer tests repeatedly and impossible to hold certain variables constant (no new learning or recall of questions), the observed score (X) can be regarded as the mean. The SEM is subtracted from and added to it to get a range, called the confidence interval or score band, where the true score lies. The score band
X ± SEM gives a reasonable limit for estimating the true score. For instance, scores in an achievement test revealed a standard deviation of 6.33 and a Cronbach alpha reliability estimate of 0.90. The calculated SEM is 2. Suppose a student receives a score of 32; how do we make an interpretation about his/her true score? We can say that with 68% confidence, the student's true score lies within one SEM of 32, i.e., between 30 and 34 (X ± SEM); we are 95% confident that his/her true score falls between 28 and 36 (X ± 2SEM); and 99% confident that the true score is in the 26 to 38 range (X ± 3SEM).
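The SEM and the score bands from this example can be reproduced with a few lines of code; the standard deviation, reliability, and observed score below are the figures used in the text.

# Sketch of the standard error of measurement (SEM) and score bands, using the figures
# from the example in the text (SD = 6.33, reliability = 0.90, observed score = 32).
sd = 6.33
reliability = 0.90
observed = 32

sem = sd * (1 - reliability) ** 0.5          # SEM = Sx * sqrt(1 - rxx)
print(f"SEM = {sem:.2f}")

for n_sem, confidence in [(1, "68%"), (2, "95%"), (3, "99%")]:
    low, high = observed - n_sem * sem, observed + n_sem * sem
    print(f"{confidence} score band: {low:.0f} to {high:.0f}")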

Systematic errors are referred to as bias. They tend to shift all measurements in a systematic way, consequently displacing the mean. While random errors can be reduced by averaging multiple measurements, systematic errors can be reduced by identifying and removing the errors at their source.

Reliability indicates the extent to which scores are free from measurement errors. As pointed out,
lengthening or increasing the number of items in a test can increase reliability. Teacher-made summative tests are
more reliable compared to informal observations conducted over a limited period of time. To increase reliability, an ample amount of observation is needed to detect patterns of behavior.

Reliability of Assessment Methods

Between a well-constructed objective test and a performance assessment, the former has better reliability (Miller, Linn & Gronlund, 2009; Harris, 1997). Performance assessment is said to have low reliability because of judgmental scoring. Inconsistent scores may be obtained depending on the raters. This may be due to inadequate

training of raters or inadequate specification of the scoring rubrics (Harris, 1997). Additionally, in a performance assessment, there is a limited sampling of course content. But as Harris (1997) explained, constraining the domain coverage or structuring the responses may raise the reliability of performance assessment. Reliable scoring of performance assessments can be enhanced by the use of analytic and topic-specific rubrics complemented with exemplars and/or rater training (Jonsson & Svingby, 2007).

As for oral questioning, suggestions for improving the reliability of written tests may also be extended to oral examinations, like increasing the number of questions, the response time and the number of examiners, and using a rubric or marking guide that contains the criteria and standards.

To reliably measure student behavior, observation instruments must be comprehensive enough to adequately sample occurrences and non-occurrences of behavior, but still manageable to conduct. Direct observation data, according to Hintze (2005), can be enhanced through inter-observer agreement and inter-observer reliability. The first pertains to the consistency of observation data collected by multiple teachers or observers, while the other refers to the consistency of data collected on a behavior multiple times by a single teacher or observer.

Self-assessment, according to Ross (2006), has high consistency across tasks, across items and over short time periods. This is especially true if self-assessments are done by students who have been trained in how to evaluate their work. Greater variations are expected with younger children.

Below are ways to improve reliability of assessment results (Nitko & Brookhart, 2011).
• Lengthen the assessment procedure by providing more time, more questions, and more observation
whenever practical.
• Broaden the scope of the procedure by assessing all the significant aspects of the target learning performance.
• Improve objectivity by using a systematic and more formal procedure for scoring student performance. A
scoring scheme or rubric would prove useful.
• Use multiple markers by employing inter-rater reliability.
• Combine results from several assessments especially when marking crucial educational decisions.
• Provide sufficient time to students in completing the assessment procedure.
• Teach students how to perform their best by providing practice and training to students and motivating them.
• Match the assessment difficulty to the students' ability levels by providing tasks that are neither too easy nor
too difficult, and tailoring the assessment to each student's ability level when possible.
• Differentiate among students by selecting assessment tasks that distinguish or discriminate the best from the
least able students.

Note that some of the suggestions on improving reliability overlap with those concerning validity. This is
because reliability is a precursor to validity. However, it is important to know that high reliability does not ensure a
high degree of validity.

EXPLORE

Activity 1: ASSESSMENT SCENARIOS

For each of the following situations, determine whether the assessment is valid. Explain your answer in
two or three sentences citing the type of validity.

Scenario 1:

Test constructors in a secondary school designed a new measurement procedure to measure intellectual ability. Compared to well-established measures of intellectual ability, the new test is shorter to reduce the arduous effect of a long test on students. To determine its effectiveness, a sample of students accomplished two tests – a standardized intelligence test and the new test – with only a few days' interval. Results from both assessments revealed a high correlation.

Scenario 2:

After the review sessions, a simulated examination was given to graduating students a few months before the Licensure Examination for Teachers (LET). When the results of the LET came out, the review coordinator found out that the scores in the simulated (mock) examination were not significantly correlated with the LET scores.

Scenario 3:

A new test was used as a qualifying examination for Secondary Education freshmen who would like to major in Biological Science. The test was developed to measure students' knowledge of Biology. The test was then administered to two groups of sophomores: those specializing in Social Studies and those already majoring in Biological Science. It was hypothesized that the latter group would score better in the assessment procedure. The results indicated so.

Scenario 4:

A science teacher gave a test on volcanoes to Grade 9 students. The test covered the types of volcanoes, volcanic eruptions and energy from volcanoes. The teacher was only able to cover extensively the first two topics. Several items were included on volcanic energy and how energy from volcanoes may be tapped for human use. The majority of her students got low marks.

Scenario 5:

A teacher handling “Media and Information Literacy” prepared a test on “Current and Future Trends on Media and Information”. Topics include massive open online content, wearable technology, 3D environment and ubiquitous learning. Below are the learning competencies:

The learners should be able to:

a. Evaluate current trends in media and information and how they will affect individuals and society in general;
b. Describe massive open online content;
c. Predict future media innovation;
d. Synthesize overall knowledge about media and information skills for producing a prototype of what the
learners think is a future media innovation.

The teacher constructed a two-way table of specifications indicating the number of items for each topic. The test items target the remembering, understanding and applying levels.

ACTIVITY 2: ASSESSMENT SCENARIOS

For each of the following situations, determine the source of reliability evidence and type of reliability
coefficient. Explain your answer in two to three statements.

Scenario 1:
For a sample of 150 Grade 10 students, a Science test on Living Things and Their Environment was tested for reliability by comparing the scores obtained on the odd-numbered and even-numbered items.

Scenario 2:

Below is the table containing ratings of two teachers on the paper submitted by six Grade 9 students
about their “personal missions in life”. In rating the students’ paper, a rubric was developed.

Student Rater A Rater B Rank (A) Rank (B)

A 14 8 5 6

B 15 12 4 4

C 18 15 2 2

D 20 16 1 1

E 12 10 6 5

F 17 14 3 3

Mean = 16 Mean = 12.5 rₓ = 0.94

SD = 2.9 SD = 3.1

Scenario 3:

Scores from 100 Grade 7 students were obtained from a November administration of a test in Filipino about panghalip na panao or personal pronouns. These were compared to the scores of the same group from a September administration of the same year.

Scenario 4:

Ms. Castro, a 5th-grade Social Studies teacher, wanted to find out whether her first-quarter long test was equivalent to her first-quarter test in the same subject last year. Thus, she administered both tests to her students.

Scenario 5:

Mr. Legarse, a 6th-grade Science teacher directed his students to create a portfolio of their work
accomplished during the last quarter for evaluation. Using the portfolios, he classified his students into four groups: Beginning, Developing, Competent and Accomplished. The results were then compared to the students' performance
in the end-of-quarter examination.

ACTIVITY 3: PRO CON GRID

Complete the table containing the advantages and disadvantages of each source of reliability evidence.

Source PROS and CONS


Pros: Only one test is prepared.
Stability
(Test-retest

Cons: students may remember how they responded to items on the first
administration of the test and simply answer correspondingly, thus
overestimating the reliability.

Pros:
Equivalence
(Parallel forms)

Cons:

Pros:
Internal Consistency
(Split-half)

Cons:

Pros:
Score Consistency
(Inter-rater agreement)

Cons:

Pros:
Decision Consistency

Cons:

APPLY
Name: ______________________________________________ Date: ___________________________

CRITIQUING
The learning outcome for this chapter is to cite evidences for validity and reliability in teacher-made tests.
For this application, request for a copy of an assessment plan, table of specifications and test from a grade school or
high school teacher. Below is a summary of sources of validity (Cizek, 2009). Accomplish the table and write your
comments. Attach the assessment evidences. Show your work to your teacher.

Source of Evidence          Validity Questions                                                          Response

Content                     What content domain is covered by the test?

Alignment                   Is the test content drawn exclusively from the intended domain?
                            Do the test items or tasks match the content standards?

Item/Task Construction      Are the test questions or tasks that students are asked to perform
                            clear? Are directions clear?

Administration Controls     Are the conditions under which students are tested conducive to
                            them performing their best? Are time constraints appropriate?

Scoring                     Was the test scored appropriately?

Opportunity to Learn        Did all students have sufficient opportunities to learn the knowledge
                            or skills being tested?

ASSESS

Name: ______________________________________________ Date: ___________________________

TASK 1: SCENARIO-BASED/PROBLEM-SOLVING LEARNING


As the department head or principal, what action would you take on the following matters? Provide your recommendations based on the principles of validity and reliability.

Scenario 1:
Mr. Roa taught the different elements and principles of art. After instruction, he administered a test about prominent painters and sculptors in the 20th century.

Would you recommend revisions? Why?


____________________________________________________________________________________________
____________________________________________________________________________________________
____________________________________________________________________________________________
____________________________________________________________________________________________

Scenario 2:
In a geometry class, the learners have to calculate perimeters and areas of plane figures like triangles,
quadrilaterals and circles. The teacher decided to use alternative assessment rather than tests. Students came up
with Mathematics portfolios containing their writing about geometry.

What would you tell the teacher? Why?


____________________________________________________________________________________________
____________________________________________________________________________________________
____________________________________________________________________________________________
____________________________________________________________________________________________

Scenario 3:
There are two available assessment instruments to measure English skills in grammar and vocabulary.
Test A has high validity but no information concerning its reliability. Test B was found to have a high reliability index
but no information about its validity. Which one would you recommend? Why?
____________________________________________________________________________________________
____________________________________________________________________________________________
____________________________________________________________________________________________
____________________________________________________________________________________________

TASK 2: CRITIQUING
Below is a table of specifications for a Grade 3 Science test. Study the ToS. Answer the questions that follow.

Learning Outcomes / Topic / Allocated Time / Assessment Method / No. of Items / Point System

1. Describe different objects based on their characteristics.
   Topic: Characteristics of solids, liquids and gases; Allocated Time: 3 hours; Assessment Method: Selected-response;
   No. of Items: 6 (Understanding); Point System: x1 = 6

2. Classify objects as solid, liquid or gas based on some observable characteristics.
   Topic: States of matter; Allocated Time: 3 hours; Assessment Method: Selected-response;
   No. of Items: 6 (Understanding); Point System: x1 = 6

3. Describe the ways on the proper use and handling of solid, liquid and gas found at home and in school.
   Topic: Use and handling of solid, liquid and gas; Allocated Time: 4 hours; Assessment Method: Essay;
   No. of Items: 3 (Applying and Analyzing); Point System: x3 = 9

4. Describe changes in materials based on the effect of temperature.
   Topic: Phase change; Allocated Time: 6 hours; Assessment Method: Selected-response;
   No. of Items: 4 (Understanding); Point System: x1 = 4

Total: 16 hours of allocated time; 19 items; 25 points

Questions:

1. Is the number of items in the Table of Specifications (ToS) proportionate to the number of hours spent for each objective? Why or why not?
__________________________________________________________________________________________

__________________________________________________________________________________________

2. If you have to be true to the ToS, for which objective should you have the most and the least number of items?
__________________________________________________________________________________________

__________________________________________________________________________________________

3. Do the specified assessment methods match the learning objectives?


__________________________________________________________________________________________

__________________________________________________________________________________________

4. Below are sample test items. Determine if each is a valid test item in accordance with the objectives (level of
learning and domain). Give your comments.

Objective 1:

It is a process that changes solid to liquid.

A. Evaporation B. Freezing C. Melting D. Sublimation

Objective 2:

Which of the following is in a liquid state of matter?


A. Ice B. Glass C. Orange juice D. Sugar

Objective 3:

How would you handle and store insecticides at home knowing that these are hazardous materials?

Objective 4:

When water vaporizes at room temperature, it changes into _________________.


A. fog B. ice C. steam D. water vapor

5. What suggestions do you have to possibly improve validity and reliability?

___________________________________________________________________________________________

___________________________________________________________________________________________

CHAPTER 5 Practicality and Efficiency
OVERVIEW
On top of planning, developing and organizing instruction and managing student conduct, teachers also
have the task of assessing student learning. Time and resources present themselves as issues. For this reason,
classroom testing should be “teacher friendly”. Test development and grading should be practical and efficient.
Practical means “useful”, that is, it can be used to improve classroom instruction and for outcomes assessment
purposes. It likewise pertains to judicious use of classroom time. Efficient, in this context, pertains to development,
administration and grading of assessment with the least waste of resources and effort.

Intended Learning Outcome (ILO)


At the end of Chapter 5, students are expected to:
• Describe how testing can be made more practical and efficient.

ENGAGE

Familiarity with the Method


While it is ideal that teachers utilize various methods of assessment, it would be a waste of time and resources if the teacher is unfamiliar with the method of assessment. The assessment may not satisfactorily realize the learning outcomes. The teacher may commit mistakes in the construction or format of the assessment. Consequently, the results would not be credible, and any inference arising from the assessment results would not be valid or reliable. This may be a problem for some teachers who have been accustomed to pencil-and-paper tests. As pointed out in Chapter 3, some learning targets are best achieved using non-test assessments. A rule of thumb is that simple learning targets require simple assessments, while complex learning targets demand complex assessments (Nitko & Brookhart, 2011).
From the foregoing, it is critical that teachers learn the strengths and weaknesses of each method of
assessment, how they are developed, administered and marked. You can see here how practicality and efficiency
are intertwined with other criteria of high-quality classroom assessment.
Time Required
It may be easier said than done, but a desirable assessment is short yet able to provide valid and reliable results. Considering that time is a commodity that many full-time teachers do not have, it is best that assessments are quick to develop, but not to the point of reckless construction. Assessments should allow students to respond readily but not hastily. Assessments should also be scored promptly but not without basis. It should be noted that time is a matter of choice: it hinges on the teacher's choice of assessment method. For instance, a multiple choice test may take time to prepare, but it can be accomplished by students in a relatively short period. Moreover, the test is easily and objectively scored, especially now with the aid of optical mark readers. However, there is still a need to revisit the first principle: "Is it appropriate for your learning targets?" An essay may be better on some occasions as students are allowed to express their ideas with relatively few restraints. Essay questions can be thought of easily, but essays require considerable time for students to organize their thoughts and express them in writing. It also consumes time for teachers to read and mark students' responses. Essay items are good for testing small groups of students, but this advantage decreases as class size grows. Meanwhile, performance assessments take a lot of time in preparation, student response and scoring, but they offer an opportunity to assess students on several learning targets and at different levels of performance. Care should be exercised, though, especially if performance assessments take away too much instructional time.
After considering the time issue, let us now discuss the assessments' reusability. A multiple choice test may take a substantial amount of time to prepare, but the items, when properly constructed, may be used again for different groups. Security is an issue in large-scale, high-stakes testing. Nonetheless, it is also a problem in classroom testing because it affects both validity and reliability. If test items were item-analyzed and the test itself was established to be valid and reliable evidence of student performance, then many of the items can be used in succeeding terms or periods as long as the same learning outcomes are targeted. Thus, it is critical that objective tests are kept secure so that teachers do not have to prepare an entirely new set of test questions every time; besides, much time and energy were spent on test construction. Maintaining a database of excellent test items means that tests can be recycled and reused.
One suggestion to improve reliability is to lengthen the assessment. The longer it is, the higher the reliability. McMillan (2007) claims that for an assessment to provide reliable results, it generally takes thirty to forty minutes for a single score on a short unit. He added that more time is required if separate scores are needed for subskills. He shared a general rule of thumb that six to ten objective items are required when assessing a concept or specific skill.
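The link between test length and reliability can be illustrated with the Spearman-Brown prophecy formula, a standard psychometric result rather than part of McMillan's cited discussion; the figures below are hypothetical. If a test is lengthened to k times its original number of comparable items, its estimated reliability becomes

r_k = (k x r_1) / (1 + (k - 1) x r_1)

For example, doubling a test with reliability r_1 = 0.60 gives r_2 = (2 x 0.60) / (1 + 0.60), or about 0.75, assuming the added items are of the same quality as the originals.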
Ease in Administration
Assessments should be easy to administer. To avoid questions during the test or performance task,
instructions must be clear and complete. Instructions that are vague will confuse the students and they will
consequently provide incorrect responses. This may be a problem with performance assessments that contain long
and complicated directions and procedures. Students’ efforts are futile if directions are not expressed directly and
explicitly. If assessment procedures, like those in Science experiments, are too elaborate, reading and comprehending the procedures would consume time.
Ease of Scoring
Obviously, selected-response formats are the easiest to score compared to restricted-response and, even more so, extended-response essays. Scoring performance assessments such as oral presentations and research papers objectively is also an issue. Selected-response tests are objectively marked because each item has one correct or best answer. This contributes to the reliability of the test. Performance assessments, however, make use of rubrics, and while this facilitates scoring, care must be observed in rating to ensure objectivity. McMillan (2007) suggests that for performance assessments, it is more practical to use rating scales and checklists rather than writing extended individualized evaluations.
Ease of Interpretation
Oftentimes, students are given a score that reflects their knowledge, skills or performance. However, this is meaningless if the score is not interpreted. Objective tests are the easiest to interpret. By establishing a standard, the teacher is able to determine right away if a student passed the test. By matching the score with a level of proficiency, the teacher can determine if the student has reached mastery or not. In performance tasks, rubrics are prepared to objectively and expeditiously rate students' actual performance or products. This reduces the time spent in grading because teachers refer to the descriptors and performance levels in the rubric without the need to write long comments. Nonetheless, the rubric would have to be shared with the students so that they understand the meaning of the scores or grades given to them.
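A purely hypothetical illustration of such interpretation (the figures are not from the text): if the passing standard for a 40-item test is set at 75%, a pupil who answers 32 items correctly (80%) is immediately interpreted as having passed; and if the class proficiency scale labels 80% to 89% as "proficient", the same raw score also locates the pupil on the mastery continuum without further computation.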
Cost
Classroom tests are generally inexpensive compared to national or high-stakes tests. Hence, citing cost as a reason for being unable to come up with valid and reliable tests is simply unreasonable. As for performance tasks, examples of tasks that are not considerably costly are written and oral reports, debates and panel discussions. Of course, students would have to pay for their use of resources like computers and copying facilities, among others. However, these are not as costly as some performance assessments that require plenty of and/or elaborate costumes, documentary films and portfolios. Teachers may remedy this by asking students to consider using second-hand or recycled materials.
While classroom tests may not be costly compared to some performance assessments, it is relevant to know that excessive testing may just train students on how to take tests while inadequately preparing them for a productive adult life. Thus, it is paramount that assessment methods are used in line with the learning targets. According to Darling-Hammond & Adamson (2013), open-ended assessments (essay exams and performance tasks) are more expensive to score, but they can support more ambitious teaching and learning.

It is recommended that one chooses the most economical assessment. McMillan (2007, p. 88), in his explanation of economy, said that "economy should be thought of in the long run, and an unreliable, less expensive test (or any assessment for that matter) may eventually cost more in further assessment." At the school level, multiple choice tests are still very popular, especially in entrance tests, classroom tests, and simulated board exams. However, high quality assessments focus on deep learning, which is essential if students are to develop the skills they need for a knowledge society (Darling-Hammond & Adamson, 2013). That being said, schools must support performance assessments.

EXPLORE

ACTIVITY: METHOD SELECTION


Among the given assessment methods, choose which one is most practical or efficient. Justify your answer.
Note: fundamentally, the choice of assessment method hinges on its appropriateness to achieve learning targets.
Practicality and efficiency should not take precedence.
Factor | Assessment Method (choose one) | Reason
Familiarity with the method | Selected-response / Constructed response / Teacher observation / Self-assessment |
Familiarity with the method | Brief-constructed response / Essay / Performance assessment / Oral questioning |
Time required | Selected-response / Constructed response / Teacher observation / Self-assessment |
Time required | Brief-constructed response / Essay / Performance assessment / Oral questioning |
Ease in administration | Selected-response / Constructed response / Teacher observation / Self-assessment |
Ease in scoring | Selected-response / Constructed response / Teacher observation / Self-assessment |
Ease of interpretation | Selected-response / Constructed response / Teacher observation / Self-assessment |
Cost | Selected-response / Constructed response / Teacher observation / Self-assessment |

From your answers, which among the assessment methods are more practical and efficient?
____________________________________________________________________________________________
____________________________________________________________________________________________
____________________________________________________________________________________________

APPLY

Name: ______________________________________________ Date: ___________________________


ASSESSMENT SCENARIOS
For each scenario, describe the assessment method in view of the factors of practicality and efficiency.

Scenario 1: Ms. Rivera, a Science teacher, would like her students to create a video about proper waste management. She looked into the strengths and limitations of giving a performance task. She is used to giving projects to students, but has limited knowledge about educational technology. Except for the content, she does not know how to grade the technical aspects of the video.

Familiarity with the method

Time Required

Complexity of Administration

Ease of Scoring & Interpretation

Cost

Scenario 2: Ms. De Luna, a Social Studies teacher, is preparing an assessment for her Grade 6 class on the topic "Mga Hamon sa Kasarinlan" (Challenges to Independence). For this topic, students should be able to examine problems and challenges on independence after the Second World War. The teacher decides that essay questions can appropriately measure students' understanding about colonial mentality and parity rights. She has almost 50 students and plans to return the papers within three days.

Familiarity with the method

Time Required

Complexity of Administration

Ease of Scoring & Interpretation

Cost

ASSESS

Name: ______________________________________________ Date: ___________________________


TASK: SCENARIO-BASED/ PROBLEM SOLVING LEARNING
As an assessment expert, you were asked for advice on the following matters. Provide your
recommendations.
Scenario 1:
Ms. Lorenzo handles six Grade 5 students in English. She would like to test their skills in spelling. She has
two options:
a. Oral spelling test
The teacher pronounces each word out loud and the students write down each word.
b. Spelling bee-type test
Each student is asked individually one-at-a-time to spell specific words out loud.
In terms of practicality and efficiency, which one would you suggest? Why?

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

Scenario 2:

Mr. Chua is preparing a final examination for his third-year college students in Philippine Government and Constitution, scheduled two weeks from now. He is handling six classes and has no teaching assistant. He realized that after the examinations, he has three days to calculate the grades of his students. He is thinking of giving extended-response essays.

What advice would you give him? Which aspect of practicality and efficiency should he prioritize? What type of assessment should he consider?

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

Scenario 3:

Ms. Rodriguez will give an end-of-year comprehension and speaking test in English for her Grade 3 pupils in Pampanga. She handles three sections with 45 students each. She meets them daily for 45 minutes. She has only one teaching assistant.

a. Should the test be individually administered or group-administered?

b. Should the directions, examples and prompts be in the mother tongue or in English? Should these be spoken or
written?

c. Should student answers and responses be in the mother tongue or in English? Spoken or written?

d. Should the method of scoring be based on counting the number of correct answers, and/or follow a holistic approach (one overall score) or an analytic approach (separate scores for each performance criterion)?

Give your recommendations in view of the principles of practicality and efficiency.

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

____________________________________________________________________________________________

CHAPTER 6 Ethics
OVERVIEW
This chapter centers on ethical issues and responsibilities of teachers in the assessment process. Russell and Airasian (2012) describe assessment as more than just a technical activity; it is a human activity. They explained that assessment has consequences for students and other stakeholders. If you recall the relevance and roles of assessment, teachers, as well as students, administrators, policymakers, and other stakeholders have a stake in assessment. Assessment is used to form judgements on the nature, scope and extent of students' learning. For summative purposes, assessment is used for grading, placement and admission. For formative purposes, it drives instruction. If students' scores are used for purposes different from what was intended, say to evaluate teacher performance, then there is an ethical issue.

Intended Learning Outcome (ILO)


At the end of Chapter 6, students are expected to:
• Recommend actions to observe ethical standards in testing.

ENGAGE

Students’ Knowledge of Learning Targets and Assessments


This aspect of fairness speaks of transparency. Transparency is defined here as the disclosure of information to students about assessments. This includes what learning outcomes are to be assessed and evaluated, assessment methods and formats, weighting of items, allocated time in completing the assessment, and grading criteria or rubrics. By informing students of assessment details, they can adequately prepare and recognize the importance of the assessment. They become part of the assessment process. By doing so, assessment becomes learner-centered.
In regard to written tests, it is important that students know what is included and excluded in the test. Giving them sample questions may help them evaluate their strategies and current levels of understanding. Aside from the content, scoring criteria should be known or made public (to the students concerned). As for performance assessments, criteria should be divulged prior to assessment so that students will know what the teacher is looking for in the performance or product. Following the criteria, they can reflect on their practices and personally identify their strengths and weaknesses. They can evaluate their work performance and product output and make the necessary improvements before the scheduled assessment or work submission. For product-based assessments, it would be instrumental if teachers can provide a sample of the work done by previous students so that students can see the kind or quality of work their teacher is expecting from them. However, originality should be emphasized.
Now, what about surprise tests and pop quizzes? There are teachers who would defend and rationalize their giving of unannounced assessments, and there are studies that would support it. For instance, Graham's (1999) study revealed that unannounced quizzes raised the test scores of mid-range undergraduate students, and the majority of students in his sample claimed to appreciate the use of quizzes. This may be due to the feedback process. Graham also found that unannounced quizzes motivated students to distribute their learning evenly and encouraged them to read materials in advance. Kamuche (2007) reported that students given unannounced quizzes showed better academic performance than the control group given announced quizzes. While unannounced quizzes compel students to come to class prepared, we should not discount the possibility that pop quizzes generate anxiety among students. Graham (cited by Kamuche, 2007) stated that unannounced quizzes tend to increase examination tension and stress, and do not offer a fair examination.
Test-taking skill is another concern. For instance, some students may be better at answering multiple choice test items than other students. They may have developed test-taking skills and strategies like reading the directions carefully, previewing the test, answering easy items first, reading the stem and all the options before selecting an answer, marking vital information in the stem, eliminating alternatives and managing test time effectively. To level the playing field, all students should be familiar with these test strategies before they take the assessment.
Related to the above, teachers should not create unusual hybrids of assessment formats. For instance, a matching type with three columns may leave test-takers perplexed, deciphering the test format rather than showing how well they have learned the subject matter.
Opportunity to Learn
There are teachers who are forced to give reading assignments because of the breadth of the content that has to be covered in addition to limited or lost classroom time. Then, there is little or no instruction that follows. This would certainly put students at a disadvantage because they were not given ample time and resources to sufficiently assimilate the material. They are being penalized for lack of opportunity. McMillan (2007) asserted that fair assessments are aligned with instruction that provides adequate time and opportunities for all students to learn. Discussing an extensive unit in an hour is obviously insufficient. Inadequate instructional approaches would not be just for the learners because they are not given enough experiences to process information and develop their skills. They will be ill-prepared for the summative test or performance assessment.
Prerequisite Knowledge and Skills
Students may perform poorly in an assessment if they do not possess the background knowledge and skills. Suppose grade school pupils were taught about inverse proportion. Even if they memorize the meaning of proportion, they would not be able to develop a schema unless they are able to connect the new information with previous knowledge. If they lack adequate knowledge about ratios and direct proportion, they would not fully grasp the concept of inverse proportion. Moreover, they would have difficulty solving word problems on proportion if they have weak skills in multiplication and division, which are prerequisite skills. And so, it would be improper if students are tested on the topic without any attempt or effort to address the gap in knowledge and skills. The problems are compounded if there are misconceptions.
Another problem emerges if the assessment focuses heavily on prior knowledge and prerequisite skills. Going back to the previous example, if students are asked to solve problems on proportion that are written verbosely, then their performance in the assessment is considerably reflective of their reading comprehension skills and vocabulary rather than their problem-solving skills. The same thing can be said for problems that are simply worded but contain extremely large numbers. The test would simply call for skills in arithmetic calculations. This would not be fair to the students concerned.
So as not to be unfair, the teacher must identify early on the prerequisite skills necessary for completing an assessment. The teacher can analyze the assessment items and procedures and determine the prior knowledge and skills required to answer them. Afterwards, the teacher can administer a prior knowledge assessment, the results of which can lead to additional or supplemental teacher- or student-managed activities like peer-assisted study sessions, compensatory groups, note swapping and active review. The teacher may also provide clinics or remedial tutorials to address gaps in students' knowledge and skills. He/she may also recommend reading materials or advise students to attend supplemental instruction sessions when possible. These are forms of remediation. Depending on the case, if warranted, the teacher may advise students to drop the course until they are ready, or to retake a prerequisite course. At the undergraduate level, prerequisites are imposed to ensure that students possess the knowledge and skills necessary to advance and become successful in subsequent courses.
Avoiding Stereotyping
A stereotype is a generalization about a group of people based on inconclusive observations of a small sample of this group. Common stereotypes are racial, sexual and gender remarks. Stereotyping is caused by preconceived judgements, sometimes unintended, of people one comes in contact with. It is different from discrimination, which involves acting out one's prejudicial opinions. For instance, a teacher employed in a city school may have low expectations of students coming from provincial schools or those belonging to ethnic minority groups. Another may harbor uncomfortable feelings toward students from impoverished communities. These teachers carry the idea that students from such groups are cognitively or affectively inferior. These typecasts are based on socio-economic status.

There are also those on gender, race and culture. A professional education teacher may believe that since the education program is dominated by females, they are better off as teachers than males. A teacher may also have an opinion that other Asian students are better in Mathematics than Filipino students, and thus, the latter will require more instruction. Stereotypes may be either positive or negative. For instance, foreigners may regard Filipinos as hospitable and hardworking individuals, but Filipinos are also stereotyped as domestic helpers and caregivers.
Teachers should avoid terms and examples that may be offensive to students of a different gender, race, religion, culture or nationality. Stereotypes can affect students' performance in examinations. In 1995, Steele and Aronson developed the theory of stereotype threat, claiming that for people who are challenged in areas they deem important, like intellectual ability, the fear of confirming negative stereotypes can cause them to falter in their actual test performance. For instance, a female student who was told that females are academically inferior in Mathematics may feel a certain level of anxiety, and the negative expectation may cause her to underperform in the assessment. Jordan and Lovett's (2009) paper provided a review of the literature on stereotype threat. The paper cited research on the detrimental effects of stereotype threat such as reduced working memory capacity; abandonment of valued social identities; disengagement from threatening domains among stereotyped individuals; and lowered self-worth.
To reduce the effect of stereotype threat, simple changes in classroom instruction and assessment can be implemented, such as assuring diverse students that they can excel at difficult tasks and that any responsible student can achieve high standards, and by ensuring gender-free and culturally-unbiased test items. A school environment that fosters positive practices and supports collaboration instead of competition can be beneficial, especially for students in diverse classrooms where ethnic, gender and cultural diversity thrive.
Jordan and Lovett (2009) recommended five changes to psycho-educational assessment to alleviate stereotype threat.
• Be careful in asking questions about topics related to a student's demographic group. This may inadvertently induce stereotype threat even if the information presented in the test is accurate.
• Place measures of maximal performance, like ability and achievement tests, at the beginning of assessments, before giving less formal self-report activities that contain topics or information about family background, current home environment, preferred extracurricular activities and self-perceptions of academic functioning.
• Do not describe tests as diagnostic of intellectual capacity.
• Determine if there are mediators of stereotype threat that affect test performance. This can be done using informal interviews or through standardized measures of cognitive interference and test anxiety.
• Consider the possibility of stereotype threat when interpreting the test scores of susceptible, typecast individuals.

Avoiding Bias in Assessment Tasks and Procedures


Assessment must be free from bias. Fairness demands that all learners are given equal chances to do well (from the task) and to get a good assessment (from the rater). Teachers should not be affected by factors that are not part of the assessment criteria. In correcting an essay, for instance, a student's gender, academic status, socio-economic background or handwriting should not influence the teacher's judgement or scoring decisions. This aspect of fairness also includes removal of bias towards students with limited English or with different experiences when providing instruction and constructing assessments (Russell & Airasian, 2012). This should not be ignored, especially with the advent of the ASEAN (Association of Southeast Asian Nations) Economic Integration in 2015, when there is greater mobility of students among educational institutions across ASEAN countries.
There are two forms of assessment bias: offensiveness and unfair penalization (Popham, 2011). These forms distort the performance of individuals in a group.
Offensiveness happens if test-takers get distressed, upset or distracted by how an individual or a particular group is portrayed in the test. The content of the assessment may contain slurs or negative stereotypes of a particular ethnic, religious or other group, causing undue resentment, discomfort or embarrassment to some directly affected students. They tend to focus on the offensive items, and their concentration in answering subsequent items suffers. Ultimately, they end up not performing as well as they could have, reducing the validity of inferences. An essay about traffic congestion sweepingly portraying traffic enforcers of the Metropolitan Manila Development Authority (MMDA) as corrupt is an example of bias. This assessment may affect students whose parents work with the MMDA.
Unfair penalization harms students' performance due to test content, not because items are offensive, but because the content caters to particular groups from the same economic class, race, gender, etc., leaving other groups at a loss or a disadvantage. For example, suppose a reading comprehension test was given, with questions based on a passage about the effects of the K to 12 program. Will students who are not familiar with the K to 12 basic education framework answer the questions accurately? Similarly, will male and female students be able to perform equally well in a statistics test that contains several problems and data about sports? What if a teacher in Computer or Educational Technology gives a test on how wearable technology can impact various professions; will students coming from low-income families answer as elaborately as those from the upper class who actually possess wearable gadgets? Consider another situation. Suppose the subject is Filipino or Araling Panlipunan (Social Studies), and the class has foreign students. Should they be mixed with native speakers in class? Should test items be constructed containing deep or heavy Filipino words? The aforementioned situations illustrate undue penalization resulting from group membership. Unfair penalization causes distortion and greater variation in scores which is not due to differences in ability. Substantial variation or disparity in assessment scores between student groups is called disparate impact. Popham (2011) pointed out that disparate impact is not tantamount to assessment bias. Differences may exist, but they may be due to inadequate prior instructional experience. Take, for instance, the 2013 National Achievement Test, where the National Capital Region topped other regions under Cluster 1 composed of Eastern Visayas, Western Visayas, Central Luzon, Bicol, and Calabarzon. If the test showed no signs of bias, then it is implied that the disparate impact is due to prior instructional inadequacies or lack of preparation.
To avoid bias during the instruction phase, teachers should heighten their sensitivity towards bias and generate multiple examples, analogies, metaphors and problems that cut across boundaries. To eradicate or significantly reduce bias in assessments, teachers can consider a judgmental approach or an empirical approach (Popham, 2011).
Teachers can have their tests reviewed by colleagues to remove offensive words or items. Content-knowledgeable reviewers can scrutinize the assessment procedure or each item of the test. In developing a high-stakes test, a review panel is usually formed, a mix of male and female members from various subgroups who might be adversely impacted by the test. For each item, the panelists are asked to determine if it might offend or unfairly penalize any group of students on the basis of personal characteristics. Each panel member responds and gives comments. The mean per-item absence-of-bias index is calculated by getting the average of the "no" responses. If an item is found biased, it is discarded. Qualitative comments are also considered in the decision to retain, modify or reject items. Afterwards, the entirety of the test is checked for any bias.
As for the empirical approach, try-out evidence is sought. The test may be pilot-tested with different groups, after which differential item functioning (DIF) procedures may be employed. A test item is labelled with DIF when people with comparable abilities but from different groups have unequal chances of item success. Item response theory (IRT), the Mantel-Haenszel procedure and logistic regression are common procedures for assessing DIF.
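To make the logistic regression approach concrete, below is a minimal sketch, not taken from the source text, of how a uniform-DIF check might be run for a single item. It assumes that scored item responses, total test scores and group membership are already available; the simulated data, variable names and the 0.05 cutoff are illustrative assumptions only.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: 1/0 correctness on one item, total test scores,
# and group membership (0 = reference group, 1 = focal group).
rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)
total = rng.normal(25, 5, n)
# Simulated responses: ability (total score) drives success; the group
# term injects artificial DIF so the check has something to detect.
logit = 0.25 * (total - 25) - 0.8 * group
item = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Uniform-DIF check: regress item success on ability and group membership.
X = sm.add_constant(np.column_stack([total, group]))
fit = sm.Logit(item, X).fit(disp=0)

group_coef = fit.params[2]
group_p = fit.pvalues[2]
print(f"group coefficient = {group_coef:.2f}, p = {group_p:.4f}")
if group_p < 0.05:  # illustrative significance cutoff
    print("Possible DIF: matched examinees from the two groups differ on this item.")
```

In practice, the Mantel-Haenszel statistic or IRT-based methods would be used alongside or instead of this sketch, and a flagged item would still be reviewed judgmentally rather than dropped automatically.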
Accommodating Special Needs
Teachers need to be sensitive to the needs of students. Certain accommodations must be given, especially for those who are physically or mentally challenged. The legal basis for accommodation is contained in Section 12 of Republic Act 7277, entitled "An Act Providing for the Rehabilitation, Self-Development and Self-Reliance of Disabled Persons and their Integration into the Mainstream of Society and for Other Purposes". The provision talks about access to quality education: learning institutions should consider the special needs of learners with disabilities in terms of facilities, class schedules, physical education requirements and other related items. Another is Sec. 32 of CHED Memorandum 09, s. 2013 on "Enhanced Policies and Guidelines on Student Affairs and Services", which states that higher education institutions should ensure that academic accommodation is made available to persons with disabilities and learners with special needs.
Accommodation does not mean giving an advantage to students with learning disabilities, but rather allowing them to demonstrate their knowledge on assessments without hindrance from their disabilities. It is distinct from assessment modification because accommodation does not involve altering the construct of the assessment, that is, what the assessment is intended to measure in the first place.

Let us consider some situations that require accommodations. For students with documented learning disabilities who are slow in reading, analyzing and responding to test questions, the teacher can offer extended time to complete the test. For students who are easily distracted by noise, the teacher can make arrangements for these students to accomplish the assessment in another room free from distractions, or carry out simple or innovative ways to reduce unnecessary noise entering the classroom. For students who do not have perfect vision, the teacher can adjust and print the written assessment in a larger font.
The situations above have straightforward solutions. But there are also challenging situations that require much thought. For instance, should a student who is recovering from an accident and unable to write with his/her hands be allowed to have a scribe? Should foreign students who do not possess an extensive English vocabulary be permitted to use a dictionary? Are there policies and processes for such cases?
Accommodations can be placed in one of six categories (Thurlow, McGrew, Tindal, Thompson & Ysseldyke, 2000).
• Presentation (repeat directions, read aloud, use large print, braille)
• Response (mark answer in test booklet, permit responses via digital recorder or computer, use reference
materials like dictionary)
• Setting (study carrel, separate room, preferential seating, individualized or small group, special lighting)
• Timing (extended time, frequent breaks, unlimited time)
• Scheduling (specific time of day, subtests in different order, administer test in several timed sessions)
• Others (special test preparation techniques and out-of-level tests)

Fundamentally, an assessment accommodation should attend to the particular need of the student
concerned. For instance, presentation and setting are important considerations for a learner who is visually impaired.
For a learner diagnosed with attention deficit hyperactivity disorder (ADHD), frequent breaks are needed during a test
because of the child’s short attention span. To ensure the appropriateness of the accommodation supplied, it should
take into account three important elements:
• Nature and extent of learner’s disability. Accommodation is dictated by the type and degree of disability
possessed by the learner. A learner with moderate visual impairment would need a larger print edition of
assessment or special lighting condition. Of course, a different type of accommodation is needed if the child
has severe visual loss.

• Type and format of assessment. Accommodation is matched to the type and format of the assessment given. Accommodations vary depending on the length of the assessment, the time allotted, mode of response, etc. A partially deaf child would not require assistance in a written test. However, his/her hearing impairment would affect his/her performance should the test be dictated. He/she would also have difficulty in assessment tasks characterized by group discussions like round table sessions.

• Competency and content being assessed. Accommodation does not alter the level of performance or content the assessment measures. In Science, permitting students to have a list of scientific formulae during a test is acceptable if the teacher is assessing how students are able to apply the formulae and not simple recall. In Mathematics, if the objective is to add and subtract counting numbers quickly, extended time would not be a reasonable accommodation.

Relevance
Relevance can also be thought of as an aspect of fairness. Irrelevant assessment would mean short-changing
students of worthwhile assessment experiences. Assessment should be set in a context that students will find
purposeful. Killen (2000) gave additional criteria for achieving quality assessments.

"Assessment should reflect the knowledge and skills that are most important for students to learn."
Assessment should not include irrelevant and trivial content. Instead, it should measure higher-order abilities such as critical thinking, problem solving and creativity, which are 21st century skills. These skills are essential if one is to be productive and competitive in today's global society. Teachers are reminded that, while aiming for high levels of performance, assessment should not curtail students' sense of creativity and personality in their work. Rather, it should foster initiative and innovation among students.

"Assessment should support every student's opportunity to learn things that are important."
Assessment must provide genuine opportunities for students to show what they have learned and encourage reflective thinking. It should prompt them to explore what they think is important. This can be done, for example, using Ogle's KWL (Know-Want-Learn) chart as a learning and assessment tool. It activates prior knowledge and personal curiosity and encourages inquiry and research.

“Assessment should tell teachers and individual students something that they do not already know.”
Assessment should stretch students' ability and understanding. Assessment tasks should allow them to apply their
knowledge in new situations. In a constructivist classroom, assessment can generate new knowledge by scaffolding
prior knowledge.

Ethical Issues

There are times when assessment is not called for. Asking pupils to answer sensitive questions, such as about their sexuality or problems in the family, is unwarranted without the consent of the parents. Grades and reports generated by teachers from invalid and unreliable test instruments are unjust. The resulting interpretations are inaccurate and misleading.

Other ethical issues in testing (and research) that may arise include possible harm to the participants; confidentiality of results; deception in regard to the purpose and use of assessment; and the temptation to assist students in answering tests or responding to surveys.

EXPLORE

ACTIVITY 1: ASSESSMENT SCENARIOS

Suppose you are the principal of a public high school. You received complaints from students concerning their tests. Based on their complaints, you decided to talk to the teachers concerned and offer recommendations based on ethical standards. Write down your recommendations, citing specific aspects of ethics or fairness discussed in this chapter.

Scenario 1: Eighth-grade students complained that their Music teacher uses only written tests as the sole method of assessment. They were not assessed on their skills in singing or creating musical melodies.

Scenario 2: Grade 7 students complained that they were not informed that there is a summative test in Algebra.

Scenario 3: Grade 7 students complained that they were not told what to study for the mastery test. They were simply told to study and prepare for the test.

Scenario 4: Grade 9 students complained that there were questions in their Science test on the last unit which was
not discussed in class.

Scenario 5: Grade 9 students studied for a long test covering the following topics: Motion in Two Dimensions; Mechanical Energy; Heat, Work and Efficiency; Electricity and Magnetism. The teacher, however, prepared questions mostly on the first topic. Hence, students complained that most of what they studied did not turn up in the test.

Scenario 6: Students complained that they had difficulty with the test because they prepared by memorizing the content.

Scenario 7: Students were tested on an article they read about Filipino ethnic groups. The article depicted Ilocanos
as stingy. Students who hail from Ilocos were not comfortable answering the test.

ACTIVITY 2: DO’S AND DON’TS

Suppose you are a firm teacher, especially when it comes to ethics in assessment. You would like to observe correct conduct prior to, during and after a test. Write down five do's and don'ts for each phase of testing.

Before the Test | During the Test | After the Test

ACTIVITY 3: CHECKLIST QUESTIONS

An inexperienced teacher would like to ask for your help in regard to ethics in assessment. Generate guide questions (three for each aspect) to ensure ethical behavior in each aspect discussed in this chapter. Sample questions are given.

Aspect | Questions

Student knowledge of learning targets and assessments
1. Were students informed of the learning targets?
2.
3.

Opportunity to learn
1. Were all students given sufficient opportunities to learn the knowledge and skills?
2.
3.

Prerequisite knowledge and skills
1.
2.
3.

Avoiding student stereotyping
1.
2.
3.

Avoiding bias in assessment tasks and procedures
1.
2.
3.

Accommodating special needs
1.
2.
3.

Relevance
1.
2.
3.

APPLY

Name: _________________________________________________ Date: ____________________________

A. CASE ANALYSIS

Article VIII of the Code of Ethics for Teachers (Resolution No. 435, s. 1997) contains ethical standards in relation to students. These standards serve as benchmarks to ensure that teachers observe fairness, justice and equity in all phases of the teaching-learning cycle.

For each of the following, explain why the teacher’s action is deemed unethical. Moreover, cite a section
of the Code of Ethics that was violated to support your answer.

1. A Social Studies teacher gave an essay test asking students to suggest ways to improve the quality of education in the country. The teacher simply scanned students' answers, and many of the students received the same score.
_________________________________________________________________________________________
_________________________________________________________________________________________
_________________________________________________________________________________________
2. A Grade 4 Technology and Livelihood Education teacher did not allot sufficient time to teaching animal care, but instead focused on gardening, a topic which he liked the most.
_________________________________________________________________________________________
_________________________________________________________________________________________
_________________________________________________________________________________________
3. The teacher uses stigmatizing descriptions for students who are not able to answer teacher’s questions during
an initiation-response-evaluation/recitation.
_________________________________________________________________________________________
_________________________________________________________________________________________
_________________________________________________________________________________________
4. A grade school teacher deducted five points from a student's test score for his misdemeanor in class.
_________________________________________________________________________________________
_________________________________________________________________________________________
_________________________________________________________________________________________
5. The teacher gave additional points to students who bought tickets for the school play which was declared to be
optional.
_________________________________________________________________________________________
_________________________________________________________________________________________
_________________________________________________________________________________________
6. Some students received an "incomplete" in the performance task due to absences or other reasons. The teacher told them to acquire books in lieu of the performance assessment.
_________________________________________________________________________________________
_________________________________________________________________________________________
_________________________________________________________________________________________
7. A student approached the teacher during the summative test in Mathematics to ask about a test item. The teacher reworded the test item and clarified the test question.
_________________________________________________________________________________________
_________________________________________________________________________________________
_________________________________________________________________________________________
8. The teacher excused foreign students from participating in the Linggo ng Wika program.

_________________________________________________________________________________________
_________________________________________________________________________________________
_________________________________________________________________________________________

B. SURVEY AND RECOMMENDATIONS

Green, Johnson, Do-Hong Kim & Pope (2007) wrote a study on “Ethics in Classroom Assessment Practices”.
Their study focused on defining ethical behavior and examining educators' ethical judgement in relation to assessment. Some of their scenarios are presented here.

Your task is to judge whether the student evaluation practice in each scenario is ethical or unethical. Each member of the group should do this separately. Afterwards, determine the number and percentage of the group who judged the scenario as ethical/unethical. Then, discuss among your group members the reasons for your agreement/disagreement. If unethical, write your recommendations. Complete the table below.

Scenario | Respondents' answers (n = ___) | Total %

1. A teacher spends a class period to train his students in test-taking


skills (e.g., not spending too much time on one problem, eliminating
impossible answers, guessing).

2. A teacher administers a parallel form of the National Aptitude Test (NAT) to his students in preparation for the national testing. The parallel form is another version of the national test that assesses the same content.

3. A teacher assesses student knowledge by using many types of


assessments: multiple-choice tests, essays, projects, portfolios.

4. A high school Social Studies teacher bases students’ final quarter


grade on two multiple choice tests.

5. A teacher states how she will grade a task when she assigns it.

6. A teacher tells students what materials are important to learn in


preparing for a class test.

7. For the final exam, a teacher always uses a few surprise items about
topics that were not on the study guide.

8. To minimize guessing, a teacher announces she will deduct more


points for a wrong answer than for leaving the answer blank.

9. A teacher allows a student with a learning disability in the language


arts to use a tape recorder when the student answers the essay
questions on social studies test.

10. A teacher always knows the identity of the students whose essay test she is grading.

ASSESS

Name: _________________________________________________ Date: ____________________________

TASK 1: SCENARIO-BASED/ PROBLEM SOLVING LEARNING


As an assessment expert, you were asked for advice on the following matters. Provide your
recommendations.

Scenario 1:

Here is an item found in a primary textbook characterizing people in a unit called “Racial Harmony”.

Choices: Japanese Indian British Chinese Korean Filipino

1. I am ________________. I am an English Teacher


2. I am ________________. I am a domestic helper in Hong Kong.

Would you recommend changing the items? Why?

____________________________________________________________________________________________
____________________________________________________________________________________________

Scenario 2:

Ms. Samson is assessing the spelling skills of Grade 3 students. Suppose she has a student with hearing
impairment. Which of the following would you select? Why?

a. Proof-reading style test


A paragraph is given to the student, and the student must find the misspelled words and supply the correct spelling in the spaces provided.

____________________________________________________________________________________________
____________________________________________________________________________________________

b. Multiple choice spelling test


Two or more options are given for each word in the test, and the student must choose the correctly spelled word.

____________________________________________________________________________________________
____________________________________________________________________________________________

Scenario 3:

Mr. Rabor gave a test asking Grade 1 pupils to indicate whether the activity is for males or females.

1. Cleaning the house 4. Fixing the faucet


2. Washing the dishes 5. Doing the laundry
3. Driving a car

Do you find the test proper? Explain your answer. What would you recommend?

____________________________________________________________________________________________
____________________________________________________________________________________________

TASK 2: WORD CLOUD

A word cloud is an image composed of words selected from a text source. Words that appear more frequently
are given more prominence.

Explain how ethics comes into play in regard to assessment. Use at least three words/expressions found in the word cloud. Cite an example or situation showing how ethics (in assessment) can be translated from theory to practice.

Accommodating special needs

Opportunity to learn

Fairness
Unfair penalization

Knowledge of learning target


Prerequisite skills

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

GROUP ASSIGNMENT:
Ask a group of elementary, high school or non-Education college students (20-25) about their concept of a
fair assessment. Write down their responses in bullet form. Highlight key words. Create your own word cloud. You
may use any online word cloud generators.

SECTION 2 SUMMATIVE ASSESSMENT

For each item, write the letter of the correct answer on the space provided before the item number.
_______1. Which of the following is true about selected-response items?
A. It assesses the affective domain.
B. It assesses only part of the cognitive domain.
C. It assesses the higher levels of the cognitive domain.
D. It assesses the psychomotor domain.
________2. Which of the following guides the teacher in ensuring that achievement domains are achieved and a fair
and representative sample of questions appear on the test? It provides evidence of content validity.
A. Assessment matrix
B. Lesson plan
C. Table of contents
D. Table of Specifications
_________3. Mr. Carlos asked his students to demonstrate their proficiency in conducting a Science experiment on
density. Which of the following assessment methods is most appropriate?
A. Brief-constructed response
B. Oral questioning
C. Performance Task
D. Teacher Observation
__________4. A Social Studies teacher is thinking of a suitable assessment that would allow his/her Grade 10
students to explain the political, social and economic effects of climate change. Which method of
assessment will work best?
A. Brief-constructed response
B. Essay
C. Performance Task
D. Selected-response
__________5. An English teacher would like to find out quickly if his/her students were able to understand the
elements of the short story they just took up in class. Which method is most proper for this?
A. Brief-constructed response
B. Essay
C. Observation
D. Oral questioning
__________6. A math teacher gave her students a survey instrument that would make them aware of and more
responsible for their own learning through reflection and critical evaluation. Which type of assessment
method did the teacher use?
A. Selected-response

B. Essay
C. Observation
D. Self-assessment
__________7. Which type of learning target is achieved when students successfully made a documentary about
human rights in their Social Science subject?
A. Deep Understanding and Reasoning
B. Knowledge and Simple Understanding
C. Products
D. Skills
___________8. One of the learning competencies in a Grade 10 Science class is that students should be able to describe the structure and function of the reproductive system. If expressed as a learning target, what type of learning target is this?
A. Deep Understanding and Reasoning
B. Knowledge and Simple Understanding
C. Products
D. Skills
___________9. Below is a test item in a Grade 8 periodical test in English: Construct a sentence using the given
adjectives and the indicated level of comparison.
• Clever (superlative)
________________________________________________________________________
• Little (comparative)
________________________________________________________________________

Which level of assessment is targeted?


A. Knowledge
B. Process
C. Understanding
D. Product/Performance
__________10. A Math teacher gave this assessment to his students to provide them with an opportunity to apply
their knowledge of systems of linear equations: Create a simple business proposal. The plan should
contain a title, purpose, budget, workers, and projected income. Highlight the equations used. A
graph of monthly income would be useful.
Which level of assessment is targeted?
A. Knowledge
B. Process
C. Understanding
D. Product/Performance

___________11. Mr. Cruz asked other Social Studies teachers in his high school to review his periodical test to
ensure that the test items represent his learning targets. Which type of evidence on validity did he
use?
A. Construct-related
B. Content-related
C. Criterion-related
D. Instructional-related
___________12. To obtain evidence of validity, a teacher allowed a time interval to elapse between administering
a test and obtaining the criterion scores. Which evidence of validity is this?
A. Concurrent validity
B. Consequential validity
C. Construct validity
D. Predictive validity
___________13. What evidence of validity is obtained by gathering test scores and criterion scores at nearly the
same time?
A. Concurrent validity
B. Consequential validity
C. Construct validity
D. Predictive validity
___________14. A teacher administered an aptitude test to a group of Grade 7 students and later compared their
test scores with their end-of-period grade in history. Which type of evidence on validity is shown
here?
A. Concurrent validity
B. Consequential validity
C. Construct validity
D. Predictive validity
___________15. In writing each test item, Ms. de Guzman was careful to ensure that the items are at an appropriate
reading level. Which type of evidence of validity does Ms. de Guzman want to obtain?
A. Construct-related
B. Content-related
C. Criterion-related
D. Consequential
___________16. Which of the following is true of a teacher-made achievement test with high content validity?
A. Test items were matched to the learning targets.
B. Test is equivalent to similar achievement tests.
C. Test items have high difficulty level.
D. Test was administered under controlled conditions.

___________17. The school guidance counselor administered two forms of the same personality test to the students.
Which type of reliability would she like to obtain?
A. Equivalence
B. Internal consistency
C. Rater consistency
D. Stability
___________18. A Science teacher administered the same test twice. Which statistical tool is used in a test-retest
method?
A. Cronbach alpha
B. Kuder-Richardson 20/21
C. Pearson
D. Spearman-Brown Prophecy
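
For item 18, a minimal sketch of the computation a test-retest study implies is given below; the score lists are hypothetical placeholders, and the Pearson product-moment coefficient is computed with NumPy.

```python
# Minimal test-retest sketch (hypothetical scores): the same test is given
# twice and the two score sets are correlated with the Pearson coefficient.
import numpy as np

first_administration = [35, 42, 28, 45, 38, 31, 40]
second_administration = [33, 44, 30, 43, 37, 34, 41]

r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest (stability) estimate: {r:.2f}")
```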
____________19. Which of the following statements is true about validity and reliability?
A. For an instrument to be valid, it must be reliable.
B. For an instrument to be reliable, it must be valid.
C. Both A and B
D. None of these
____________20. Which of the following is NOT a method to obtain a reliability coefficient?
A. Test-retest
B. Equivalent forms
C. Internal consistency
D. Concurrent forms
____________21. If the standard deviation is 8 and the reliability coefficient is 0.75, what is the 95% confidence interval,
or range, around an observed score of 100?
A. 98-102
B. 96-104
C. 94-106
D. 92-108
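
A minimal sketch of the arithmetic behind item 21, assuming the conventional formula SEM = SD × √(1 − reliability) and a 95% band of roughly ±2 SEM around the observed score:

```python
# Standard error of measurement and approximate 95% band, assuming
# SEM = SD * sqrt(1 - reliability) and a band of about +/- 2 SEM.
import math

sd = 8
reliability = 0.75
observed_score = 100

sem = sd * math.sqrt(1 - reliability)   # 8 * 0.5 = 4.0
lower = observed_score - 2 * sem        # 100 - 8 = 92
upper = observed_score + 2 * sem        # 100 + 8 = 108
print(f"SEM = {sem:.1f}; 95% band is roughly {lower:.0f}-{upper:.0f}")
```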
____________22. Suppose a test of 15 items has a reliability index of 0.80. If the number of items is increased to 60,
which of the following is the most likely reliability coefficient?
A. 0.70
B. 0.80
C. 0.90
D. 1.00
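
A minimal sketch assuming item 22 relies on the Spearman-Brown prophecy formula, which projects the reliability of a test lengthened by a factor k as k·r / (1 + (k − 1)·r):

```python
# Spearman-Brown prophecy projection: the 15-item test is lengthened
# fourfold, so k = 60 / 15 = 4.
r_original = 0.80
k = 60 / 15

r_projected = (k * r_original) / (1 + (k - 1) * r_original)  # 3.2 / 3.4
print(f"Projected reliability: {r_projected:.2f}")           # about 0.94
```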
____________23. An assessment should allow every learner equal opportunity to do well. Which principle of
assessment does this exhibit?

A. Fairness
B. Practicality
C. Reliability
D. Validity
____________24. Ms. de Leon uses familiar assessments that are not too time-consuming. Which assessment
criterion is she observing?
A. Fairness
B. Practicality and Efficiency
C. Reliability
D. Validity
____________25. Selected-response types like multiple choice tests can be a good choice of assessment for which
reason?
A. Efficient to administer and score
B. Can assess the learner’s ability to construct responses
C. Easy to prepare
D. Not biased
____________26. Ms. de Luna would like to give an inquiry-based research activity that would allow students to
undertake a historical inquiry of personalities, events, or issues in history. She looked into the
strengths and limitations of this assessment strategy and how she would grade and judge the process
and the final product. Which factor of practicality is she considering?
A. Cost
B. Ease of scoring and interpretation
C. Familiarity with the method
D. Time required
____________27. Ms. Sibug handles a large class with 50 students. What type of test would be less time-consuming
to grade?
A. Completion test
B. Essay
C. Multiple choice test
D. Oral questioning
____________28. A teacher informed a child’s parents that their son did not do well in the performance assessment.
To prove her point, she showed their son’s grades and the scores obtained by their son’s
classmates. What ethical issue is evident in this case?
A. Comprehensiveness
B. Confidentiality
C. Fairness
D. Transparency

___________29. A Grade 7 Science teacher grouped the students for an oral presentation about mitigation and
disaster risk reduction. She forgot to discuss with the class how they would be graded. What ethical
concern did the teacher fail to consider?
A. Confidentiality
B. Fairness
C. Relevance
D. Transparency
____________30. The teacher reported that Janna is a slow and sloppy learner. What is wrong with the teacher’s
evaluation?
A. Irrelevant
B. Offensive
C. Stereotyping
D. Unfair penalization

