Assessment in Learning Module
1. Placement Test
It is used to determine a learner's entry performance.
Teachers assess through a readiness pre-test.
It is used to determine if students have already acquired the intended
outcomes.
Typically focuses on the questions "Does the learner possess the
knowledge and skills needed to begin the planned instruction?" and "To
what extent has the learner already developed the understanding and
skills that are the goals of the planned objectives?"
A placement pre-test contains items that measure the knowledge and skills
of students in reference to the learning targets.
2. Formative Assessment
There is now a shift from a testing culture to an assessment culture
characterized by the integration of assessment and instruction (Dochy,
2001)
It mediates the teaching and learning process
The main concern of a classroom teacher is to monitor the learning
progress of the students.
The teacher should assess whether the students achieved the intended
learning outcomes set for a particular lesson
Its purposes are to provide immediate feedback, identify learning errors,
modify instruction, and improve both learning and instruction.
It is learner-centered and teacher-directed.
It occurs during instruction and is context-specific.
Muddiest point: a technique that can be used to address gaps in learning.
Background knowledge probe: a short and simple questionnaire given at the
start of the lesson.
Used as feedback to enhance teaching and improve the process of
learning
An on-going process
3. Diagnostic Assessment
Intended to identify learning difficulties during instruction
Can detect commonly held misconceptions in a subject
Are not merely given at the start of instruction
Used to detect causes of persistent learning difficulties
4. Summative Assessment
Is done at the end of instruction to determine the extent to which the
students have attained the learning outcomes
Is used for assigning and reporting grades or certifying mastery of
concepts and skills
5. Interim Assessment
have the same purpose as formative assessments, but these are
given periodically throughout the school year
prepare students for future assessments
fall between formative and summative assessments
Principles of High Quality Assessment
Practicing - Trying a specific physical activity over and over.
Sample verbs: bend, calibrate, construct, differentiate, dismantle, fasten,
fix, grasp, grind, handle, measure, mix, organize, operate, manipulate, mend.
Example: Display several dance steps in sequence.
Internalizing a value complex - Acting consistently with the new value.
Sample verbs: perform, practice, propose, qualify, question, revise, serve,
use, verify.
1. Selected-Response Format
• Students need only to recognize and select the correct answer.
• Although some can be composed to address higher-order thinking skills,
most require only identification and recognition.
2. Constructed-Response Format
• Demands that students create or produce their own answers in
response to a question, problem, or task.
• Useful in targeting higher levels of cognition.
• An essay item that requires only a few sentences is called restricted-
response.
3. Oral Assessment
• A common assessment method during instruction to check on
student understanding.
• Oral questioning may take the form of an interview or
conference.
4. Teacher Observation
• A form of on-going assessment, usually done in combination
with oral questioning.
• Can also be used to assess the effectiveness of teaching
strategies and academic interventions.
5. Student Self-Assessment
• One of the standards of quality assessment identified by
Chappuis, Chappuis and Stiggins (2009).
• A process in which the students are given a chance to reflect
and rate their own work and judge how well they have performed
in relation to a set of assessment criteria.
• Skills
• To assess skills, performance assessment is obviously the
superior assessment method.
• It becomes an “authentic assessment” when used in real-life
and meaningful context.
• Products
• Most adequately assessed through performance tasks.
• It is a substantial and tangible output that showcases students'
understanding of concepts and skills and their ability to apply,
analyze, evaluate and integrate those concepts and skills.
• Examples: musical compositions, stories, poems, research
studies, drawings, model constructions and multimedia
materials.
• Affect
• Student affect cannot be assessed simply by selected-response
or brief constructed-response tests.
• It pertains to the attitudes, interests, and values students manifest.
• Best method for this learning target is self-assessment.
• Oral questioning may also work in assessing affective traits.
Validity
Types of Validity
1. Content Validity - is the extent to which the content of the test matches the
instructional objectives.
3. Criterion-Related Validity - is the extent to which scores on the test
agree with (concurrent validity) or predict (predictive validity) an external
criterion.
Reliability
Types of Reliability
1. Test-Retest Reliability
3. Parallel-Forms Reliability
4. Inter-rater Reliability - the extent to which two or more
judges or raters agree in their assessment decisions.
Factors affecting reliability include:
Group variability
Environmental conditions
Practicality and Efficiency
Refers to the teacher's familiarity with the methods used, the time required
for the assessment, the complexity of the administration, ease of scoring,
ease of interpretation of the test results, and the low cost of the materials
used.
Time Required
Assessment should allow students to respond readily but not hastily.
Assessment should also be scored promptly but not without basis. It
should be noted that time is a matter of choice: it hinges on the teacher's
choice of assessment method.
Ease in Administration
Assessment should be easy to administer. To avoid questions during
the test or performance task, instructions must be clear and complete.
Instruction that is vague will confuse the students and they will
consequently provide incorrect responses. This may be a problem with
performance assessments that contain long and complicated directions
and procedures.
Ease of Scoring
Obviously, selected-response formats are the easiest to score,
compared to restricted-response and, even more so, extended-response
essays. It is also difficult to score performance assessments such as oral
presentations and research papers, among others.
Objectivity is also an issue. Selected-response tests are objectively
marked because each item has one correct or best answer.
McMillan (2007) suggests that for performance assessments, it is
more practical to use rating scales and checklists rather than writing
extended individualized evaluations.
Ease of Interpretation
The teacher is able to determine right away if the student passed the
test by establishing a standard.
In performance tasks, rubrics are prepared to objectively and
expeditiously rate students' actual performance or product.
Cost
Classroom tests are generally inexpensive compared to national or
high- stakes tests.
As for performance tasks, examples of tasks that are not
considerably costly are written and oral reports, debates, and panel
discussions.
Ethics
Aspects of Fairness:
1. Students' knowledge of learning targets and assessment
2. Opportunity to learn
3. Prerequisite knowledge and skills
4. Avoiding student stereotyping
5. Avoiding bias in assessment tasks and procedures
6. Accommodating special needs
2. Opportunity to Learn
There are teachers who are forced to give reading assignments because
of the breadth of content that has to be covered in addition to limited or lost
classroom time. This puts students at a disadvantage because they are not
given ample time and resources to sufficiently assimilate the material.
McMillan (2007) asserted that fair assessments are aligned with instruction
that provides adequate time and opportunities for all students to learn.
4. Avoiding Stereotyping
A stereotype is a generalization about a group based on inconclusive
observations of a small sample of that group. Common stereotypes are
racial, sexual, and gender remarks. Stereotyping is caused by preconceived
judgments of the people one comes in contact with, which are sometimes
unintended. The teacher should avoid terms that may be offensive to students
of a different gender, race, religion, culture, or nationality.
Relevance
Assessment should be set in a context that students will find purposeful.
Assessment should reflect the knowledge and skills that are most
important for students to learn.
Assessment should support every student’s opportunity to learn things
that are important.
Assessment should tell teachers and students something that they
do not already know.
Planning the Test
Classroom Test and General Purpose
Classroom tests play a central role in the evaluation of student learning.
They provide relevant measures of many important learning outcomes and
indirect evidence concerning others.
Motivate the students
Maintain learning atmosphere
Measure achievement
Check instructional effectiveness
Identify areas for review
Table of Specification
This is the blueprint of the test, usually prepared by the teacher when
constructing a summative test.
Disadvantage
1. It is time-consuming on the part of the teacher.
Teacher-Made Test
1. Objective Type - answers are in the form of a single word, phrase, or
symbol.
a. Constructed- Response items
This is also known as supply test. This type of test requires the
student to construct or produce a response to a question or task.
b. Selected- Response items
This is the selection type test because the students respond to
each question by selecting an answer from the given choices.
2. Essay Type
This type of examination consists of questions that require the
student to respond at length; it tests the students' ability to express ideas
accurately and to think critically.
Be sure that the problem posed is clear and specific.
Be sure that each item is independent of all other items.
Be sure the item has one correct or best answer.
Prevent unintended clues to the answer in the statement or question.
Selecting and Constructing Test Items and Tasks
Remembering
Being able to recall.
Conceptual knowledge
Necessary in learning interrelationships among basic
elements and in learning methods, strategies, and procedures.
Procedural knowledge
Application of skills and procedures learned in new situations
(knowledge as learned way of doing things).
Knowledge Continuum: Knowledge → Simple Understanding → Deep
Understanding
Level 6: Creating
Plan
Generate
Produce
Design
Utilization of Assessment Data
Results of tests are in the form of scores, and these may be raw scores;
percentile ranks; z-scores; T-scores; stanines; or level, category, or
proficiency scores (Harris, 2003).
Percentage score
- is useful in describing a student’s performance based on a criterion.
Formula: Percentage Score = (Raw Score / Number of Items) × 100
Example: (30 / 50) × 100 = 60%
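A minimal sketch of this computation in Python (the function and variable
names are illustrative):

    def percentage_score(raw_score, number_of_items):
        # Percentage Score (PS) = (raw score / number of items) x 100
        return raw_score / number_of_items * 100

    print(percentage_score(30, 50))  # 60.0, matching the worked example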
Percentile Rank
A percentile rank gives the percent of the scores that are at or below a
raw or standard score. It is used to rank students in a reference sample.
Formula: R = (P/100)(N + 1)
where:
R - the position of the Pth percentile in the ordered data
P - the percentile
N - the number of values in the data set
Example: Consider a data set of the following numbers: 122, 112, 114, 17,
118, 116, 111, 115, and 112. Calculate the 25th percentile.
R = (25/100)(9 + 1) = 2.5th position
Arranging the values in ascending order (17, 111, 112, 112, 114, 115, 116,
118, 122), we take the average of the 2nd and 3rd values:
(111 + 112)/2 = 111.5
The 25th percentile is therefore 111.5.
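The same steps can be sketched in Python (the helper name is illustrative):

    def percentile_position(p, n):
        # position R of the Pth percentile among n ordered values: R = (P/100)(N + 1)
        return p / 100 * (n + 1)

    scores = sorted([122, 112, 114, 17, 118, 116, 111, 115, 112])
    r = percentile_position(25, len(scores))  # 2.5 -> between the 2nd and 3rd values
    second, third = scores[1], scores[2]      # 1-based positions 2 and 3
    print((second + third) / 2)               # 111.5, matching the worked example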
Standard Score
It is difficult to use raw scores when making comparisons between
groups on different tests, considering that tests may have different levels of
difficulty. To remedy this, a raw score may be transformed into a derived
score. Derived scores are of two types: scores of relative standing and
developmental scores. Examples of the first type are percentiles, standard
scores, and stanines, while developmental scores include grade and age
equivalents.
As mentioned, a standard score is a derived score which utilizes the
normal curve to show how a student's performance compares with the
distribution of scores above and below the arithmetic mean. Among the
standard scores are the z-score, T-score, and stanine.
A normal curve represents a normal distribution - a symmetric
theoretical distribution. The mean (arithmetic average), median (score that
separates the upper and lower 50%), and mode (most frequently occurring
score) are located at the center of the bell curve, where the peak is.
Mean - is the most commonly used measure of the center of data and is
referred to as the "arithmetic average."
Example: Find the Grade Point Average (GPA) of Ritz Glenn for the first
semester of the school year 2010-2011. Use the table
x̄ = Σ(wᵢxᵢ) / Σwᵢ
The Grade Point Average of Ritz Glenn for the first semester of SY 2010-2011
is 1.23.
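A minimal sketch of the weighted-mean computation; since the module's
grade table is not reproduced here, the grade/unit pairs below are
hypothetical stand-ins:

    # x-bar = sum(w_i * x_i) / sum(w_i), weighting each grade by its units
    grades_and_units = [(1.25, 3), (1.00, 3), (1.50, 2), (1.25, 1)]  # hypothetical (grade, units)

    gpa = (sum(grade * units for grade, units in grades_and_units)
           / sum(units for _, units in grades_and_units))
    print(round(gpa, 2))  # weighted mean of the hypothetical grades: 1.22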
Median - is what divides the scores in the distribution into two equal parts.
Fifty percent (50%) lies below the median value and 50% lies above the
median value. It is also known as the middle score or the 50th percentile.
For classroom purposes, the first thing to do is arrange the scores in proper
order.
Example: Find the median score of 7 students in an English class.
X (scores in order): 19, 17, 16, 15, 10, 5, 2
Analysis: The median score is 15. Fifty percent (50%) or three of the scores
are above 15 (19, 17, 16), and 50% or three of the scores are below 15
(10, 5, 2).
Mode or the modal score is the score that occurs most often in the
distribution. It is classified as unimodal (only one mode), bimodal (two
modes), trimodal (three modes), or multimodal (more than three modes).
Example: Scores of 40 students in a science class on a test consisting of 60
items are tabulated below.
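Since the score table itself is not reproduced, here is a small sketch of the
median and mode using Python's statistics module, with the English-class
scores from above and a hypothetical data set for the mode:

    from statistics import median, multimode

    english_scores = [19, 17, 16, 15, 10, 5, 2]
    print(median(english_scores))  # 15: three scores lie above it, three below

    sample = [20, 25, 25, 30, 32, 32, 40]  # hypothetical scores
    print(multimode(sample))       # [25, 32] -> two modes, so the set is bimodal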
A. Z-score
The z-score gives the number of standard deviations a test score lies
above or below the mean. The formula is:
z = (x − x̄) / s  or  z = (x − μ) / σ
where:
z = z-value
x = raw score
x̄ = sample mean
s = sample standard deviation
μ = population mean
σ = population standard deviation
B. T-score
There are two possible signs of a z-score: positive z if the raw score is
above the mean and negative z if the raw score is below the mean. To avoid
confusion between negative and positive values, use the T-score to convert raw
scores. T-score is another type of standard score where the mean is 50 and
the standard deviation is 10. In the z-score the mean is 0 and the standard
deviation is one (1). To convert raw score to T-score, find first the z-score
equivalent of the raw score and use the formula T score = 10z + 50.
Example: Z score = 2
T score = 10 z + 50.
T score = 10 (2) + 50
T score = 70
Standard deviation of T distribution = 10
Mean of T distribution = 50
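Both conversions are one-liners; a minimal sketch (the mean and standard
deviation below are hypothetical):

    def z_score(x, mean, sd):
        # number of standard deviations the raw score lies above/below the mean
        return (x - mean) / sd

    def t_score(z):
        # T rescales z to a mean of 50 and an SD of 10: T = 10z + 50
        return 10 * z + 50

    z = z_score(85, 75, 5)  # hypothetical raw score 85, mean 75, SD 5
    print(z, t_score(z))    # 2.0 70.0, matching the worked example (z = 2 -> T = 70)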
C. Stanine
Stanine, short for standard nine, is a method of scaling scores on a
nine-point scale. A raw score is converted to a whole number from a low of 1
to a high of 9.
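One common way to approximate a stanine is through the z-score: stanines
have a mean of 5 and a standard deviation of 2, so stanine ≈ 2z + 5, rounded
and clipped to the 1-9 range. A sketch under that assumption:

    def stanine(z):
        # stanines: mean 5, SD 2; round 2z + 5 and clip to the 1-9 scale
        return max(1, min(9, round(2 * z + 5)))

    print(stanine(0.0))  # 5 (an average score)
    print(stanine(2.3))  # 9 (capped at the top of the scale)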
Improving a Classroom-Based Assessment Test
Judgmental Item-Improvement
This approach basically makes use of human judgment in reviewing
the items. The judges are the teachers themselves who know exactly what the
test is for, the instructional outcomes to be assessed, and the items' level of
difficulty appropriate to his/her class; the teacher's peers or colleagues who
are familiar with the curriculum standards for the target grade level, the
subject matter content, and the ability of the learners; and the students
themselves who can perceive difficulties based on their past experiences.
There are five suggestions given by Popham (2011, p. 253) for teachers
to follow in exercising judgment:
3. Accuracy of content.
This review should especially be considered when a test developed some
time ago is used again. Changes that may occur due to new
discoveries or developments can redefine the test content of a summative
test. If this happens, the items or the key to correction may have to be
revisited.
The teacher always ensures that the assessment tool matches what is
currently required to be learned. This is a way to check on the content validity
of the test.
5. Fairness.
The discussions on item-writing guidelines always warn against
unintentionally helping uninformed students obtain higher scores. Such flaws
are due to inadvertent grammatical clues, unattractive distracters, ambiguous
problems, and messy test instructions. Sometimes, unfairness can happen
because of undue advantage received by a particular group, such as those
seated in certain locations.
Peer Review
There are schools that encourage peer or collegial review of
assessment instruments among themselves. Time is provided for this activity
and it has almost always yielded good results for improving tests and
performance-based assessment tasks. During these teacher dyad or triad
sessions, those teaching the same subject area can openly review together
the classroom tests and tasks they have devised against some consensual
criteria.
Student Review
Engagement of students in reviewing items has become a laudable
practice for improving classroom tests. The judgment is based on the
students' experience in taking the test, their impressions and reactions during
the testing event. The process can be efficiently carried out through the use of
a review questionnaire.
1. If any of the items seemed confusing, which ones were they?
2. Did any items have more than one correct answer? If so, which ones?
3. Did any items have no correct answers? If so, which ones?
4. Were there words in any items that confused you? If so, which ones?
5. Were the directions for the test, or for particular subsections, unclear? If so,
which ones?
Difficulty Index
An item's difficulty index is obtained by calculating the p value (p)
which is the proportion of students answering the item correctly.
p = R/T
where p is the difficulty index
R = total number of students answering the item right
T = total number of students answering the item
The p-value ranges from 0.0 to 1.0, indicating difficulty from extremely
difficult (no one answered correctly) to extremely easy (everyone answered
correctly). For binary-choice items, there's a 50% probability of getting the
item correct, simply by chance. For multiple-choice items of four alternatives,
the chance of obtaining a correct answer by guessing is only 25%. This is an
advantage of using multiple-choice questions over binary-choice items. The
probability of getting a high score by chance is reduced.
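A minimal sketch of the p-value computation (the numbers are illustrative):

    def difficulty_index(right, total):
        # p = R/T: proportion of students answering the item correctly
        return right / total

    print(difficulty_index(36, 40))  # 0.9 -> a very easy item for this class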
Discrimination Index
As earlier mentioned, the power of an item to discriminate between
informed and uninformed groups, or between more knowledgeable and less
knowledgeable learners, is shown using the item-discrimination index (D).
This is an item statistic that can reveal useful information for improving an
item. Basically, an item-discrimination index shows the relationship between
a student's performance on an item (i.e., right or wrong) and his/her total
performance in the test, represented by the total score. Item-total correlation
is usually part of a package for item analysis. High item-total correlations
indicate that the items contribute well to the total score, so that responding
correctly to these items gives a better chance of obtaining relatively high
total scores in the whole test or subtest.
For classroom tests, the discrimination index shows if a difference
exists between the performance of those who scored high and those who
scored low in an item. As a general rule, the higher the discrimination index
(D), the more marked the magnitude of the difference, and thus, the more
discriminating the item.
D = Ru/Tu − RL/TL
where D is the item discrimination index
Ru = number in the upper group answering the item correctly
Tu = number of students in the upper group
RL = number in the lower group answering the item correctly
TL = number of students in the lower group
Another calculation can bring about the same result (Kubiszyn and Borich,
2010):
D = (Ru − RL) / T
where T is the number of students in either group.
As you can see, R/T is actually the p-value of an item, so getting D
amounts to getting the difference between the p-value for the upper half and
the p-value for the lower half. The formula for the discrimination index (D)
can therefore also be given as (Popham, 2011):
D = pu − pL
where pu is the p-value for the upper group (Ru/Tu)
and pL is the p-value for the lower group (RL/TL)
a. Score the test papers using a key to correction to obtain the total scores of
the students.
c. Split the test papers into halves: high group and low group.
For a class of 50 or fewer students, do a 50-50 split: take the upper half
as the HIGH group and the lower half as the LOW group.
For a big group of 100 or so, take the upper 25-27% and the lower
25-27%.
Maintain equal numbers of test papers for the upper and lower groups.
d. Obtain the p-value for the upper group and the p-value for the lower group:
p(upper) = Ru/Tu    p(lower) = RL/TL
e. Get the discrimination index by getting the difference between the p-values.
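The whole procedure reduces to a few lines; a sketch with illustrative counts
for a class of 40 split 50-50:

    def discrimination_index(right_upper, n_upper, right_lower, n_lower):
        # D = p(upper) - p(lower), following Popham (2011)
        return right_upper / n_upper - right_lower / n_lower

    d = discrimination_index(18, 20, 8, 20)
    print(round(d, 2))  # 0.5 -> the item clearly separates high and low scorers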
Distracter Analysis
Another empirical procedure to discover areas for item improvement
utilizes an analysis of the distribution of responses across the distracters.
Especially when the difficulty index and discrimination index of the item seem
to suggest it is a candidate for revision, distracter analysis becomes a
useful follow-up. It can detect differences in how the more able students
respond to the distracters in a multiple-choice item compared to how the less
able ones do. It can also provide an index of the plausibility of the
alternatives, that is, whether they are functioning as good distracters.
Distracters that fail to attract responses may be revised to increase their
attractiveness.
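A sketch of how such a tally might be done, with hypothetical responses to
one item whose key is "B":

    from collections import Counter

    upper = ["B", "B", "A", "B", "B", "C", "B", "B", "D", "B"]  # hypothetical
    lower = ["A", "C", "B", "A", "D", "C", "B", "A", "D", "C"]  # hypothetical

    print("upper:", Counter(upper))  # most able students choose the key
    print("lower:", Counter(lower))  # responses spread across the distracters
    # An option that attracts no one in either group is implausible and a
    # candidate for revision.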
An item's sensitivity to instruction (Si) is the difference between its
p-value for the post-test and its p-value for the pre-test:
Si = P(post) − P(pre)
For example, if an item's p-value for the post-test is .80 while for the
pre-test it is .10, then Si = .80 − .10 = .70.
Notice that the calculation for Si carries the same concept as the
discrimination index: the difference in proportion is obtained between a
post-test and a pre-test given to the same group. Similar to the interpretation
of D, the higher the Si value, the more sensitive the item is in showing change
as a result of instruction. This item statistic gives additional information
regarding the efficiency and validity of the item.
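A minimal sketch (rounded to sidestep floating-point noise):

    def sensitivity_to_instruction(p_post, p_pre):
        # Si = P(post) - P(pre): gain in the proportion answering correctly
        return p_post - p_pre

    print(round(sensitivity_to_instruction(0.80, 0.10), 2))  # 0.7, as in the example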
Curriculum-Based Assessment (Progress Monitoring)
Component weights by track (Written Work | Performance Tasks | Quarterly
Assessment):
Academic Track
Work Immersion/Research/Business Enterprise
Simulation/Exhibit/Performance: 35% | 40% | 25%
Technical/Vocational and Livelihood (TVL)/Sports/Arts and Design Track
All other subjects: 20% | 60% | 20%
Work Immersion/Research/Exhibit/Performance: 20% | 60% | 20%
When scoring tests and other assessments, the raw scores are totaled
for each level of assessment, then the percentage scores are calculated.
After that, the corresponding percentage weights are applied. As a case in
point, suppose a Grade 2 pupil acquired a total score of 64 out of 80 points
in Science, specifically on WW. The percentage score (PS) is 80, calculated
by dividing the total raw score by the highest possible score and multiplying
by 100%. To get the weighted score (WS), we take 40% of the PS, which is
equal to 32. The same procedure is applied to the other components.
After the weighted scores for all three components are obtained, the initial
grade is calculated. This is simply the sum of the weighted scores of the WW,
PT, and QA. Finally, the initial grade is transmuted using a standard
transmutation table. Refer to the table below.
The NGS does not follow a zero-based system. An initial grade of 60 is
equivalent to the passing transmuted grade of 75. The floor grade is 60,
meaning a score of zero does not convert to an equivalent grade of zero but
to 60.
In the table above, the percentage scores for WW, PT, and QA are 73.33,
78.33, and 66.67, respectively. For the indicated learning area (Mathematics)
and grade level (4), the weights are as follows: 40% for WW, 40% for PT, and
20% for QA. To get the weighted score of each component, we obtain the
product of the percentage score and its respective weight. The initial grade
is the sum of the weighted scores, i.e., 29.33 + 31.33 + 13.33. This yields
73.99. Using the transmutation table, we find the transmuted grade is 83.
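The computation can be sketched as follows; the final transmutation needs
DepEd's table, which is not reproduced here, so only the initial grade is
computed:

    # Weights for Grade 4 Mathematics per the module: WW 40%, PT 40%, QA 20%
    components = {
        "WW": (73.33, 0.40),  # (percentage score, weight)
        "PT": (78.33, 0.40),
        "QA": (66.67, 0.20),
    }

    weighted = {k: round(ps * w, 2) for k, (ps, w) in components.items()}
    print(weighted)                          # {'WW': 29.33, 'PT': 31.33, 'QA': 13.33}
    print(round(sum(weighted.values()), 2))  # 73.99 -> transmutes to 83 per the table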
Grading Systems
Is a process of assigning a numerical value, letter or symbol to
represent student knowledge and performance.
The process of judging the quality of a pupil’s performance is called
Grading
Six roles of grading:
1. to communicate achievement status of students to parents;
2. to provide information to students for self-evaluation;
3. to identify or sort students for specific programs;
4. to provide incentives for students to learn;
5. to evaluate effectiveness of instructional programs;
6. to provide evidence of a student’s lack of effort
Criterion-Referenced
The description of the individual's performance in the test is generally
given in terms of the percentage of correct responses to items in a clearly
defined learning task.
Approaches to Grading
Numerical grades (100, 99, 98, ...): the system using numerical grades is
popular and well understood.
Letter grades (A, B, C, etc.): letter grades appear to be less
intimidating compared to numerical grades.
Two-category grades (Pass-Fail; Satisfactory-Unsatisfactory): this is
less stressful for students because they need not fear low grade point
averages.
Checklists
A checklist may be simple or elaborate.
Standards-based (advanced, proficient, beginning, or similar) requires
teachers to base grades on definite learning standards.
Grading issues
1. consider what type of student performances you need.
2. consider how to make marking scales consistent throughout the
marking period.
3. decide on the grade components and their respective weights.
4. consider the standards or boundaries for each letter grade.
5. decide on borderline cases.
6. be concerned with the issues of failures.
7. be concerned with the practice of assigning zero for a mark.
The Large-Scale Assessment Framework
Addresses the products of learning.
Includes a range of cognitive engagement across language proficiency
levels.
Is supported graphically or visually at the lower language proficiency
levels.
Ensures the use of grade level materials at the uppermost language
proficiency level.
Serves as a resource for all teachers.
Micro-level Assessment
Types:
Classroom summative assessment: curriculum-based assessments; state or
district interim assessments; teacher-created tests.
Formative assessment:
Informal: moment-to-moment observations, conversations with teachers,
peers, etc. (short-cycle).
Formal: periodic probes and tasks (medium/long cycle).
Intended Uses:
Classroom summative assessment: monitoring progress and achievement;
tracking/placement; district/parent reporting; diagnostic.
Formative assessment: generate feedback to teachers (contingent
instruction) and students (to guide learning) during instruction.
Key Considerations for ELs:
Teacher professional development for implementation and unbiased
interpretation.
Innovative practices, including: bilingual assessment; implementation of
learning progressions; social interaction of ELs and social inclusion of ELs
in the classroom.
Guiding Questions:
What does effective classroom assessment of the STEM disciplines look
like for EL students?
What should teachers know and be able to do to serve EL students through
classroom assessment?
A Balanced System of Assessment
Essential Question:
Use of Large-Scale Assessment
On the international scene, there are standardized examinations
developed to monitor changes in educational achievement in basic
school subjects over time. During the last decade or so, international
assessments for science and mathematics (Trends in International
Mathematics and Science Study), reading (Progress in International
Reading Literacy Study), and civics (International Civics and Citizenship
Education Study) have been coordinated with participating countries
interested in knowing the status of their students in comparison with the
rest of the world.
Development of Large-Scale Student Assessment
Step 1: Defining Objectives
• Who will take the test and for what purpose?
• What skills and/or areas of knowledge should be tested?
• What should test takers be able to do with their knowledge?
• What kinds of questions should be included? How many of each kind?
• How long should the test be?
• How difficult should the test be?
• Are there questions on which one group consistently performs better
than other groups?
• Which items need further revision or removal before the final version
is made?
Establishing Validity of Tests
Validity is regarded as the basic requirement of every test. It refers to
the degree to which a test measures what is intended to be measured.
Can the test perform its intended function?
This is the business of validity and the approach adopted by the classical
model for regarding validity. There are three conventional types of validity:
content validity, criterion-related validity, and construct validity (Anastasi
and Urbina, 1997).
Establishing Validity
Types of test validity: content validity, criterion-related validity, and
construct validity.
Criterion-related (concurrent) validity: the test is compared with valued
measures other than the test itself, i.e., another measure of performance
obtained concurrently (for estimating present status). Primarily statistical:
correlate test results with the outside criterion.
Establishing Reliability
Type of Reliability Measure | Method | Procedure
2. Measure of equivalence | Equivalent-forms method | Give two forms of a
test to the same group in close succession (Pearson r).
3. Measure of stability and equivalence | Test-retest with equivalent forms |
Give two forms of a test to the same group with increased time intervals
between forms (Pearson r).
4. Measure of internal consistency | Kuder-Richardson method | Give a test
once. Score the total test and apply the Kuder-Richardson formula.
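For the equivalence method, the Pearson r between the two forms can be
computed directly; a sketch with hypothetical scores (statistics.correlation
requires Python 3.10+):

    from statistics import correlation

    form_a = [40, 35, 48, 30, 42, 38, 45]  # hypothetical scores on form A
    form_b = [38, 36, 50, 28, 44, 35, 46]  # same students' scores on form B
    print(round(correlation(form_a, form_b), 2))  # Pearson r as the reliability estimate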