Assessment in Learning Module
1. Placement Test
It is used to determine a learner's entry performance.
Teachers assess through a readiness pre-test.
It is used to determine if students have already acquired the intended
outcomes.
Typically focuses on the questions "Does the learner possess the
knowledge and skills needed to begin the planned instruction?" and "To
what extent has the learner already developed the understanding and
skills that are the goals of the planned objectives?"
A placement pre-test contains items that measure the knowledge and skills
of students in reference to the learning targets.
2. Formative Assessment
There is now a shift from a testing culture to an assessment culture
characterized by the integration of assessment and instruction (Dochy,
2001)
It mediates the teaching and learning process
The main concern of a classroom teacher is to monitor the learning
progress of the students.
The teacher should assess whether the students achieved the intended
learning outcomes set for a particular lesson
Its purposes are to provide immediate feedback, identify learning errors,
modify instruction, and improve both learning and instruction.
It is learner-centered and teacher-directed.
It occurs during instruction and is context-specific.
Muddiest point: a technique that can be used to address gaps in learning.
Background knowledge probe: a short and simple questionnaire given at the
start of the lesson.
Used as feedback to enhance teaching and improve the process of
learning
An on-going process
3. Diagnostic Assessment
Intended to identify learning difficulties during instruction
Can detect commonly held misconceptions in a subject
Are not merely given at the start of instruction
Used to detect causes of persistent learning difficulties
4. Summative Assessment
Is done at the end of instruction to determine the extent to which the
students have attained the learning outcomes
Is used for assigning and reporting grades or certifying mastery of
concepts and skills
5. Interim Assessment
have the same purpose as formative assessments, but these are
given periodically throughout the school year
prepare students for future assessments
fall between formative and summative assessments
Principles of High Quality Assessment
Practicing - Trying a specific physical activity over and over.
Sample verbs: bend, calibrate, construct, differentiate, dismantle, fasten,
fix, grasp, grind, handle, measure, mix, organize, operate, manipulate, mend.
Example: Display several dance steps in sequence.
Internalizing a value complex - Acting consistently with the new value.
Sample verbs: perform, practice, propose, qualify, question, revise, serve,
use, verify.
1. Selected-Response Format
• Students need only to recognize and select the correct answer.
• Although some can be composed to address higher-order thinking skills,
most require only identification and recognition.
2. Constructed-Response Format
• Demands that students create or produce their own answers in
response to a question, problem, or task.
• Useful in targeting higher levels of cognition.
• An essay item that requires only a few sentences is called restricted-
response.
3. Oral Assessment
• A common assessment method during instruction to check on
student understanding.
• Oral questioning may take the form of an interview or
conference.
4. Teacher Observation
• A form of on-going assessment, usually done in combination
with oral questioning.
• Can also be used to assess the effectiveness of teaching
strategies and academic interventions.
5. Student Self-Assessment
• One of the standards of quality assessment identified by
Chappuis, Chappuis and Stiggins (2009).
• A process in which the students are given a chance to reflect
and rate their own work and judge how well they have performed
in relation to a set of assessment criteria.
• Skills
• To assess skills, performance assessment is obviously the
superior assessment method.
• It becomes an “authentic assessment” when used in real-life
and meaningful context.
• Products
• Most adequately assessed through performance tasks.
• It is a substantial and tangible output that showcases students'
understanding of concepts and skills and their ability to apply,
analyze, evaluate and integrate those concepts and skills.
• Examples: musical compositions, stories, poems, research
studies, drawings, model constructions and multimedia
materials.
• Affect
• Student affect cannot be assessed simply by selected-response
or brief constructed-response tests.
• It pertains to the attitudes, interests, and values students manifest.
• Best method for this learning target is self-assessment.
• Oral questioning may also work in assessing affective traits.
Validity
Types of Validity
1. Content Validity - is the extent to which the content of the test matches the
instructional objectives.
3. Criterion-Related Validity - is the extent to which scores on the test
agree with (concurrent validity) or predict (predictive validity) an external
criterion.
Reliability
Types of Reliability
1. Test-Retest Reliability
3. Parallel-Forms Reliability
4. Inter-rater Reliability - the extent to which two or more
judges or raters agree in their assessment decisions.
Factors affecting reliability include:
Group variability
Environmental conditions
Practicality and Efficiency
Refers to the teacher's familiarity with the methods used, the time required
for the assessment, the complexity of the administration, ease of scoring,
ease of interpretation of the test results, and the low cost of the materials
used.
Time Required
Assessment should allow students to respond readily but not hastily.
Assessment should also be scored promptly but not without basis. It
should be noted that time is a matter of choice: it hinges on the teacher's
choice of assessment method.
Ease in Administration
Assessment should be easy to administer. To avoid questions during
the test or performance task, instructions must be clear and complete.
Instruction that is vague will confuse the students and they will
consequently provide incorrect responses. This may be a problem with
performance assessments that contain long and complicated directions
and procedures.
Ease of Scoring
Obviously, selected-response formats are the easiest to score,
compared to restricted-response and, even more so, extended-response
essays. It is also difficult to score performance assessments such as oral
presentations and research papers, among others.
Objectivity is also an issue. Selected-response tests are objectively
marked because each item has one correct or best answer.
McMillan (2007) suggests that for performance assessments, it is
more practical to use rating scales and checklists rather than writing
extended individualized evaluations.
Ease of Interpretation
The teacher is able to determine right away if the student passed the
test by establishing a standard.
In performance tasks, rubrics are prepared to objectively and
expeditiously rate students' actual performance or product.
Cost
Classroom tests are generally inexpensive compared to national or
high- stakes tests.
As for performance tasks, examples of tasks that are not
considerably costly are written and oral reports, debates, and panel
discussions.
Ethics
Aspects of Fairness:
1. Students' knowledge of learning targets and assessment
2. Opportunity to learn
3. Prerequisite knowledge and skills
4. Avoiding student stereotyping
5. Avoiding bias in assessment tasks and procedures
6. Accommodating special needs
2. Opportunity to Learn
There are teachers who are forced to give reading assignments because
of the breadth of content that has to be covered in addition to limited or lost
classroom time. This puts students at a disadvantage because they are not
given ample time and resources to sufficiently assimilate the material.
McMillan (2007) asserted that fair assessments are aligned with instruction
that provides adequate time and opportunities for all students to learn.
4. Avoiding Stereotyping
A stereotype is a generalization about a group based on inconclusive
observations of a small sample of that group. Common stereotypes are
racial, sexual, and gender remarks. Stereotyping is caused by preconceived
judgments of the people one comes in contact with, which are sometimes
unintended. The teacher should avoid terms that may be offensive to students
of a different gender, race, religion, culture, or nationality.
Relevance
Assessment should be set in a context that students will find purposeful.
Assessment should reflect the knowledge and skills that are most
important for students to learn.
Assessment should support every student’s opportunity to learn things
that are important.
Assessment should tell teachers and students something that they
do not already know.
Planning the Test
Classroom Test and General Purpose
Classroom tests play a central role in the evaluation of student learning.
They provide relevant measures of many important learning outcomes and
indirect evidence concerning others.
Motivate the students
Maintain learning atmosphere
Measure achievement
Check instructional effectiveness
Identify areas for review
Table of Specification
This is the blueprint of the test, usually prepared by the teacher when
constructing a summative test.
Disadvantage
1. It is time-consuming on the part of the teacher.
Teacher-Made Test
1. Objective Type - answers are in the form of a single word, phrase, or
symbol.
a. Constructed- Response items
This is also known as supply test. This type of test requires the
student to construct or produce a response to a question or task.
b. Selected- Response items
This is the selection type test because the students respond to
each question by selecting an answer from the given choices.
2. Essay Type
This type of examination consists of questions that require the
student to respond at length; it tests the students' ability to express ideas
accurately and to think critically.
Be sure that the problem posed is clear and specific.
Be sure that each item is independent of all other items.
Be sure the item has one correct or best answer.
Prevent unintended clues to the answer in the statement or question.
Selecting and Constructing Test Items and Tasks
Remembering
Being able to recall.
Conceptual knowledge
Necessary in learning interrelationships among basic
elements and in learning methods, strategies, and procedures.
Procedural knowledge
Application of skills and procedures learned in new situations
(knowledge as learned way of doing things).
Knowledge Continuum: Knowledge → Simple Understanding → Deep
Understanding
Level 6: Creating
Plan
Generate
Produce
Design
Utilization of Assessment Data
Results of tests are in the form of scores, and these may be raw scores;
percentile ranks; z-scores; T-scores; stanines; or level, category, or
proficiency scores (Harris, 2003).
Percentage score
- is useful in describing a student’s performance based on a criterion.
Formula: Percentage Score = (Raw Score / Number of Items) × 100
Example: (30 / 50) × 100 = 60%
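A minimal sketch of this computation in Python (the function and variable
names are illustrative):

    def percentage_score(raw_score, number_of_items):
        # Percentage Score (PS) = (raw score / number of items) x 100
        return raw_score / number_of_items * 100

    print(percentage_score(30, 50))  # 60.0, matching the worked example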
Percentile Rank
A percentile rank gives the percent of the scores that are at or below a
raw or standard score. It is used to rank students in a reference sample.
Formula: R = (P/100)(N + 1)
where:
R - the position of the Pth percentile in the ordered data
P - the percentile
N - the number of values in the data set
Example: Consider a data set of the following numbers: 122, 112, 114, 17,
118, 116, 111, 115, and 112. Calculate the 25th percentile.
R = (25/100)(9 + 1) = 2.5th position
Arranging the values in ascending order (17, 111, 112, 112, 114, 115, 116,
118, 122), we take the average of the 2nd and 3rd values:
(111 + 112)/2 = 111.5
The 25th percentile is therefore 111.5.
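The same steps can be sketched in Python (the helper name is illustrative):

    def percentile_position(p, n):
        # position R of the Pth percentile among n ordered values: R = (P/100)(N + 1)
        return p / 100 * (n + 1)

    scores = sorted([122, 112, 114, 17, 118, 116, 111, 115, 112])
    r = percentile_position(25, len(scores))  # 2.5 -> between the 2nd and 3rd values
    second, third = scores[1], scores[2]      # 1-based positions 2 and 3
    print((second + third) / 2)               # 111.5, matching the worked example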
Standard Score
It is difficult to use raw scores when making comparisons between
groups on different tests, considering that tests may have different levels of
difficulty. To remedy this, a raw score may be transformed into a derived
score. Derived scores are of two types: scores of relative standing and
developmental scores. Examples of the first type are percentiles, standard
scores, and stanines, while developmental scores include grade and age
equivalents.
As mentioned, a standard score is a derived score which utilizes the
normal curve to show how a student's performance compares with the
distribution of scores above and below the arithmetic mean. Among the
standard scores are the z-score, T-score, and stanine.
A normal curve represents a normal distribution - a symmetric
theoretical distribution. The mean (arithmetic average), median (score that
separates the upper and lower 50%), and mode (most frequently occurring
score) are located at the center of the bell curve, where the peak is.
Mean - is the most commonly used measure of the center of data and is
referred to as the "arithmetic average."
Example: Find the Grade Point Average (GPA) of Ritz Glenn for the first
semester of the school year 2010-2011. Use the table
x̄ = Σ(wᵢxᵢ) / Σwᵢ
The Grade Point Average of Ritz Glenn for the first semester of SY 2010-2011
is 1.23.
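A minimal sketch of the weighted-mean computation; since the module's
grade table is not reproduced here, the grade/unit pairs below are
hypothetical stand-ins:

    # x-bar = sum(w_i * x_i) / sum(w_i), weighting each grade by its units
    grades_and_units = [(1.25, 3), (1.00, 3), (1.50, 2), (1.25, 1)]  # hypothetical (grade, units)

    gpa = (sum(grade * units for grade, units in grades_and_units)
           / sum(units for _, units in grades_and_units))
    print(round(gpa, 2))  # weighted mean of the hypothetical grades: 1.22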
Median - is what divides the scores in the distribution into two equal parts.
Fifty percent (50%) lies below the median value and 50% lies above the
median value. It is also known as the middle score or the 50th percentile.
For classroom purposes, the first thing to do is arrange the scores in proper
order.
Example: Find the median score of 7 students in an English class.
X (scores in order): 19, 17, 16, 15, 10, 5, 2
Analysis: The median score is 15. Fifty percent (50%) or three of the scores
are above 15 (19, 17, 16), and 50% or three of the scores are below 15
(10, 5, 2).
Mode or the modal score is the score that occurs most often in the
distribution. It is classified as unimodal (only one mode), bimodal (two
modes), trimodal (three modes), or multimodal (more than three modes).
Example: Scores of 40 students in a science class on a test consisting of 60
items are tabulated below.
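Since the score table itself is not reproduced, here is a small sketch of the
median and mode using Python's statistics module, with the English-class
scores from above and a hypothetical data set for the mode:

    from statistics import median, multimode

    english_scores = [19, 17, 16, 15, 10, 5, 2]
    print(median(english_scores))  # 15: three scores lie above it, three below

    sample = [20, 25, 25, 30, 32, 32, 40]  # hypothetical scores
    print(multimode(sample))       # [25, 32] -> two modes, so the set is bimodal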
A. Z-score
The z-score gives the number of standard deviations a test score lies
above or below the mean. The formula is:
z = (x − x̄) / s  or  z = (x − μ) / σ
where:
z = z-value
x = raw score
x̄ = sample mean
s = sample standard deviation
μ = population mean
σ = population standard deviation
B. T-score
There are two possible signs of a z-score: positive z if the raw score is
above the mean and negative z if the raw score is below the mean. To avoid
confusion between negative and positive values, use the T-score to convert raw
scores. T-score is another type of standard score where the mean is 50 and
the standard deviation is 10. In the z-score the mean is 0 and the standard
deviation is one (1). To convert raw score to T-score, find first the z-score
equivalent of the raw score and use the formula T score = 10z + 50.
Example: Z score = 2
T score = 10 z + 50.
T score = 10 (2) + 50
T score = 70
Standard deviation of T distribution = 10
Mean of T distribution = 50
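Both conversions are one-liners; a minimal sketch (the mean and standard
deviation below are hypothetical):

    def z_score(x, mean, sd):
        # number of standard deviations the raw score lies above/below the mean
        return (x - mean) / sd

    def t_score(z):
        # T rescales z to a mean of 50 and an SD of 10: T = 10z + 50
        return 10 * z + 50

    z = z_score(85, 75, 5)  # hypothetical raw score 85, mean 75, SD 5
    print(z, t_score(z))    # 2.0 70.0, matching the worked example (z = 2 -> T = 70)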
C. Stanine
Stanine, short for standard nine, is a method of scaling scores on a
nine-point scale. A raw score is converted to a whole number from a low of 1
to a high of 9.
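One common way to approximate a stanine is through the z-score: stanines
have a mean of 5 and a standard deviation of 2, so stanine ≈ 2z + 5, rounded
and clipped to the 1-9 range. A sketch under that assumption:

    def stanine(z):
        # stanines: mean 5, SD 2; round 2z + 5 and clip to the 1-9 scale
        return max(1, min(9, round(2 * z + 5)))

    print(stanine(0.0))  # 5 (an average score)
    print(stanine(2.3))  # 9 (capped at the top of the scale)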
Improving a Classroom-Based Assessment Test
Judgmental Item-Improvement
This approach basically makes use of human judgment in reviewing
the items. The judges are the teachers themselves who know exactly what the
test is for, the instructional outcomes to be assessed, and the items' level of
difficulty appropriate to his/her class; the teacher's peers or colleagues who
are familiar with the curriculum standards for the target grade level, the
subject matter content, and the ability of the learners; and the students
themselves who can perceive difficulties based on their past experiences.
There are five suggestions given by Popham (2011, p. 253) for teachers
to follow in exercising judgment:
3. Accuracy of content.
This review should especially be considered when a test developed some
time ago is used again. Changes that may occur due to new
discoveries or developments can redefine the test content of a summative
test. If this happens, the items or the key to correction may have to be
revisited.
The teacher always ensures that the assessment tool matches what is
currently required to be learned. This is a way to check on the content validity
of the test.
5. Fairness.
The discussions on item-writing guidelines always warn against
unintentionally helping uninformed students obtain higher scores. Such flaws
are due to inadvertent grammatical clues, unattractive distracters, ambiguous
problems, and messy test instructions. Sometimes, unfairness can happen
because of undue advantage received by a particular group, such as those
seated in certain locations.
Peer Review
There are schools that encourage peer or collegial review of
assessment instruments among themselves. Time is provided for this activity
and it has almost always yielded good results for improving tests and
performance-based assessment tasks. During these teacher dyad or triad
sessions, those teaching the same subject area can openly review together
the classroom tests and tasks they have devised against some consensual
criteria.
Student Review
Engagement of students in reviewing items has become a laudable
practice for improving classroom tests. The judgment is based on the
students' experience in taking the test, their impressions and reactions during
the testing event. The process can be efficiently carried out through the use of
a review questionnaire.
1. If any of the items seemed confusing, which ones were they?
2. Did any items have more than one correct answer? If so, which ones?
3. Did any items have no correct answers? If so, which ones?
4. Were there words in any items that confused you? If so, which ones?
5. Were the directions for the test, or for particular subsections, unclear? If so,
which ones?
Difficulty Index
An item's difficulty index is obtained by calculating the p value (p)
which is the proportion of students answering the item correctly.
p = R/T
where p is the difficulty index
R = total number of students answering the item right
T = total number of students answering the item
The p-value ranges from 0.0 to 1.0, indicating difficulty from extremely
difficult (no one answered correctly) to extremely easy (everyone answered
correctly). For binary-choice items, there's a 50% probability of getting the
item correct, simply by chance. For multiple-choice items of four alternatives,
the chance of obtaining a correct answer by guessing is only 25%. This is an
advantage of using multiple-choice questions over binary-choice items. The
probability of getting a high score by chance is reduced.
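A minimal sketch of the p-value computation (the numbers are illustrative):

    def difficulty_index(right, total):
        # p = R/T: proportion of students answering the item correctly
        return right / total

    print(difficulty_index(36, 40))  # 0.9 -> a very easy item for this class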
Discrimination Index
As earlier mentioned, the power of an item to discriminate between
informed and uninformed groups, or between more knowledgeable and less
knowledgeable learners, is shown using the item-discrimination index (D).
This is an item statistic that can reveal useful information for improving an
item. Basically, an item-discrimination index shows the relationship between
a student's performance on an item (i.e., right or wrong) and his/her total
performance in the test, represented by the total score. Item-total correlation
is usually part of a package for item analysis. High item-total correlations
indicate that the items contribute well to the total score, so that responding
correctly to these items gives a better chance of obtaining relatively high
total scores in the whole test or subtest.
For classroom tests, the discrimination index shows if a difference
exists between the performance of those who scored high and those who
scored low in an item. As a general rule, the higher the discrimination index
(D), the more marked the magnitude of the difference, and thus, the more
discriminating the item.
D = Ru/Tu − RL/TL
where D is the item discrimination index
Ru = number in the upper group answering the item correctly
Tu = number of students in the upper group
RL = number in the lower group answering the item correctly
TL = number of students in the lower group
Another calculation can bring about the same result (Kubiszyn and Borich,
2010):
D = (Ru − RL) / T
where T is the number of students in either group.
As you can see, R/T is actually the p-value of an item, so getting D
amounts to getting the difference between the p-value for the upper half and
the p-value for the lower half. The formula for the discrimination index (D)
can therefore also be given as (Popham, 2011):
D = pu − pL
where pu is the p-value for the upper group (Ru/Tu)
and pL is the p-value for the lower group (RL/TL)
a. Score the test papers using a key to correction to obtain the total scores of
the students.
c. Split the test papers into halves: high group and low group.
For a class of 50 or fewer students, do a 50-50 split: take the upper half
as the HIGH group and the lower half as the LOW group.
For a big group of 100 or so, take the upper 25-27% and the lower
25-27%.
Maintain equal numbers of test papers for the upper and lower groups.
d. Obtain the p-value for the upper group and the p-value for the lower group:
p(upper) = Ru/Tu    p(lower) = RL/TL
e. Get the discrimination index by getting the difference between the p-values.
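The whole procedure reduces to a few lines; a sketch with illustrative counts
for a class of 40 split 50-50:

    def discrimination_index(right_upper, n_upper, right_lower, n_lower):
        # D = p(upper) - p(lower), following Popham (2011)
        return right_upper / n_upper - right_lower / n_lower

    d = discrimination_index(18, 20, 8, 20)
    print(round(d, 2))  # 0.5 -> the item clearly separates high and low scorers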
Distracter Analysis
Another empirical procedure to discover areas for item improvement
utilizes an analysis of the distribution of responses across the distracters.
Especially when the difficulty index and discrimination index of the item seem
to suggest it is a candidate for revision, distracter analysis becomes a
useful follow-up. It can detect differences in how the more able students
respond to the distracters in a multiple-choice item compared to how the less
able ones do. It can also provide an index of the plausibility of the
alternatives, that is, whether they are functioning as good distracters.
Distracters that fail to attract responses may be revised to increase their
attractiveness.
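A sketch of how such a tally might be done, with hypothetical responses to
one item whose key is "B":

    from collections import Counter

    upper = ["B", "B", "A", "B", "B", "C", "B", "B", "D", "B"]  # hypothetical
    lower = ["A", "C", "B", "A", "D", "C", "B", "A", "D", "C"]  # hypothetical

    print("upper:", Counter(upper))  # most able students choose the key
    print("lower:", Counter(lower))  # responses spread across the distracters
    # An option that attracts no one in either group is implausible and a
    # candidate for revision.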
An item's sensitivity to instruction (Si) is the difference between its
p-value for the post-test and its p-value for the pre-test:
Si = P(post) − P(pre)
For example, if an item's p-value for the post-test is .80 while for the
pre-test it is .10, then Si = .80 − .10 = .70.
Notice that the calculation for Si carries the same concept as the
discrimination index: the difference in proportion is obtained between a
post-test and a pre-test given to the same group. Similar to the interpretation
of D, the higher the Si value, the more sensitive the item is in showing change
as a result of instruction. This item statistic gives additional information
regarding the efficiency and validity of the item.
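A minimal sketch (rounded to sidestep floating-point noise):

    def sensitivity_to_instruction(p_post, p_pre):
        # Si = P(post) - P(pre): gain in the proportion answering correctly
        return p_post - p_pre

    print(round(sensitivity_to_instruction(0.80, 0.10), 2))  # 0.7, as in the example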
Curriculum-Based Assessment (Progress Monitoring)
Component weights by track (Written Work | Performance Tasks | Quarterly
Assessment):
Academic Track
Work Immersion/Research/Business Enterprise
Simulation/Exhibit/Performance: 35% | 40% | 25%
Technical/Vocational and Livelihood (TVL)/Sports/Arts and Design Track
All other subjects: 20% | 60% | 20%
Work Immersion/Research/Exhibit/Performance: 20% | 60% | 20%
When scoring tests and other assessments, the raw scores are totaled
for each level of assessment, then the percentage scores are calculated.
After that, the corresponding percentage weights are applied. As a case in
point, suppose a Grade 2 pupil acquired a total score of 64 out of 80 points
in Science, specifically on WW. The percentage score (PS) is 80, calculated
by dividing the total raw score by the highest possible score and multiplying
by 100%. To get the weighted score (WS), we take 40% of the PS, which is
equal to 32. The same procedure is applied to the other components.
After the weighted scores for all three components are obtained, the initial
grade is calculated. This is simply the sum of the weighted scores of the WW,
PT, and QA. Finally, the initial grade is transmuted using a standard
transmutation table. Refer to the table below.
The NGS does not follow a zero-based system. An initial grade of 60 is
equivalent to the passing transmuted grade of 75. The floor grade is 60,
meaning a score of zero does not convert to an equivalent grade of zero but
to 60.
In the table above, the percentage scores for WW, PT, and QA are 73.33,
78.33, and 66.67, respectively. For the indicated learning area (Mathematics)
and grade level (4), the weights are as follows: 40% for WW, 40% for PT, and
20% for QA. To get the weighted score of each component, we obtain the
product of the percentage score and its respective weight. The initial grade
is the sum of the weighted scores, i.e., 29.33 + 31.33 + 13.33. This yields
73.99. Using the transmutation table, we find the transmuted grade is 83.
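The computation can be sketched as follows; the final transmutation needs
DepEd's table, which is not reproduced here, so only the initial grade is
computed:

    # Weights for Grade 4 Mathematics per the module: WW 40%, PT 40%, QA 20%
    components = {
        "WW": (73.33, 0.40),  # (percentage score, weight)
        "PT": (78.33, 0.40),
        "QA": (66.67, 0.20),
    }

    weighted = {k: round(ps * w, 2) for k, (ps, w) in components.items()}
    print(weighted)                          # {'WW': 29.33, 'PT': 31.33, 'QA': 13.33}
    print(round(sum(weighted.values()), 2))  # 73.99 -> transmutes to 83 per the table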
Grading Systems
Is a process of assigning a numerical value, letter or symbol to
represent student knowledge and performance.
The process of judging the quality of a pupil’s performance is called
Grading
Six roles of grading:
1. to communicate achievement status of students to parents;
2. to provide information to students for self-evaluation;
3. to identify or sort students for specific programs;
4. to provide incentives for students to learn;
5. to evaluate effectiveness of instructional programs;
6. to provide evidence of a student’s lack of effort
Criterion-Referenced
The description of the individual's performance in the test is generally
given in terms of the percentage of correct responses to items in a clearly
defined learning task.
Approaches to Grading
Numerical grades (100, 99, 98, ...): the system using numerical grades is
popular and well understood.
Letter grades (A, B, C, etc.): letter grades appear to be less
intimidating compared to numerical grades.
Two-category grades (Pass-Fail; Satisfactory-Unsatisfactory): this is
less stressful for students because they need not fear low grade point
averages.
Checklists
A checklist may be simple or elaborate.
Standards-based (advanced, proficient, beginning, or similar) requires
teachers to base grades on definite learning standards.
Grading issues
1. consider what type of student performances you need.
2. consider how to make marking scales consistent throughout the
marking period.
3. decide on the grade components and their respective weights.
4. consider the standards or boundaries for each letter grade.
5. decide on borderline cases.
6. be concerned with the issues of failures.
7. be concerned with the practice of assigning zero for a mark.
The Large-Scale Assessment Framework
Addresses the products of learning.
Includes a range of cognitive engagement across language proficiency
levels.
Is supported graphically or visually at the lower language proficiency
levels.
Ensures the use of grade level materials at the uppermost language
proficiency level.
Serves as a resource for all teachers.
Micro-level Assessment
Types:
Classroom summative assessment: curriculum-based assessments; state or
district interim assessments; teacher-created tests.
Formative assessment:
Informal: moment-to-moment observations, conversations with teachers,
peers, etc. (short-cycle).
Formal: periodic probes and tasks (medium/long cycle).
Intended Uses:
Classroom summative assessment: monitoring progress and achievement;
tracking/placement; district/parent reporting; diagnostic.
Formative assessment: generate feedback to teachers (contingent
instruction) and students (to guide learning) during instruction.
Key Considerations for ELs:
Teacher professional development for implementation and unbiased
interpretation.
Innovative practices, including: bilingual assessment; implementation of
learning progressions; social interaction of ELs and social inclusion of ELs
in the classroom.
Guiding Questions:
What does effective classroom assessment of the STEM disciplines look
like for EL students?
What should teachers know and be able to do to serve EL students through
classroom assessment?
A Balanced System of Assessment
Essential Question:
Use of Large-Scale Assessment
On the international scene, there are standardized examinations
developed to monitor changes in educational achievement in basic
school subjects over time. During the last decade or so, international
assessments for science and mathematics (Trends in International
Mathematics and Science Study), reading (Progress in International
Reading Literacy Study), and civics (International Civics and Citizenship
Education Study) have been coordinated with participating countries
interested in knowing the status of their students in comparison with the
rest of the world.
Development of Large-Scale Student Assessment
Step 1: Defining Objectives
• Who will take the test and for what purpose?
• What skills and/or areas of knowledge should be tested?
• What should test takers be able to do with their knowledge?
• What kinds of questions should be included? How many of each kind?
• How long should the test be?
• How difficult should the test be?
• Are there questions on which one group consistently performs better
than other groups?
• Which items need further revision or removal before the final version
is made?
Establishing Validity of Tests
Validity is regarded as the basic requirement of every test. It refers to
the degree to which a test measures what is intended to be measured.
Can the test perform its intended function?
This is the business of validity and the approach adopted by the classical
model for regarding validity. There are three conventional types of validity:
content validity, criterion-related validity, and construct validity (Anastasi
and Urbina, 1997).
Establishing Validity
Types of test validity: content validity, criterion-related validity, and
construct validity.
Criterion-related (concurrent) validity: the test is compared with valued
measures other than the test itself, i.e., another measure of performance
obtained concurrently (for estimating present status). Primarily statistical:
correlate test results with the outside criterion.
Establishing Reliability
Type of Reliability Measure | Method | Procedure
2. Measure of equivalence | Equivalent-forms method | Give two forms of a
test to the same group in close succession (Pearson r).
3. Measure of stability and equivalence | Test-retest with equivalent forms |
Give two forms of a test to the same group with increased time intervals
between forms (Pearson r).
4. Measure of internal consistency | Kuder-Richardson method | Give a test
once. Score the total test and apply the Kuder-Richardson formula.
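For the equivalence method, the Pearson r between the two forms can be
computed directly; a sketch with hypothetical scores (statistics.correlation
requires Python 3.10+):

    from statistics import correlation

    form_a = [40, 35, 48, 30, 42, 38, 45]  # hypothetical scores on form A
    form_b = [38, 36, 50, 28, 44, 35, 46]  # same students' scores on form B
    print(round(correlation(form_a, form_b), 2))  # Pearson r as the reliability estimate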