
EDE 206

Language and
Literature
Assessment
SCORING
LANGUAGE
TESTS
RICKY C. SEDILLO JR., LPT
Presenter
Scoring Items
• Scoring is the first step in processing test results: the process of converting the answers to the test questions into numbers.
• Scoring is the act of quantifying the answers given by a test taker in a learning outcome test.
• In language testing, a construct of great interest has always been the ability to recognise logical relations between clauses, such as those indicating situation–problem–response–evaluation patterns (Hoey, 1983).

One way of testing this is to use a multiple-choice
approach, as in the following example from Fulcher
(1998b):
Most human beings are curious. Not, I mean, in the sense that they are odd, but in the sense that they want to find out about the world around them, and about their own part in the world.

1a. But they cannot do this easily.
1b. They therefore ask questions, they wonder, they speculate.
1c. Or, on the other hand, they may wish to ask many questions.

What they want to find out may be quite simple things: What lies beyond that range of hills? Or they may be rather more complicated inquiries: How does grass grow? Or they may be more puzzling inquiries still: What is the purpose of life? What is the ultimate nature of truth? To the first question the answer may be obtained by going and seeing. The answer to the next question will not be so easy to find, but the method will be essentially the same.

2a. So, he is forced to observe life as he sees it.
2b. Although, often, it may not be the same.
2c. It is the method of the scientist.
• You may wish to spend some time looking at this particular task to identify problems, and may also wish to reverse engineer a specification and attempt to improve it.
• Many testers prefer to use a sequencing item type rather than multiple choice, in which the learners have to reconstruct the original sequence of sentences in a text, as in the following example (Alderson et al., 2000: 425).

Below is a short story that happened recently. The order of the
five sentences of the story has been changed. Your task is to
number the sentences to show the correct order: 1, 2, 3, 4 and
5. Put the number on the line. The first one has been done for
you.
_____ (a) She said she’d taken the computer out of the box,
plugged it in, and sat there for 20 minutes waiting for something
to happen.
__1__ (b) A technician at Compaq Computers told of a frantic
call he received on the help line.
_____ (c) The woman replied, ‘What power switch?’
_____ (d) It was from a woman whose new computer simply
wouldn’t work.
_____ (e) The tech guy asked her what happened when she
pressed the power switch.

•Assuming that a mark is given for each sentence
placed in the correct slot, the most obvious scoring
problem with an item like this is that, once one
sentence is placed in the wrong slot, not only is
that sentence incorrect, but the slot for another
correct answer is filled.
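
To see the knock-on effect concretely, here is a minimal sketch in Python. The sentence labels follow the story above, but the scoring functions and the alternative "adjacent pairs" method are illustrative assumptions, not taken from Alderson et al.:

```python
def score_exact(key, response):
    """One mark per sentence placed in exactly the right slot."""
    return sum(1 for k, r in zip(key, response) if k == r)

def score_adjacent_pairs(key, response):
    """One mark per adjacent pair kept in the correct relative order;
    this credits local order even after a misplacement."""
    correct_pairs = set(zip(key, key[1:]))
    return sum(1 for pair in zip(response, response[1:]) if pair in correct_pairs)

key = ["b", "d", "a", "e", "c"]       # the intended order of the story
response = ["b", "a", "e", "c", "d"]  # (d) misplaced; everything after shifts

print(score_exact(key, response))           # 1 -- one misplacement costs four marks
print(score_adjacent_pairs(key, response))  # 2 -- pairs (a,e) and (e,c) still earn credit
```

Under exact-slot scoring a single misplacement can cascade through the whole item, which is exactly the problem described above; pair-based scoring is one way a tester might soften that penalty.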

SCORABILITY

• Scorability means that the test should be easy to score: directions for scoring should be clearly stated in the instructions. Provide the students with an answer sheet, and give the answer key to whoever will check the test.
• Scorability is highly desirable, whether tests are delivered by paper and pencil or on computer.

• Even with closed response items, Lado saw that if the answers are ‘scattered in the pages’, the time taken to score a test is extended, and the chances of making errors when marking and transferring results to a mark book increase.
• He therefore recommended the use of separate answer sheets upon which test takers could record their responses. Scoring speeds up significantly, and errors are reduced.
• The most commonly used keys are stencils that enable the scorer to see whether a response is correct or incorrect without having to read the question.
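
In computerised form, a stencil key reduces to a simple lookup: the scorer (or a script) compares each recorded response against the key without ever reading the questions. A minimal sketch in Python, with invented data:

```python
# Hypothetical answer key and one test taker's answer sheet.
answer_key   = {1: "c", 2: "a", 3: "d", 4: "b"}
answer_sheet = {1: "c", 2: "b", 3: "d", 4: "b"}

# Like a stencil: compare positions against the key, nothing else.
score = sum(1 for item, correct in answer_key.items()
            if answer_sheet.get(item) == correct)
print(f"{score}/{len(answer_key)}")  # 3/4
```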
•Even today, many teachers construct
‘templates’ to score tests much more quickly.
•From the very earliest days, a further perceived
advantage of scoring closed response items
was cost. Of course, increasing the speed of
marking through the use of stencils reduced
cost, but using clerical or other untrained staff
also reduced personnel costs.
•With the emphasis on speed, cost and
accuracy, there were many ingenious attempts
to make scoring easier. Lado (1961: 365–366)
mentions two of these.
The first was the Clapp-Young Self-Marking test
sheets.
•This consisted of two pieces of paper sealed
together with a sheet of carbon paper between
them.
•When the test taker marks an answer on the sheet, it is printed on the second sheet, which already has the correct answers circled.
•The marker separates the sheets and counts off
the correct answers from the second sheet of
paper.
A second method was the punch-pad self-scoring device.
• The test taker removed a perforated dot from the answer sheet; if the response was correct, a red dot was revealed below.

• Not surprisingly, computer-based testing has become exceptionally popular.
• Computers are capable of delivering tests efficiently, and can produce immediate scores for both the test takers and the score users.
• What matters most in terms of how computer-based tests are scored is the relationship between the response of the test taker to the items, and the reaction of the computer to the response.
• There are three basic options:
Linear Tests
• A linear test is an exam that is administered and monitored by proctoring software in which you are unable to move backward in the exam to change prior answers.
• In a linear test the test takers are presented with items in a set sequence, just as if they were encountering them in pencil-and-paper format.
Branching Tests
• Branching enables authors to deliver dynamic assessments that adapt to each individual learner, providing them with a more personalized learning experience.
• Using a pre-configured question path of items, students are dynamically guided through the assessment, with each succeeding branch based on the student’s answer to the previous question.
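
As an illustration only (the item identifiers and branch structure below are invented, not drawn from any particular authoring tool), a pre-configured question path can be modelled as a small graph in which each answer names the next item:

```python
# Each item maps every possible answer to the next item to present.
path = {
    "q1":        {"a": "q2_easier", "b": "q2_harder"},
    "q2_easier": {"a": "end", "b": "end"},
    "q2_harder": {"a": "end", "b": "end"},
}

answers = ["b", "a"]        # a simulated student's responses
item, route = "q1", []
for answer in answers:
    route.append(item)
    item = path[item][answer]   # the previous answer chooses the branch
print(route)                    # ['q1', 'q2_harder']
```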
Adaptive Tests
• In these tests the computer estimates the ability of the test taker after they respond to each individual item. If a test taker answers an item correctly, the computer selects a more difficult item. If the test taker answers an item incorrectly, it selects an easier item.
• This means that no two test takers are likely to face the same set of items, assuming that the item pool is large enough.
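
A heavily simplified sketch of the adaptive idea follows. Operational adaptive tests estimate ability with item response theory; the shrinking step-size update and the item pool here are invented stand-ins that only show the select-harder/select-easier loop:

```python
import random

# Hypothetical item pool: item IDs keyed by difficulty on an arbitrary scale.
pool = {d / 2: f"item-{i}" for i, d in enumerate(range(-6, 7))}  # -3.0 .. +3.0

ability, step = 0.0, 1.0
for _ in range(5):
    # Present the unused item whose difficulty is closest to the estimate.
    difficulty = min(pool, key=lambda d: abs(d - ability))
    pool.pop(difficulty)
    correct = random.random() < 0.5           # simulated response
    ability += step if correct else -step     # harder if right, easier if wrong
    step *= 0.7                               # let the estimate settle
    print(f"difficulty={difficulty:+.1f} correct={correct} ability={ability:+.2f}")
```

Because each selection depends on the running estimate, two test takers with different response patterns will see different items, which is why a large item pool matters.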
Scoring Constructed Response Tasks
• If scoring closed-response items seems to be problematic, the situation becomes more complex when we turn to constructed responses, such as extended writing tasks, or speaking.
• Assessing writing or speech is normally done by a rater, using a rating scale.
• As Weigle (2002: 108) says, the two things with which we are most concerned are defining the rating scale, and training the raters how to use it.
• Rating scales are traditionally constructed from a set of fairly arbitrary levels, each level being defined by a descriptor or rubric.
• Hamp-Lyons (1991) classifies all rating scales into one of three possible types, as follows.
Holistic Scales
• A single score is awarded, which reflects the overall quality of the performance.
• Holistic scales are generally fairly easy to use, and with extensive training high levels of inter-rater reliability can be achieved.
Strong Performance: Writing has a clear focus and engages the reader in the opening lines. Information is accurate. Transitions help the reader move smoothly from one idea to another. Any errors in structures and/or spelling are minor and infrequent; they do not interfere with communication.

Meets Expectations: Writing has a clear opening statement and a logical sequence of ideas. The information is accurate. Any errors in structures and/or spelling are minimal and do not interfere with communication.

Approaching Expectations: Writing includes a purpose for reading in the opening paragraph. The information is accurate. Supporting ideas follow the opening paragraph. Errors in structures and/or spelling may at times distract from the message.
Primary Trait Scales
• A single score is awarded, but the descriptors are developed for each individual prompt (or question) that is used in the test.
• The primary trait rating scale reflects the specific qualities expected in writing samples at a number of levels on the scale.
PERSUADING AN AUDIENCE

1 Fails to persuade the audience.
2 Attempts to persuade but does not provide sufficient support.
3 Presents a somewhat persuasive argument but without consistent development and support.
4 Develops a persuasive argument that is well developed and supported.
Multiple Trait Scoring
• Unlike the two scale types already mentioned,
multiple trait scoring requires raters to award
two or more scores for different features or traits
of the speech or writing sample.
• The traits are normally prompt or prompt-type
specific, as in primary trait scoring.

Time on task (10 9 8 7 6 5 4 3 2 1)
• Excellent: The group forms immediately to work on the activity until the teacher indicates otherwise; if the group finishes early, members discuss topics related to the target language (TL).
• Average: The group forms fairly soon to work mostly on the activity until the teacher indicates otherwise; if the group finishes early, members are either silent or discuss topics not related to the TL.
• Needs Improvement: The group takes a long time to form and does not work on the activity (unless the teacher walks by); if the group finishes early, members discuss topics not related to the TL.

Participation (5 4 3 2 1)
• Excellent: All group members participate equally throughout the entire activity.
• Average: All group members but one participate equally throughout the activity.
• Needs Improvement: More than one group member does not participate equally throughout the activity.

Group Cooperation (10 9 8 7 6 5 4 3 2 1)
• Excellent: All members cooperate to help each other learn; if anyone has been absent, the group helps him/her; no one acts ‘superior’.
• Average: Most members cooperate to help each other learn; if anyone has been absent, the group sometimes helps him/her; no one acts ‘superior’.
• Needs Improvement: Members do not cooperate to help each other learn; if anyone has been absent, the group does not help; some members act ‘superior’.

Use of Target Language (5 4 3 2 1)
• Excellent: Members use as much TL as possible (also to greet and say farewells).
• Average: Members use some TL during the activity (also to greet and say farewells).
• Needs Improvement: Members rarely use TL during the activity (they neither greet nor say farewells).
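
Multiple trait scoring yields a profile rather than a single number. How the trait scores are reported or combined is a separate decision; the sketch below simply totals them against the maxima of the scale above (the individual ratings are hypothetical):

```python
# Maxima from the scale above; one rater's scores for one group.
max_points = {"time_on_task": 10, "participation": 5,
              "cooperation": 10, "target_language": 5}
scores     = {"time_on_task": 8,  "participation": 4,
              "cooperation": 9,  "target_language": 3}

for trait, score in scores.items():
    print(f"{trait}: {score}/{max_points[trait]}")   # keep the profile visible
print(f"total: {sum(scores.values())}/{sum(max_points.values())}")  # total: 24/30
```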
AUTOMATED
SCORING

• ‘The examiner who is conscientious hesitates, wonders if this response is as good as another he considered good, if he is being too easy or too harsh in his scoring.’
• As part of the relentless drive to use technology to improve scorability, recent decades have seen a growing interest in scoring speaking and writing automatically (Wresch, 1993).
e-rater
•The software is capable of analysing syntactic
features of the essay, word and text length, and
vocabulary.

PhonePass
• The kinds of tasks utilised by computer-scored speaking tests include reading sentences aloud, repeating sentences, providing antonyms for words, and uttering short responses to questions.
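
Neither e-rater's nor PhonePass's internals are described here, but the surface features mentioned for e-rater (word and text length, vocabulary) are easy to illustrate. A toy feature extractor, in no way e-rater's actual method:

```python
import re

def essay_features(text: str) -> dict:
    """Toy surface features of the kind automated essay scorers draw on."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "text_length": len(words),
        "avg_word_length": sum(map(len, words)) / len(words),
        "type_token_ratio": len(set(words)) / len(words),  # vocabulary variety
    }

print(essay_features("Most human beings are curious about the world around them."))
```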

THANK YOU
FOR
LISTENING!!!
