ASSESSING PRODUCTIVE AND INTERACTIVE SKILLS
Contents
- Defining productive and interactive language abilities
INTERACTION: the kind of spontaneous exchange found in conversation.
PRODUCTION: planned and rehearsed monologues (lectures or presentations) or written texts such as reports, essays and manuals.
Two general types of speech (applying equally to writing):
- factually oriented talk: conveying information; covers a range of rhetorical functions, including description, narration, instruction and comparison. Ex: 'this is a book about a father and his three daughters'
- evaluative talk: more demanding, as it conveys a stance towards the content; covers rhetorical functions such as explanation, justification and prediction. Ex: 'this is a very moving story because it touches on family relationships'.
SPEAKING vs WRITING
Speakers tend to follow common patterns of interaction; writers follow the sequencing of ideas and the conventions of the type of writing they are engaged in.
Assessment designers need to decide how best to take account of such choices when scoring performance (Fulcher, 2003).
Monitoring and revising
SPEAKING: Speakers monitor and revise their message as they receive feedback from their conversational partners, attending to features such as pronunciation, grammar, formality and audience effects. Example: a non-native speaker struggling to pronounce a word correctly in a conversation notices their own error or receives feedback, then revises their pronunciation and repeats the word correctly.
WRITING: Writers monitor and revise their text before sharing it with others, attending to features such as spelling, grammar, organization, structure and style. Example: a student writes an essay and realizes that they have repeated the same point multiple times. They would then revise their text by rephrasing their points or consolidating them.
+ As Luoma (2004) expressed it, such assessments are concerned with speech, the product of the individual, rather than with talk as a shared social activity.
+ He and Young (1998) argued that the social nature of talk raises questions about the validity of interactive speaking tests: scores are awarded to the individual assessee, but talk is a product of shared interaction.
+ The meaning of any sentence or speech unit is not universal: 'Good evening, madam' means something different said by a hotel receptionist than said by a mother to her teenage daughter -> this can be difficult for language learners to gauge.
- Grice (1975) suggested that we have in mind four conversational maxims or rules of thumb:
1. the maxim of quality: say only what you know to be true;
2. the maxim of quantity: give all the information that the addressee needs (but not more);
3. the maxim of relation: be relevant (to the topic or setting);
4. the maxim of manner: be unambiguous, brief, clear and orderly.
Leech (1983), who was concerned with the rules that govern politeness, also put forward conversational maxims, including tact, generosity and agreement.
The agreement principle: 'Yes. You are so right' is interpreted as being more polite than 'No. You're completely wrong.'
Of course, people do disagree; but in order to maintain politeness they often try to mitigate the effect by being indirect or by apologising: 'I'm sorry, but I don't really follow your point' or 'I'm afraid I'm going to have to disagree with you.'
Purposes for assessment
Error counts:
• Seemed to promise a more objective basis for scoring by deducting points for each error or each sentence containing an error.
• Does not really solve the problem of reliability: judgements about the relative seriousness of errors reintroduce the subjectivity that error counting was intended to eliminate.
• Error counting is limited to mechanical accuracy.
• Error counting rewards very cautious language use: simple sentences could outscore those made up of more ambitious language.
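As a minimal sketch, the deduction rule above can be expressed as follows; the starting score and per-error penalty here are invented for illustration, not taken from the source:

```python
# Minimal sketch of error-count scoring. The starting score and the
# per-error penalty are invented values, not from the source.
def error_count_score(num_errors, start=10.0, penalty=0.5):
    """Deduct a fixed penalty per identified error, floored at zero."""
    return max(0.0, start - penalty * num_errors)
```

An essay with four identified errors would score `error_count_score(4)`, i.e. 8.0; the sketch also makes the criticism visible, since a cautious script with few errors can outscore a more ambitious one.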
SCORING PERFORMANCE
Checklists:
• Direct attention to aspects of performance that are considered important.
• Straightforward to use and interpret.
• Suitable for use in peer- and self-assessment.
Holistic scales
• Award a single score
• The rater consults a set of graded descriptions (see Table 6.4) and considers how far the performance satisfies the stated criteria
• Primary trait scales are holistic scales that have been designed
to accompany a specific task and reflect features of writing or
speaking that are particularly pertinent to that task
Analytic scales
• Award multiple scores to a single script (or spoken performance)
• Require raters to award a number of different scores across a range of categories or criteria.
• Multiple trait scales are analytic scales that have been designed for use with a specific task
• Allow for differential weighting to be applied across categories
o If the assessment suggests that one feature needs more attention than others, criteria that reflect this feature can be more heavily weighted and contribute more to the overall score
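A minimal sketch of differential weighting, assuming four hypothetical criteria scored on 0-5 bands (the criterion names and weights below are illustrative, not from the source):

```python
# Hypothetical analytic-scale criteria and weights (illustrative only).
# 'organization' and 'task_fulfilment' are weighted more heavily here to
# mimic an assessment that emphasizes those features.
WEIGHTS = {
    "grammar": 0.2,
    "vocabulary": 0.2,
    "organization": 0.3,
    "task_fulfilment": 0.3,
}

def overall_score(band_scores):
    """Combine per-criterion band scores into one weighted overall score."""
    return sum(WEIGHTS[criterion] * band_scores[criterion]
               for criterion in WEIGHTS)
```

For example, `overall_score({"grammar": 4, "vocabulary": 3, "organization": 5, "task_fulfilment": 4})` weights the stronger organization band more heavily than the weaker vocabulary band.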
Behavior or ability?
• In addition to the holistic/analytic distinction, scales can also be
categorised according to whether they are more ‘real world’
and ‘behavioural’ in orientation or whether they are more
‘ability/interaction’ and ‘normative’ (Bachman, 1990; Council of
Europe, 2001)
Real world behavioral scales
• Focus on specific task types
• Tend to describe the kinds of language behavior that learners
display when carrying out these tasks.
• Offer more concrete descriptions of the observable behaviors
associated with a level of performance
Ability/interaction scales
• Focus on the test taker
• Describe the abilities or competences that underlie and support language use
• Represent language use in terms of degree or frequency in relation to theoretical categories.
• Sometimes more ability-oriented scales are nested within more behaviorally oriented level definitions. The Graded
Examinations in Spoken English (GESE) developed by Trinity College, London, for example, include tests or grades targeting
12 distinct levels of speaking ability. Each grade of the suite represents a different level of challenge for the test taker.
• There is one overall scale that covers all 12 grades and identifies the behavioral 'communicative skills' associated with each grade:
o Lowest level (grade 1): 'give very short answers to simple questions and requests for information'
o Highest level (grade 12): 'initiate the discussion and actively seek ways in which to engage the examiner in a meaningful exchange of ideas and opinions'
• However, assessees at all levels are rated against the same, more ability-focused, four-band scale:
o Band B: 'the candidate's contributions are generally effective, comprehensible, appropriate and adequately fulfil the task'
o Band D: contributions 'are very limited, lack comprehensibility and appropriacy and, although there is some attempt at the task, this is not fulfilled, even with support' (Trinity College, London, 2010).
• In the metaphor of the high jump suggested by Pollitt (1991), the grade of the examination contributes to the interpretation
of each score.
• The task ‘bar’ is set at a level appropriate to the target level of the test and the interpretation of the scale is adjusted
accordingly.
• Nested systems are popular in schools, where expectations may be lower when students are in their third year than when
they are in their fifth, but the same grading system operates across years
• However, nested scales may convey a dispiriting message to learners who improve their language skills yet receive the same grade each year, because the task 'bar' rises with them
Approaches to developing rating scales
• Taking up a distinction made by North (2000), Fulcher (2003) categorized recent methods of scale development within
two general approaches: intuitive and empirical
Intuitive rating scale development:
• carried out by appointed experts, working either individually or in a group
• prepared according to their intuitions, established practice, a teaching syllabus, a needs analysis or some combination of these
• the resulting scale may be refined over time according to experience or new developments in the field
• includes both more ability-oriented scales and more behaviorally/real-world oriented scales

EBB scales (empirically derived, binary-choice, boundary-definition):
• Fulcher (2003) contrasted intuitive approaches with what he called empirical methods.
• First introduced by Upshur and Turner (1995).
• Performances are divided by expert judges into two categories, stronger and weaker, with reasons given for each decision.
• The justifications made for each division are used in preparing a series of yes/no questions.
• Each choice is simply and concisely worded and narrows down the range of scores.
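In effect, an EBB scale works like a small decision tree: each yes/no boundary question narrows the remaining range of scores. The boundary questions below are invented for illustration, not Upshur and Turner's originals:

```python
# Sketch of an EBB-style rating procedure (hypothetical questions).
# Each binary choice narrows the score range until one band remains.
def ebb_band(communicates_main_ideas, mostly_accurate, wide_range):
    """Walk the binary-choice tree; return a band from 1 (low) to 4 (high)."""
    if not communicates_main_ideas:   # boundary between band 1 and bands 2-4
        return 1
    if not mostly_accurate:           # boundary between band 2 and bands 3-4
        return 2
    return 4 if wide_range else 3     # boundary between bands 3 and 4
```

Because each question is a simple yes/no decision at a band boundary, raters never have to weigh a whole descriptor paragraph at once.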
Scaling descriptors
• Large numbers of descriptions of performance are collected and broken down into their constituent parts so that the developer has a large collection of isolated statements.
o First, decide whether these descriptions are meaningful in classifying learners.
o Then rank the statements according to difficulty.
o Statements of similar difficulty are grouped into sets to form a level, and the level descriptions are arranged into a scale.
o The resulting scales are taken to reflect a consensus view of the characteristics of different levels of language ability.
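The ranking-and-grouping step can be sketched as follows; the descriptor statements, difficulty estimates and level boundaries are all invented, and real projects estimate difficulty statistically rather than by assigning numbers directly:

```python
# Illustrative 'scaling descriptors' procedure: sort isolated statements
# by a (hypothetical) difficulty estimate and group statements of similar
# difficulty into levels. All statements and numbers are invented.
descriptors = [
    ("can give a simple description of daily routines", -1.2),
    ("can narrate a short story", 0.1),
    ("can develop an argument systematically", 1.4),
    ("can qualify opinions precisely", 2.3),
]

def group_into_levels(items, boundaries=(-0.5, 1.0, 2.0)):
    """Assign each descriptor to a level band based on its difficulty."""
    levels = {i: [] for i in range(len(boundaries) + 1)}
    for text, difficulty in sorted(items, key=lambda pair: pair[1]):
        level = sum(difficulty > b for b in boundaries)  # boundaries passed
        levels[level].append(text)
    return levels
```

Each resulting level's list of statements would then be edited into a single level description for the scale.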
• Fulcher et al. (2011, p. 9) objected to such quantitative ‘scaling descriptors’ approaches.
• Both the ordering of descriptors into levels and the categories used have been questioned because they seem
inconsistent with second language acquisition research (Green, 2012a).
• It has also been argued that scales sometimes misrepresent the nature of language in use.
o For example: in the overview of speaking abilities, highly competent interactions between skilled speakers of a
language are rarely consistently ‘complex’ (CEFR, C2: grammatical accuracy scale) with a ‘high degree of
grammatical accuracy’ (CEFR, C1: grammatical accuracy scale), displaying ‘a very broad lexical repertoire’ (CEFR, C2:
vocabulary range scale), but are often quite simple syntactically and may involve a rather restricted range of
vocabulary (Hughes, 2010).
Data-driven scale development
• Such concerns have led other developers to take a third, more qualitative, empirical approach: data-driven scale
development (Fulcher, 1987).
• Involves the systematic analysis of performances by learners (from the population who will take the assessment) on the
relevant tasks.
• Key features that help to discriminate between learners are identified from these sample learner performances.
• According to Fulcher et al. (2011), a major shortcoming of this method is that it tends to generate very detailed level
descriptors that raters find too complex and time consuming to be practical for routine use
• The best methods for rating scale development are said to take advantage of the strengths of a
range of intuitive, quantitative and qualitative approaches.
• The CEFR (Council of Europe, 2001, p. 207) advocated ‘a complementary and cumulative
process’ that brings together all three
Rater training:
• A well-written scale helps to define the test construct for the raters so that they are guided to features they should
attend to in a performance. This can enhance levels of agreement.
• Raters have to be trained to use scales effectively and consistently.
• Quality control procedures are needed to ensure that this happens.
• Can be carried out before a rating session, or at fixed times during a school year.
o Raters look at and discuss a set of previously scored performances to help them understand the scale.
o Then they rate some performances
o As the effect of training can be transient, raters have to be monitored over time to ensure that they remain
consistent
• Where scores have serious consequences for the assessee, it is essential that the results should not depend on the views of a single rater, even if that rater is an experienced expert: each performance should receive at least two independent ratings.
o If the raters are in agreement, that suggests that the score is accurate.
o Where there are minor differences, an average is often taken as the final score.
o If more substantial disagreements occur, a third independent rating should be obtained.
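The double-rating rule can be sketched as follows; the one-band threshold for a "minor" difference is an assumption for illustration, not a value from the source:

```python
# Sketch of double-rating reconciliation. The max_gap threshold is an
# invented value; operational tests set their own tolerance.
def final_score(rater_a, rater_b, max_gap=1.0):
    """Average two ratings, or return None to signal that a third
    independent rating is needed."""
    if abs(rater_a - rater_b) <= max_gap:  # agreement or minor difference
        return (rater_a + rater_b) / 2
    return None                            # substantial disagreement
```

Returning a sentinel rather than a forced average keeps the substantial-disagreement case visible to the administrator who must commission the third rating.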
Using computers to score performance
• Automated essay scoring has already been in use for many years and with advances in speech recognition, automatic
scoring of tests of speaking is also becoming increasingly popular
• Automated scoring systems usually depend on first obtaining large numbers of human ratings.
• Given enough samples of assessee writing (or speech captured through automated speech recognition tools), the
machine comes to associate certain patterns of features with scores at different levels.
o For example, if the first essay the system is given to analyze has 18 words per sentence and is scored as A, but the
second has 16 words and is scored B => the machine would predict that words per sentence might be a useful
feature in telling the difference between an A and a B level essay. According to this hypothesis, the next essay,
which has 18 words per sentence, should score A.
=> If the prediction is right, the model is strengthened; if it is wrong, the machine makes adjustments and reduces the importance given to sentence length.
• No automated scoring system could actually rely on just one or two features and some take account of hundreds or even
thousands of features.
• Once the machine has developed a model that works well on the samples it has been given, it can then be used to score
performances that have not previously been rated by humans.
• At present, automated scorers have to be retrained every time a new prompt is used.
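The training process described above amounts to fitting a statistical model of human scores from text features. A toy one-feature version is sketched below; all data are invented, and real systems combine hundreds of features and train on far more than four essays:

```python
# Toy feature-based automated scoring: fit a least-squares line relating
# one feature (words per sentence) to human scores, then predict scores
# for unrated essays. All numbers are invented for illustration.
def fit_line(features, scores):
    """Return (slope, intercept) of the least-squares fit."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, scores))
    var = sum((x - mean_x) ** 2 for x in features)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# words-per-sentence and human scores (1-5 scale) for four rated essays
slope, intercept = fit_line([10, 12, 16, 18], [2, 3, 4, 5])

def predict(words_per_sentence):
    """Score an unrated essay from its words-per-sentence feature."""
    return slope * words_per_sentence + intercept
```

Each new human-rated essay would be added to the training data and the line refitted, which is the strengthening-and-adjusting loop the text describes.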
Score reporting and feedback
• Descriptive reporting of speaking and writing performance is relatively straightforward: we can observe directly how the assessee responds to the assessment tasks.
• Alderson (1991) raised the objection that the kinds of scales employed by trained assessors may be too technical or detailed to communicate effectively to non-specialists, which motivates the use of adapted, user-oriented scales.
FEEDBACK
• In classroom assessment, feedback is immediate and targeted, while grading and scoring can be delayed and shifted to the background in favour of informative, descriptive commentaries and teacher-student conferencing.
• A wider range of task types can be used in classroom assessment, as there are fewer constraints on standardization and time limitations.
Sinclair and Coulthard (1975) described classroom interaction as following an I-R-F pattern:
+ I is for initiation – the teacher asks a question or calls on a student to speak: 'What colour eyes does Jane have?'
+ R is for response – a student answers the question: 'Mmm. She has blue eyes.'
+ F is for feedback – the teacher accepts or corrects the learner's response with an evaluative comment: positive ('Good. Yes, she has blue eyes.') or negative ('No, she has brown eyes.').