ASSESSING PRODUCTIVE AND INTERACTIVE SKILLS
Contents
- Defining productive and interactive language abilities
INTERACTION: the kind of spontaneous exchange found in conversation.
PRODUCTION: planned and rehearsed monologues (lectures or presentations) or written texts such as reports, essays and manuals.
Two general types of speech (applying equally to writing):
- factually oriented talk: conveying information; covers a range of rhetorical functions, including description, narration, instruction and comparison. Ex: 'this is a book about a father and his three daughters'
- evaluative talk: more demanding, as it conveys a stance towards the content; covers rhetorical functions such as explanation, justification and prediction. Ex: 'this is a very moving story because it touches on family relationships'.
SPEAKING vs WRITING
Speakers tend to follow common patterns of interaction; writers follow the sequencing of ideas and the conventions of the type of writing they are engaged in.
Assessment designers need to decide how best to take account of such choices when scoring performance (Fulcher, 2003).
Monitoring and revising
SPEAKING: Speakers monitor and revise their message as they receive feedback from their conversational partners, attending to features such as pronunciation, grammar, formality and audience effects. Example: a non-native speaker struggling to pronounce a word correctly in a conversation notices their own error or receives feedback, then revises their pronunciation and repeats the word correctly.
WRITING: Writers monitor and revise their text before sharing it with others, attending to features such as spelling, grammar, organization, structure and style. Example: a student writes an essay and realizes that they have repeated the same point multiple times. They would then revise their text by rephrasing their points or consolidating them.
+ As Luoma (2004) expressed it, such assessments are concerned with speech, the product of the individual, rather than with talk as a shared social activity.
+ He and Young (1998) argued that the social nature of talk raises questions about the validity of interactive speaking tests: scores are awarded to the individual assessee, but talk is a product of shared interaction.
+ The meaning of any sentence or speech unit is not universal: 'Good evening, madam' means something different said by a hotel receptionist than said by a mother to her teenage daughter -> this can be difficult for language learners to gauge.
- Grice (1975) suggested that we have in mind four conversational maxims or rules of thumb:
1. the maxim of quality: say only what you know to be true;
2. the maxim of quantity: give all the information that the addressee needs (but not more);
3. the maxim of relation: be relevant (to the topic or setting);
4. the maxim of manner: be unambiguous, brief, clear and orderly.
Leech (1983), who was concerned with the rules that govern politeness, also put forward conversational maxims, including tact, generosity and agreement.
The agreement principle: 'Yes. You are so right' is interpreted as being more polite than 'No. You're completely wrong.'
Of course, people do disagree; but in order to maintain politeness they often try to mitigate the effect by being indirect or by apologising: 'I'm sorry, but I don't really follow your point' or 'I'm afraid I'm going to have to disagree with you.'
Purposes for assessment
Error counts:
• Seemed to promise a more objective basis for scoring by deducting points for each error or each sentence containing an error.
• Does not really solve the problem of reliability: judgements about the relative seriousness of errors reintroduce the subjectivity that error counting was intended to eliminate.
• Error counting is limited to mechanical accuracy.
• Error counting rewards very cautious language use: simple sentences could outscore those made up of more ambitious language.
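As a minimal sketch, the deduction rule above can be expressed as follows; the starting score and per-error penalty here are invented for illustration, not taken from the source:

```python
# Minimal sketch of error-count scoring. The starting score and the
# per-error penalty are invented values, not from the source.
def error_count_score(num_errors, start=10.0, penalty=0.5):
    """Deduct a fixed penalty per identified error, floored at zero."""
    return max(0.0, start - penalty * num_errors)
```

An essay with four identified errors would score `error_count_score(4)`, i.e. 8.0; the sketch also makes the criticism visible, since a cautious script with few errors can outscore a more ambitious one.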
SCORING PERFORMANCE
Checklists:
• Direct attention to aspects of performance that are considered important.
• Straightforward to use and interpret.
• Suitable for use in peer- and self-assessment.
Holistic scales
• Award a single score
• The rater consults a set of graded descriptions (see Table 6.4) and considers how far the performance satisfies the stated criteria
• Primary trait scales are holistic scales that have been designed
to accompany a specific task and reflect features of writing or
speaking that are particularly pertinent to that task
Analytic scales
• Award multiple scores to a single script (or spoken performance)
• Require raters to award a number of different scores across a range of categories or criteria.
• Multiple trait scales are analytic scales that have been designed for use with a specific task
• Allow for differential weighting to be applied across categories
o If the assessment suggests that one feature needs more attention than others, criteria that reflect this feature can be more heavily weighted and contribute more to the overall score
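A minimal sketch of differential weighting, assuming four hypothetical criteria scored on 0-5 bands (the criterion names and weights below are illustrative, not from the source):

```python
# Hypothetical analytic-scale criteria and weights (illustrative only).
# 'organization' and 'task_fulfilment' are weighted more heavily here to
# mimic an assessment that emphasizes those features.
WEIGHTS = {
    "grammar": 0.2,
    "vocabulary": 0.2,
    "organization": 0.3,
    "task_fulfilment": 0.3,
}

def overall_score(band_scores):
    """Combine per-criterion band scores into one weighted overall score."""
    return sum(WEIGHTS[criterion] * band_scores[criterion]
               for criterion in WEIGHTS)
```

For example, `overall_score({"grammar": 4, "vocabulary": 3, "organization": 5, "task_fulfilment": 4})` weights the stronger organization band more heavily than the weaker vocabulary band.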
Behavior or ability?
• In addition to the holistic/analytic distinction, scales can also be
categorised according to whether they are more ‘real world’
and ‘behavioural’ in orientation or whether they are more
‘ability/interaction’ and ‘normative’ (Bachman, 1990; Council of
Europe, 2001)
Real world behavioral scales
• Focus on specific task types
• Tend to describe the kinds of language behavior that learners
display when carrying out these tasks.
• Offer more concrete descriptions of the observable behaviors
associated with a level of performance
Ability/interaction scales
• Focus on the test taker
• Describe the abilities or competences that underlie and support language use
• Represent language use in terms of degree or frequency in relation to theoretical categories.
• Sometimes more ability-oriented scales are nested within more behaviorally oriented level definitions. The Graded
Examinations in Spoken English (GESE) developed by Trinity College, London, for example, include tests or grades targeting
12 distinct levels of speaking ability. Each grade of the suite represents a different level of challenge for the test taker.
• There is one overall scale that covers all 12 grades and identifies the behavioral 'communicative skills' associated with each grade:
o Lowest level (grade 1): 'give very short answers to simple questions and requests for information'
o Highest level (grade 12): 'initiate the discussion and actively seek ways in which to engage the examiner in a meaningful exchange of ideas and opinions'
• However, assessees at all levels are rated against the same, more ability-focused, four-band scale:
o Band B: 'the candidate's contributions are generally effective, comprehensible, appropriate and adequately fulfil the task'
o Band D: contributions 'are very limited, lack comprehensibility and appropriacy and, although there is some attempt at the task, this is not fulfilled, even with support' (Trinity College, London, 2010).
• In the metaphor of the high jump suggested by Pollitt (1991), the grade of the examination contributes to the interpretation
of each score.
• The task ‘bar’ is set at a level appropriate to the target level of the test and the interpretation of the scale is adjusted
accordingly.
• Nested systems are popular in schools, where expectations may be lower when students are in their third year than when
they are in their fifth, but the same grading system operates across years
• However, nested scales may convey a dispiriting message to learners who improve their language skills yet receive the same grade each year, because the task 'bar' rises with them
Approaches to developing rating scales
• Taking up a distinction made by North (2000), Fulcher (2003) categorized recent methods of scale development within
two general approaches: intuitive and empirical
Intuitive rating scale development:
• carried out by appointed experts, working either individually or in a group
• prepared according to their intuitions, established practice, a teaching syllabus, a needs analysis or some combination of these
• the resulting scale may be refined over time according to experience or new developments in the field
• includes both more ability-oriented scales and more behaviorally/real-world oriented scales

EBB scales (empirically derived, binary-choice, boundary-definition):
• Fulcher (2003) contrasted intuitive approaches with what he called empirical methods.
• First introduced by Upshur and Turner (1995).
• Performances are divided by expert judges into two categories, stronger and weaker, with reasons given for each decision.
• The justifications made for each division are used in preparing a series of yes/no questions.
• Each choice is simply and concisely worded and narrows down the range of scores.
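In effect, an EBB scale works like a small decision tree: each yes/no boundary question narrows the remaining range of scores. The boundary questions below are invented for illustration, not Upshur and Turner's originals:

```python
# Sketch of an EBB-style rating procedure (hypothetical questions).
# Each binary choice narrows the score range until one band remains.
def ebb_band(communicates_main_ideas, mostly_accurate, wide_range):
    """Walk the binary-choice tree; return a band from 1 (low) to 4 (high)."""
    if not communicates_main_ideas:   # boundary between band 1 and bands 2-4
        return 1
    if not mostly_accurate:           # boundary between band 2 and bands 3-4
        return 2
    return 4 if wide_range else 3     # boundary between bands 3 and 4
```

Because each question is a simple yes/no decision at a band boundary, raters never have to weigh a whole descriptor paragraph at once.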
Scaling descriptors
• Large numbers of descriptions of performance are collected and broken down into their constituent parts so that the developer has a large collection of isolated statements.
o First, decide whether these descriptions are meaningful in classifying learners.
o Then rank the statements according to difficulty.
o Statements of similar difficulty are grouped into sets to form a level, and the level descriptions are arranged into a scale.
o The resulting scales are taken to reflect a consensus view of the characteristics of different levels of language ability.
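The ranking-and-grouping step can be sketched as follows; the descriptor statements, difficulty estimates and level boundaries are all invented, and real projects estimate difficulty statistically rather than by assigning numbers directly:

```python
# Illustrative 'scaling descriptors' procedure: sort isolated statements
# by a (hypothetical) difficulty estimate and group statements of similar
# difficulty into levels. All statements and numbers are invented.
descriptors = [
    ("can give a simple description of daily routines", -1.2),
    ("can narrate a short story", 0.1),
    ("can develop an argument systematically", 1.4),
    ("can qualify opinions precisely", 2.3),
]

def group_into_levels(items, boundaries=(-0.5, 1.0, 2.0)):
    """Assign each descriptor to a level band based on its difficulty."""
    levels = {i: [] for i in range(len(boundaries) + 1)}
    for text, difficulty in sorted(items, key=lambda pair: pair[1]):
        level = sum(difficulty > b for b in boundaries)  # boundaries passed
        levels[level].append(text)
    return levels
```

Each resulting level's list of statements would then be edited into a single level description for the scale.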
• Fulcher et al. (2011, p. 9) objected to such quantitative ‘scaling descriptors’ approaches.
• Both the ordering of descriptors into levels and the categories used have been questioned because they seem
inconsistent with second language acquisition research (Green, 2012a).
• It has also been argued that scales sometimes misrepresent the nature of language in use.
o For example: in the overview of speaking abilities, highly competent interactions between skilled speakers of a
language are rarely consistently ‘complex’ (CEFR, C2: grammatical accuracy scale) with a ‘high degree of
grammatical accuracy’ (CEFR, C1: grammatical accuracy scale), displaying ‘a very broad lexical repertoire’ (CEFR, C2:
vocabulary range scale), but are often quite simple syntactically and may involve a rather restricted range of
vocabulary (Hughes, 2010).
Data-driven scale development
• Such concerns have led other developers to take a third, more qualitative, empirical approach: data-driven scale
development (Fulcher, 1987).
• Involves the systematic analysis of performances by learners (from the population who will take the assessment) on the
relevant tasks.
• Key features that help to discriminate between learners are identified from these sample learner performances.
• According to Fulcher et al. (2011), a major shortcoming of this method is that it tends to generate very detailed level
descriptors that raters find too complex and time consuming to be practical for routine use
• The best methods for rating scale development are said to take advantage of the strengths of a
range of intuitive, quantitative and qualitative approaches.
• The CEFR (Council of Europe, 2001, p. 207) advocated ‘a complementary and cumulative
process’ that brings together all three
Rater training:
• A well-written scale helps to define the test construct for the raters so that they are guided to features they should
attend to in a performance. This can enhance levels of agreement.
• Raters have to be trained to use scales effectively and consistently.
• Quality control procedures are needed to ensure that this happens.
• Can be carried out before a rating session, or at fixed times during a school year.
o Raters look at and discuss a set of previously scored performances to help them understand the scale.
o Then they rate some performances
o As the effect of training can be transient, raters have to be monitored over time to ensure that they remain
consistent
• Where scores have serious consequences for the assessee, it is essential that the results should not depend on the views of a single rater, even if that rater is an experienced expert: each performance should receive at least two independent ratings.
o If the raters are in agreement, that suggests that the score is accurate.
o Where there are minor differences, an average is often taken as the final score.
o If more substantial disagreements occur, a third independent rating should be obtained.
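The double-rating rule can be sketched as follows; the one-band threshold for a "minor" difference is an assumption for illustration, not a value from the source:

```python
# Sketch of double-rating reconciliation. The max_gap threshold is an
# invented value; operational tests set their own tolerance.
def final_score(rater_a, rater_b, max_gap=1.0):
    """Average two ratings, or return None to signal that a third
    independent rating is needed."""
    if abs(rater_a - rater_b) <= max_gap:  # agreement or minor difference
        return (rater_a + rater_b) / 2
    return None                            # substantial disagreement
```

Returning a sentinel rather than a forced average keeps the substantial-disagreement case visible to the administrator who must commission the third rating.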
Using computers to score performance
• Automated essay scoring has already been in use for many years and with advances in speech recognition, automatic
scoring of tests of speaking is also becoming increasingly popular
• Automated scoring systems usually depend on first obtaining large numbers of human ratings.
• Given enough samples of assessee writing (or speech captured through automated speech recognition tools), the
machine comes to associate certain patterns of features with scores at different levels.
o For example, if the first essay the system is given to analyze has 18 words per sentence and is scored as A, but the
second has 16 words and is scored B => the machine would predict that words per sentence might be a useful
feature in telling the difference between an A and a B level essay. According to this hypothesis, the next essay,
which has 18 words per sentence, should score A.
=> If the prediction is right, the model is strengthened; if it is wrong, the machine makes adjustments and reduces the importance given to sentence length.
• No automated scoring system could actually rely on just one or two features and some take account of hundreds or even
thousands of features.
• Once the machine has developed a model that works well on the samples it has been given, it can then be used to score
performances that have not previously been rated by humans.
• At present, automated scorers have to be retrained every time a new prompt is used.
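The training process described above amounts to fitting a statistical model of human scores from text features. A toy one-feature version is sketched below; all data are invented, and real systems combine hundreds of features and train on far more than four essays:

```python
# Toy feature-based automated scoring: fit a least-squares line relating
# one feature (words per sentence) to human scores, then predict scores
# for unrated essays. All numbers are invented for illustration.
def fit_line(features, scores):
    """Return (slope, intercept) of the least-squares fit."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, scores))
    var = sum((x - mean_x) ** 2 for x in features)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# words-per-sentence and human scores (1-5 scale) for four rated essays
slope, intercept = fit_line([10, 12, 16, 18], [2, 3, 4, 5])

def predict(words_per_sentence):
    """Score an unrated essay from its words-per-sentence feature."""
    return slope * words_per_sentence + intercept
```

Each new human-rated essay would be added to the training data and the line refitted, which is the strengthening-and-adjusting loop the text describes.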
Score reporting and feedback
• Descriptive reporting of speaking and writing performance is relatively straightforward: we can observe directly how the assessee responds to the assessment tasks.
• Alderson (1991) raised the objection that the kinds of scales employed by trained assessors may be too technical or detailed to communicate effectively to non-specialists, which motivates the use of adapted, user-oriented scales.
FEEDBACK
• In classroom assessment, feedback is immediate and targeted, while grading and scoring can be delayed and shifted to the background in favour of informative, descriptive commentaries and teacher-student conferencing.
• A wider range of task types can be used in classroom assessment, as there are fewer constraints on standardization and time limitations.
Sinclair and Coulthard (1975) described classroom interaction as following an I-R-F pattern:
+ I is for initiation – the teacher asks a question or calls on a student to speak: 'What colour eyes does Jane have?'
+ R is for response – a student answers the question: 'Mmm. She has blue eyes.'
+ F is for feedback – the teacher accepts or corrects the learner's response with an evaluative comment: positive ('Good. Yes, she has blue eyes.') or negative ('No, she has brown eyes.').