RITA GREEN
Designing Listening Tests: A Practical Approach
Rita Green has spent many years at the coalface of language test develop-
ment and training in a variety of international contexts; this book is the
sum of this experience. This book is a fantastic resource for anyone look-
ing to develop listening tests: a highly practical, theoretically-grounded
guide for teachers and practitioners everywhere. Green covers a range of
important principles and approaches; one highlight is the introduction
to the textmapping approach to working with sound files. This book is
highly recommended for anyone involved in the development of listen-
ing tests.
Luke Harding, Senior Lecturer, Lancaster University, UK
In other words, the more time that listeners can spend in auto-
matic mode, the less demand there will be on their working memories
(Baddeley 2003; Field 2013). This, in turn, means that in the assess-
ment context, the listener will have more working capacity for dealing
with other issues, such as applying what s/he has understood to the
task. Test developers therefore need to think carefully about the degree
of cognitive strain they are placing on test takers when asking them
to process a sound file. Not only do test takers need to cope with the
listening processes discussed above but they also need to manage such
factors as language density, speaker articulation, speed of delivery,
number of voices, accessibility of the topic inter alia, all of which are
likely to contribute to the burden of listening for the second language
listener (see 2.5.1).
people often listen at only 25 per cent of their potential and ignore, forget, dis-
tort, or misunderstand the other 75 per cent. Concentration rises above 25 per
cent if they think that what they are hearing is important and/or they are
interested in it, but it never reaches 100 per cent.
– follow speech which is very slow and carefully articulated, with long pauses for him/her to assimilate meaning. (Overall Listening Comprehension)
– understand instructions addressed carefully and slowly to him/her and follow short simple directions. (Listening to Announcements and Instructions)
the listeners are, the wider the range of different listening behaviours the
tasks should measure in order to avoid construct under-representation.
Secondly, the test developer needs to decide whether the test takers'
listening ability should be measured by means of collaborative tasks,
non-collaborative tasks (Buck 2001) or both. At the collaborative (or
interactional) end of such a continuum, both listening and speaking
abilities would be involved, possibly through some kind of role-play,
problem-solving exercise, conversation, negotiation (for example, busi-
ness or diplomatic context) or transmission (aeronautical context). At
the non-collaborative (non-interactional) end, the listening event might
involve listening to a lecture, an interview or a phone-in. According to
Banerjee and Papageorgiou (2016: 8) large-scale and standardised listen-
ing tests use non-collaborative tasks.
Let's look at some concrete examples. Air traffic controllers (ATC)
need to be able to demonstrate not only good listening skills but also
the ability to interact when communicating with pilots or fellow ATC
colleagues (see ELPAC: English Language Proficiency for Aeronautical
Communication Test). Therefore, an interactional listening task is likely
to have much more validity. In occupational tests, such as those aimed
at civil servants or embassy support staff, where an ability to communi-
cate on the telephone is considered an important skill, the test would
ideally include some interactional tasks (see INTAN's English Language
Proficiency Assessment Test). Although tertiary level students need to dem-
onstrate their ability to take notes during lectures, which would suggest
non-interactional tasks have more cognitive validity, they may also need
to function in small-group contexts involving speaking which would
indicate interactional tasks are also important. In the case of young learn-
ers, it is also likely to be both.
on. In other words, many of the words a speaker produces are redundant:
they simply form part of the packaging and can be ignored
by the listener (see 1.5.1.3). The writer, on the other hand, is often
instructed or feels obliged to make every word count. This has obvi-
ous consequences for the listener when a written text is used as the
basis for a sound file.
Fourthly, due to its temporary nature, the spoken form may contain
more dialect, slang and colloquialisms than the written form. On the
other hand, though, the speaker may well exhibit more personal and
emotional involvement which may aid the listener's comprehension, espe-
cially where there is also visual input.
Fifthly, the discourse structure and signposting used differs across
the two forms. The written form has punctuation, while the spoken
has prosodic cues such as intonation, stress, pauses, volume and speed.
Depending on the characteristics of the speaker's voice, these prosodic
cues can either aid comprehension or hinder it: take, for example, a
speaker who talks very fast or someone who exhibits a limited or unex-
pected intonation pattern.
To summarise, where a sound file contains many of the written charac-
teristics discussed above, this increases the degree of processing required
by the listener. This is because the resulting input is likely to be more
complex in terms of grammatical structures, content words, and length
of utterances; also because it will probably exhibit less redundancy.
While this does not mean that input based on speeches or radio news, for
example, is invalid, careful thought must be given to the purpose of the
test, the test takers' needs and the construct upon which the test is based.
In other words, the test developer needs to ask him/herself whether, in
a real-life listening context, the test population for whom s/he is devel-
oping a test would ever listen to such a rendition. To this end, the test
developer may find it useful to carry out a needs analysis in order to
identify appropriate listening events for the target test population while
developing the test specifications (see 2.5). (See Chafe 1985, and Chafe
and Danielewicz 1987 for a more in-depth discussion of the differences
between the spoken and written word.)
Hi everyone
Er, today we're going to talk about first language acquisition or, to put it more
simply, how children learn their first language. In the first part of the lecture, I
language acquisition. ..
1.5.2.1 Multi-tasking
By now, it should have become clear to the reader why listening is con-
sidered a complex process. In order to be successful, the listener must
identify what the speaker is saying by simultaneously using a proces-
sor (which decodes the incoming message), a lexicon (against which
the words/phrases are matched), and a parser (which produces a mental
idea of what has been said). In addition, the listener is likely to call on
their knowledge of the topic, the speaker and the context while continu-
ously checking how everything fits into the whole picture. Visual input
(see 1.5.3.4) adds yet another dimension.
Given the need for multi-tasking, it is therefore not at all surprising that,
even with native speakers, listening breaks down and the listener must ask
for repetition or clarification if the speaker is present. Indeed it is really
quite amazing that as listeners we manage to do this in our own L1, let
alone that our students can manage this in their second or third languages.
In 1.1 above it was pointed out that the amount of time a listener has to
spend in controlled as opposed to automatic processing mode is likely
to impact quite heavily on how successful their listening will be. If we
then add to this the requirements of a task which
involves reading, and sometimes also writing, we have yet another factor
that the test developer needs to take into account. Too often, the strain of
having to process the sound file in real-time as well as respond to a task
is not fully appreciated, particularly if the tasks have not been through all
the recommended stages of task development (see 1.7).
1.5.3 Input
Based on the discussion so far in this chapter, it will have become clear
that the type of input the listener needs to process plays a major role in
terms of difficulty, and impacts on whether successful comprehension
takes place or not. The degree of success may be influenced by a number
of variables which are discussed below.
1.5.3.1 Content
Research carried out by Révész and Brunfaut (2013) found that input
which contained a higher percentage of content words, as well as a broader
range of words in general, increased the difficulty level for listeners as it
required more cognitive processing. Field (2013: 87) notes that the way
a word sounds when used in context, as opposed to the word being used
in isolation, also impacts on its level of difficulty for second language
listeners. He adds that longer pieces of input place an added burden on
the listener, as s/he has to continually modify the overall picture of what
the speaker is trying to convey.
1.5.3.2 Topic
lead to reliability issues in terms of the resulting test scores (see Buck
2001; Banerjee and Papageorgiou 2016). This is also true of input that
entails a lot of cultural references, as listeners may need to understand
more than the actual language used.
Going into a listening event cold is liable to increase the difficulty
level. Where the topic can be contextualised, listeners are likely to acti-
vate their world knowledge or relevant experiences (schemata) and thus
reduce some of the pressure which their working memories will need
to deal with (Vandergrift 2011). It therefore seems reasonable to argue
that the topic of the sound file be signalled to the listener in the task
instructions (see 4.2). Where this does not happen, it is more than pos-
sible that the first utterance or two of the recording will be lost as the
listener attempts to grapple not only with the unknown topic but also
with the speaker's accent, intonation and speed of delivery as well as the
task itself. In such scenarios, items which are placed at the very beginning
of the sound file are likely to prove particularly difficult to answer.
However, sometimes a test taker's background knowledge of a topic can
have a negative effect (Rukthong 2016). Lynch (2010: 54) points out:
It hardly needs to be said that, all things being equal, a poor quality sound
file is going to be much more difficult to process than one with good
sound quality. While in real life there are occasions when we do have to
cope with the former, it would be unfair to assess a test taker's listening
ability on something that is of poor sound quality unless it can be argued
that this is something the listener would have to do in the real-life listen-
ing context. Even air traffic controllers and pilots, who may well be faced
with such conditions, are able to ask the speaker to repeat the message.
Many test developers (often with their teacher's hat on) feel that sound
files that include background noise are unfair. However, from a realistic point
of view, some type of background noise is nearly always present, be it the
humming of lights, the air conditioner or noise resulting from traffic. The
important issue to remember is that any background noise should be sup-
portive rather than disruptive; in other words, the noise should help the lis-
tener by providing clues as to the context in which the event is taking place.
1.5.4 Task
There are a number of ways in which the task can contribute to the difficulty
experienced by listeners. These include the test method (how much does
the listener need to read and/or write in order to complete the items? Is the
method familiar? Is it appropriate to the type of listening being targeted?);
the wording of the instructions (Do these prepare the test taker for the task
they are to encounter? Do they introduce the topic in a helpful way?); the
example (Has this been included? Does it fulfil its role?); the total number
of items (Is there sufficient redundancy between the items for the listener to
process the input and complete the task before the next item needs answer-
ing?) amongst others. These issues are discussed in more detail in Chapter 4.
The actual physical location where the test takes place can also impact on
the difficulty level of the listening event. Such aspects as the acoustics of
the testing room as well as other conditions such as heat, space, light and
so on, can impact on the test taker and by extension his/her performance
on the test. Venues should be checked the day before field trials and
live administrations to minimise any external factors which might influ-
ence test performance (see 6.2.5).
The speed at which the speaker talks is likely to contribute to the difficulty
level of the input (Lynch 2010; Field 2013). Brunfaut (2016: 102) writes:
Since faster speech gives listeners less time for real-time processing, it has been
proposed that it results in more comprehension difficulties, particularly for less
proficient second language listeners. A number of experimental as well as non-
experimental studies have confirmed this hypothesis.
Many test developers have little idea of how fast people speak on
the sound files they select, and yet this is of crucial importance when
attempting to link a sound file with the appropriate level of ability (see
2.5.1.12). This holds true for the listener's mother tongue as well as for
second languages. According to Wilson (1998), when a sympathetic
speaker talks to a second language listener, not only does s/he uncon-
sciously adapt the content, but the speed of delivery is also spontane-
ously adjusted until the speaker is sure of what the listener can cope
with. He states:
What could be more natural than a native speaker slowing down their rate of
speech and using simplified vocabulary to a foreigner? What could be less natu-
ral than a native speaker talking at full speed to a foreigner and not grading
their language?
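Where a transcript (even a rough one) and the clip's duration are available, the speech rate of a candidate sound file can be estimated in a few lines. The sketch below is purely illustrative and is not part of the procedures described in this book; the function name, the sample figures and the 90-second example are all invented.

```python
# Illustrative sketch: estimate the average speech rate of a sound file
# from a transcript word count and the clip duration in seconds.
# All names and figures here are invented for the example.

def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Return the average speech rate in words per minute."""
    word_count = len(transcript.split())
    return word_count / (duration_seconds / 60)

if __name__ == "__main__":
    sample_transcript = " ".join(["word"] * 240)   # stands in for a 240-word transcript
    print(f"{words_per_minute(sample_transcript, 90):.0f} wpm")  # 240 words in 90 seconds -> 160 wpm
```

A figure obtained in this way can then be noted alongside the other characteristics of the sound file in the test specifications (see 2.5.1), making it easier to compare candidate recordings aimed at the same target level.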
The more voices there are on a sound file, and the more overlap there
is between them, the more difficult it becomes for the second language
listener to discern who is saying what. This is particularly true if more
than one of the voices is female. Both these issues must be taken into
account when determining the difficulty level of a particular sound
file.
1.7 Summary
This chapter has attempted to outline the importance of having a clear
idea of what is involved in assessing listening before any attempt is made
to try to measure the skill. It has also investigated the different types of
listening that we engage in, how the spoken and written language differ
and the impact this can have in terms of successful listening. The issues
which contribute to making listening difficult were also explored as well
as the importance of assessing listening.
The subsequent chapters of this book investigate how we can move from
this rather abstract concept of what listening involves to the somewhat
more concrete manifestation of a listening task. Each chapter discusses
one or more of the various stages a task should go through before it
can be used in a live test administration. Figure 1.2 illustrates the stages
which occur within this task development cycle:
stage will go forward to the field trial (Stage 6a); those which do not must
be dropped (Stage 6b). Inevitably, not every task will be successful, particu-
larly in the early stages of test developer training; this is one of the lessons
that both reviewers and test developers have to learn to accept.
The next stage in the task development cycle is the field trial (Stage 6a, see
Chapter 6). Prior to the trial taking place, some test developers may also be
involved in task selection for the trial test booklets (see 6.2.4) while others may
have the opportunity to take part in administering the trial, perhaps within
their own school or workplace. Invaluable insights come from the experience
of watching test takers respond to their own and/or their colleagues' tasks.
Wherever possible, test developers should be encouraged to participate in
marking the field trial test papers (Stage 7) as again this will provide useful
feedback concerning how their tasks have performed (see 6.2.6).
Once all the trial papers have been marked, it is time for Stage 8:
statistical analyses. It is strongly recommended that all test developers be
involved in this procedure as it is extremely helpful in explaining how
their tasks have performed and why some have succeeded and others
have failed (see 6.3.1 and Green 2013). In addition, probably for the
first time in the task development cycle, this stage also provides external
perceptions of the tasks in the shape of the test takers' feedback on such
aspects as the sound files, instructions and tasks as well as how the test
was administered (see 6.1.9).
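The statistical analyses referred to at Stage 8 are, at their simplest, classical item statistics. The sketch below is a minimal illustration of how item facility, corrected item-total discrimination and Cronbach's alpha might be computed for a small, dichotomously scored task; it is not the author's procedure, the response data are invented, and operational analyses would normally be run in dedicated software.

```python
# Illustrative sketch only: classical item statistics for a dichotomously
# scored (0/1) listening task. The response matrix below is invented.

def item_statistics(responses):
    """responses: one list of 0/1 item scores per test taker."""
    n_items = len(responses[0])
    totals = [sum(person) for person in responses]

    def mean(xs):
        return sum(xs) / len(xs)

    def variance(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    per_item = []
    for i in range(n_items):
        scores = [person[i] for person in responses]
        facility = mean(scores)                                 # proportion answering correctly
        rest = [t - s for t, s in zip(totals, scores)]          # total score minus this item
        ms, mr = mean(scores), mean(rest)
        cov = sum((s - ms) * (r - mr) for s, r in zip(scores, rest)) / (len(rest) - 1)
        denom = (variance(scores) * variance(rest)) ** 0.5
        discrimination = cov / denom if denom else 0.0          # corrected item-total correlation
        per_item.append((facility, discrimination))

    # Cronbach's alpha as a rough index of internal consistency
    item_var_sum = sum(variance([p[i] for p in responses]) for i in range(n_items))
    alpha = (n_items / (n_items - 1)) * (1 - item_var_sum / variance(totals))
    return per_item, alpha

responses = [                      # five test takers, four items (invented data)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
stats, alpha = item_statistics(responses)
for i, (facility, discrimination) in enumerate(stats, start=1):
    print(f"Item {i}: facility={facility:.2f}, discrimination={discrimination:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")
```

In practice, items showing low or negative discrimination in such an analysis would be among those flagged for revision or dropping at Stage 9.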
Stage 9 entails making one of three decisions concerning each and
every task which has gone through the field trial, based on the outcome
of the statistical analyses (Stage 8). The first option is that the task should
be banked with no changes and go forward to standard setting (see 7.2
and Stage 13) if this procedure is part of the task development cycle. The
second option is that the task should be revised. This is usually due to
some weakness which has come to light during the data analysis stage (see
6.3.2). The third option is that the task should be dropped as it has been
found to be unsalvageable for some particular reason (weak statistics,
negative feedback, inappropriate topic, though the latter should have
been picked up long before the trial). For every task which is dropped, it
is important that the test developers learn something from the exercise;
not to do so would mean a waste of resources.
Stage 9b involves the revision of those tasks which were not banked or
dropped; this stage is similar to that of Stages 3 and 4, as it will involve
some peer review. Once the revised tasks are ready, they move to Stage 10,
which is Trial 2. (Other newly developed tasks can obviously be trialled
at the same time as the revised tasks.)
Stages 11 and 12 are a repeat of Stages 7 and 8, only this time there are
just two options available for those tasks which have already been revised.
These are bank or drop. The decision to drop a task which has been tri-
alled twice, and failed to meet requirements, is a practical one. Trialling,
marking and carrying out statistical analyses are time-consuming and
expensive. One exception some test development teams make is if there has
been a test method change after the first trial; that decision must depend on
the resources you have available. Experience, however, suggests that if a task
does not work after going through all of the above stages, including two
periods of peer review and two trials, it is probably not going to work. This
outcome has to be accepted, and lessons learnt for future task development.
Stage 13 involves submitting those listening tasks which have been
banked, to an external review process known as standard setting (see 7.2)
or to a stakeholder meeting (see 7.3). Not all test development teams
will be able to organise a standard setting session due to the resources
necessary to carry out this process (see 7.2.3-7.2.9), but for those test
developers who are involved in high-stakes testing or national tests, this
is a procedure you should at least be aware of, and preferably be involved
with. Those tasks which receive a green light from the judges in standard
setting are usually deemed eligible for consideration in a live test admin-
istration (Stage 14). Invaluable insights can be gained from the standard
setting procedure which can be fed back into test developer training.
The final stage of the task development cycle entails the writing of the
post-test report and statistical analyses of the live test results (Stage 15).
For reasons of accountability and transparency among others, it is impor-
tant that a post-test report be drawn up after the live test administration.
This should provide information about where and to whom the live test
was administered, as well as including the results of a post-test analysis
of the items and tasks. Although all the tasks which go into the live test
should already have good psychometric properties, it is still important to
analyse how they have performed in a real-test situation. Remember, no
matter how much care has been taken in selecting the trial test popula-
tion (see 6.2.1), the conditions can never be exactly the same. The test
takers who take part in the live test are much more highly motivated than
those who took part in the trial. It is important to verify that the statisti-
cal properties on which the tasks were chosen still hold true: in other
words, that the items still discriminate and contribute positively to the
internal consistency of the test (see 6.3.2.2 and 6.3.2.4). These post-test
insights will be of great benefit for the test developers and their future
task development work which, once the administration of the live test is
over, very often will start once more.
Not everyone reading this book will be able to carry out all of these
stages. In many cases, even where test developers would like to do this,
the challenges and constraints (Buck 2009) of their testing context will
make some stages very difficult to achieve. The important thing is to
attempt to do as many as possible.
DLT Bibliography
Alderson, J.C., Clapham, C., & Wall, D. (1995). Language test construction and
evaluation. Cambridge: CUP.
Baddeley, A. (2003). Working memory: Looking back and looking forward.
Nature Reviews Neuroscience, 4, 829-839.
Banerjee, J., & Papageorgiou, S. (2016). What's in a topic? Exploring the inter-
action between test-taker age and item content in high-stakes testing.
International Journal of Listening, 30 (1-2), 8-24.
Brown, G., & Yule, G. (1983). Teaching the spoken language. Cambridge:
Cambridge University Press.
Brunfaut, T. (2016). Assessing listening. In D.Tsagari & J.Banerjee (Eds.), Handbook
of second language assessment (pp.97-112). Boston: De Gruyter Mouton.
Buck, G. (2001). Assessing listening. Cambridge Language Assessment Series.
Eds. J.C. Alderson and L.F. Bachman. Cambridge: CUP.
Buck, G. (2009). Challenges and constraints in language test development. In
J.Charles Alderson (Ed.), The politics of language education: Individuals and
institutions (pp.166-184). Bristol: Multilingual Matters.
Bygate, M. (1998). Theoretical perspectives on speaking. Annual Review of
Applied Linguistics, 18, 20-42.
Chafe, W.L., & Danielewicz, J. (1987). Properties of spoken and written lan-
guage. In R.Horowitz and S.Jay Samuels (Eds.), pp.83-113.
Fehérváryné, H. K., & Pižorn, K. Alderson, J. C. (Series Ed.). (2005). Into
Europe. Prepare for modern English exams. The listening handbook. Budapest:
Teleki László Foundation.
a number of factors about the test takers. For example, their age, in terms
of the degree of cognitive processing the materials may require; compare
young learners with adult test takers, for instance. Age will also have some
bearing on the type of topics that are chosen. In addition, the test takers'
gender, first language and location should also be taken into account to
ensure that the materials chosen contain no potential sources of bias. For
example, those living in an urban environment may have an advantage
if some of the sound files are based on specific subjects which are not so
familiar to those who live in rural areas.
Put simply, the construct is the theory on which the test is based. To
expand on this a little, if you are designing a listening test, it is the defi-
nition of what listening is in your particular context: for example, an
achievement test for 11-year-olds, a proficiency test for career diplomats
and so on. Once defined, this construct (or theory) has to be transformed
into a test through the identification of appropriate input and the devel-
opment of a suitable task. Clearly, the definition of what listening is will
differ according to the purpose of the test and also the target test popula-
tion. The construct on which a listening test for air traffic controllers is
based, for example, will be quite different from one which would be used
in a test for young learners.
Defining the construct accurately and reliably is arguably one of the
most important responsibilities of test designers. This is because during
the development of the test specifications and tasks, they will need to
collect validity evidence to support their definition of the construct. This
evidence can be of two kinds: the non-empirical type (Henning 1987;
or interpretative argument, Haladyna and Rodriguez 2013); and the
empirical type based on quantitative and qualitative data (see Chapter 6).
The test designers also need to be aware of the two main threats to con-
struct validity: construct under-representation and construct irrelevant
variance. (These terms are discussed below.)
The construct can be based on a number of sources. For example, in
the case of an achievement test, insights can be gained from the curricu-
lum, the syllabus or the national standards. The construct could also be
based on a set of language descriptors such as those found in the Common
European Framework of Reference (CEFR), in the Standardization Agreement
(STANAG) used in the military field or on the descriptors developed by
the International Civil Aviation Organization (ICAO) for use with air
traffic controllers and pilots, to name just a few. A third source might be
the target language situation. In this case, the construct could be based
on a set of descriptors outlining the types of listening behaviour test tak-
ers would need to be able to exhibit in a given context. For example, the
listening skills perceived to be necessary to cope with tertiary level studies
or employment in an L2 context. Finally, the construct could be based
on a mixture of these sources, for example, the school curriculum, the
national standards and the CEFR descriptors.
Figures 2.1 to 2.3 below show extracts from different sets of language
descriptors. Figure 2.1 shows the descriptors for CEFR Listening B2.
(The acronyms at the end of the descriptors represent the names of the
tables from which they have been taken, for example, OLC = Overall
Listening Comprehension.)
Figure 2.2 shows the descriptors pertaining to STANAG Level 1
Elementary.
Figure 2.3 displays the descriptors relevant for assessing a test taker's
listening ability at ICAO Level 4 Operational.
These three sets of descriptors offer test developers useful insights into
the types of listening behaviour expected at those levels, as well as provid-
ing additional information about the conditions under which listening
takes place (part two of the test specifications; see 2.5). For example,
in terms of what the listener is expected to be able to comprehend, the
[Figure 2.1 CEFR Listening B2 (extracts): Can follow extended speech and complex lines of argument provided the topic is reasonably familiar and the direction of the talk is sign-posted by explicit markers (OLC); Can understand most radio documentaries and most other recorded or broadcast material delivered in standard dialect and can identify the speaker's mood, tone etc. (LAMR).]
[Figure 2.2 STANAG Level 1 Elementary (extracts): Can understand common familiar phrases and short simple sentences about everyday needs related to personal and survival areas such as minimum courtesy, travel, and workplace requirements when the communication situation is clear and supported by context; There are many misunderstandings of both the main idea and supporting facts.]
[Figure 2.3 ICAO Level 4 Operational: listening comprehension descriptors.]
descriptors help to define the level the test should measure, and, by extension, what is above and below that level in terms of the
expected construct, topic(s), speaker characteristics and discourse structure.
Unfortunately, language descriptors, as well as other sources such as the
curriculum and the national standards, do not always describe the various
types of listening behaviour in sufficient detail for them to assist in test
design. In such situations, it is useful to add a further set of definitions
which describe the different types of listening behaviour in more practical
terms. Field (2013: 149) supports this approach, saying 'even a simple
mention of listening types using listening for categories or the param-
eters local/global and high attention/low attention might provide useful
indicators'. Such additional descriptors could be added to the test speci-
fications under a separate heading as shown in Figure 2.4 (see also 4.1):
[Figure 2.4: additional listening-behaviour definitions for the test specifications, e.g. listening for gist (constructing a macro-proposition), listening for important details (listening selectively to identify words/phrases), search listening (SL) (listening for words in the same semantic field), and listening for main ideas and supporting details (listening carefully in order to understand explicitly stated ideas).]
2.5.1 Input
2.5.1.1 Source
even talking to bullet points. You should therefore always allow two
or three attempts for the speakers to warm up, so that the recording comes
across as naturally as possible.
Finding readily available listening input is particularly difficult at the
lower level of the ability spectrum. The development of talking points
as the basis for creating sound files, although detracting from cognitive
validity (Field 2013: 110), is one possible solution when simply no other
materials are available. Talking points provide speakers with some sort of
framework within which they can talk about topics which are appropriate
for lower ability levels while at the same time allowing for at least some
degree of spontaneity. The framework should be based on an imaginary
listening context in order to encourage appropriate linguistic features and
not on a written text.
The challenge in developing talking points is to provide just enough
key words for the speakers to produce naturally spoken language while
simultaneously avoiding either a scripted dialogue or a framework which
is too cryptic. Speakers who are asked to work on talking points may
need some initial practice; to help them, it is recommended that the
talking points appear in a table form so that it is clear who says what
when (see Figure 2.5). Once recorded, these can then be textmapped (see
Chapter 3), and a task developed.
[Figure 2.5: talking points laid out as a two-speaker table: 'John, shop?' / 'OK. Need?' / 'Bread, eggs' / 'Eggs …?' / 'Large, small?' / 'Large. Money …']
2.5.1.2 Authenticity
What makes a sound file authentic? This is not an easy question to answer
(see Lewkowicz 1996). A speech given by a high-ranking diplomat which
exhibits many written characteristics is no less authentic than a conversa-
tion which reflects more oral features, such as pauses, hesitations, back-
tracking and redundancies. They are both parts of the oral to written
continuum from which test developers might select their sound file mate-
rials. What makes it more or less authentic is its appropriateness to the
given testing context. For example, using the speech mentioned above as
part of a test for diplomats would carry a lot of cognitive (and face) valid-
ity (even more so if the speech maker is physically present) but this would
not be true if it were used in a test for air traffic controllers. So part of the
authenticity argument has to be the extent to which it relates to the target
test population as well as the purpose of the test.
Let us look at some more examples. Is a sound file exhibiting a range
of non-standard accents authentic? Answer: yes, you would definitely
come across this scenario in a university or joint military exercise context.
Could it be used in testing? Answer: yes, if that is what test takers would
be faced with in the real-life listening context. What about the relation-
ship between authenticity and the speed of delivery? Would a sound file
with two people talking at 180 words per minute be considered authen-
tic? Answer: yes for higher-level listeners, but arguably no for lower-
level ones, as we would not expect someone of that level to be able to
cope with it. All of these examples argue for not divorcing authenticity in
a sound file from the context in which it will be used.
The key question test developers need to ask themselves is whether
the language and its related characteristics (accent, speed of delivery,
degree of oral features and so on) reflect a real-life speaking and listening
event. Many of the recordings to be found on EFL websites do not meet
these criteria; this is because the materials have often been developed
with the purpose of language learning and as such the speed of delivery
has often been slowed down or the language simplified artificially. If your
aim in developing a listening test is to obtain an accurate picture of your
test takers' ability to understand real-life input, then it is strongly recom-
mended that these sources be avoided (see Fehérváryné and Pižorn 2005,
Appendix 1, 2.1.2).
When selecting sound files remember that it is not necessary that every
word be familiar to the target test population; provided that the unknown
words are not seminal to understanding the majority of the sound file
(if this is the case, it should be picked up during the textmapping
procedure; see Chapter 3), this should not be a problem. On the other
hand, where there are a significant number of new or unfamiliar words,
the listener is likely to be overwhelmed very quickly and processing is
likely to break down.
Although test takers (and some teachers) may initially react in a nega-
tive way to the use of authentic sound files in listening tests, by using
them we are not only likely to get a more reliable test result but also add
validity to the test scores. As Field (2008: 281) states:
A switch from scripted to unscripted has to take place at some point, and may,
in fact, prove to be more of a shock when a teacher postpones exposure to authen-
tic speech until later on. It may then prove more, not less, difficult for learners to
adjust, since they will have constructed well-practised listening routines for
dealing with scripted and/or graded materials, which may have become
entrenched.
2.5.1.3 Quality
In real life listening, we sometimes have to struggle with input that is not
at all clear; announcements, especially those on planes, are often indis-
tinct or distorted. We have to ask ourselves, though, whether it would
be fair to assess our test takers' listening ability under such conditions.
While this may be appropriate in some professions (those working in
the aviation field, for example, do have to be able to understand unclear
speech), for the majority of test takers this is not the case, and there
should be a clearly justifiable reason for including sound files that fall
into this category in a test.
Background noise, on the other hand, is ubiquitous and to avoid
including at least some sound files with background noise in a test would
not be reflecting reality. What the test developer has to determine is
Obviously, the sound file must be in line with the targeted level of the
test. Due to the difficulties involved in finding appropriate sound files,
some test developers resort to using a sound file which is easier and make
up for this by producing items which are more difficult. Thus when the
sound file and items are combined they represent the targeted level. This
procedure means, however, that it is the items that have become the focus
of the test rather than the sound file itself. In reality, it should be the
sound file that is the real test the task is merely a vehicle which allows
the test developer to determine whether the test takers have compre-
hended it. Field (2013: 141, 144) cautions test developers against using
this procedure:
The fact is that difficulty is being manipulated by means of the written input
that the test taker has to master rather than by means of the demands of the
auditory input which is the object of the exercise. … item writers always face
a temptation, particularly at the higher levels, to load difficulty onto the item
rather than onto the recording.
Similarly, if the sound file is, for example, B2 but the items are B1,
the construct is unlikely to be tested in a reliable way, as the items
are not targeting the listening behaviour at the appropriate level. Of
course, it must be acknowledged that it is very difficult to ensure that
all items in a B2 task are targeting B2; in fact, it is more than likely
that in a task consisting of eight items, at least one is likely to be either
a B1 or a C1 item. This is where procedures such as standard setting
and establishing cut scores are very useful (see 7.2) as these items can
then be identified.
2.5.1.5 Topics
starts with the speaker providing a clear overview of the areas s/he is going
to touch on, and which then proceeds to use clear discourse markers, is
felt to be easier than one where the speaker meanders through the talk
with apparently little direction and includes multiple asides. However,
Révész and Brunfaut (2013) report that the few research studies which
have explored the effect of cohesion on listening difficulty have produced
mixed findings.
There are a number of reasons for including more than one sound file
in a test. First of all, including several sound files means you can expose
test takers to different discourse structures, topics and speakers. Secondly,
each new sound file provides the test taker with a fresh opportunity to
exhibit his/her listening ability; thus, if for some reason a test taker reacts
poorly to one particular sound file, there will be another opportunity to
exhibit his/her listening ability. Thirdly, using more sound files in a test
makes it possible to use different sound files for different types of listen-
ing behaviour (see Chapter 3). Fourthly, the inclusion of a number of
sound files is likely to reduce the temptation to overexploit a single sound
file by basing all the listening items on one piece of input.
Test developers need to decide whether the test will use only sound files
or video clips as well, and whether these should be of the talking head
variety and/or content-based. These issues were discussed in 1.5.3.4. The
decision as stated there is often a practical one; to make it fair to all,
the test takers need to have equal access to the input, ideally provided
through individual screens at the desk where they are taking the test.
This, in many testing situations, is simply not a practical option.
A convincing case can be made for both approaches, depending upon factors
such as test purpose, cognitive demand, task consistency, sampling and practi-
cality, all of which reflect the need to balance competing considerations in test
design, construction and delivery.
Let's look in more detail at some of the issues involved. First of all, we
need to ask ourselves to what extent will listening once or twice impact
on the type of listening behaviour employed by the listener, and, by
extension, what effect will that have on the cognitive validity of the test?
Fortune (2004) suggests that listeners tend to listen more attentively if
they know they are only going to hear the input once. Reporting on
research carried out by Buck (1991) and Field (2009), Field (2013: 127)
suggests that test takers carry out different types of processing (lower-
and higher-level) when given the opportunity to listen twice. On the
first listening, they are establishing the approximate whereabouts of the
relevant evidence in the sound file and possibly making initial links with
one or more of the items. On the second listening, the actual position
of the information is confirmed and the initial answer(s) reviewed and
either confirmed or changed. Field also adds that, given the cognitive
demands on the test taker (processing the input and confirming/eliminat-
ing distracters) plus the lack of visual and paralinguistic clues, this
argues for being able to listen twice, as it goes way beyond the cognitive
demands of the real-life listening context.
On the other hand, where test takers simply need to identify specific
information or an important detail in a sound file, it seems reasonable to
argue that this should be achievable on the basis of listening once only.
The amount of content that needs to be processed in order to complete
an item is much less, and from a processing point of view should be less
demanding, than trying to infer propositional meaning. Where test takers
are allowed to listen twice, it becomes very difficult for the test developer to
create such selective listening items at higher levels of ability as the test tak-
ers know they will hear it all again if they miss the required information on
the first listening (see the discussion on Task 5.6, Chapter 5). This, in turn,
can result in the test developer making the items more difficult than they
should be by targeting more obscure (and possibly less important) details.
A second issue which should be considered is that playing every sound
file twice in a listening test takes up a lot of time, and consequently means
that there will be less time for other sound files. This could impact on the
construct coverage, as there may be insufficient time to play a range of
sound files targeting different types of listening behaviour and reflecting
different input types, topics and discourse styles.
Thirdly, and the oft-quoted argument, is that in real life we rarely listen
to the same sound file twice unless it is something we have downloaded
from the internet and/or been given for study purposes. Even in situ-
ations where we are able to ask for clarification from the speaker, s/he
generally reformulates what has been said in order to make the message
clearer. There are also many occasions where even if we do not hear the
input again, we can manage to complete any gaps by using our ability to
infer meaning.
Having said all of the above, there are, of course, counterarguments.
In real life listening, we are not usually asked to simultaneously com-
plete what can be a detailed and demanding task, potentially including a
2.5.2 Task
The test specifications should define how the instructions are written; for
example, clear, simple and short instructions. They should also indicate
the language in which the instructions should be presented, that is L1
or L2, and whether an example should be included (see 4.2 for argu-
ments regarding this issue as well as the importance of using standardised
instructions).
The test methods which are felt to be suitable for testing listening need to
be agreed upon and added to the test specifications. Due to the fact that
there is no written text for test takers to refer to, the role of memory must
be carefully considered:
The total number of listening items needed depends on the type of test
that is being developed. For example, if the test is a uni-level test, that
is, with just one difficulty level being targeted, the number is likely to be
fewer than if it is a bi-level (two levels, say B1-B2) or a multi-level test
such as might appear in a proficiency test which has been developed to
handle a heterogeneous test population.
The targeted level of difficulty will also impact on the number of items;
the higher the level of proficiency, the more complex the construct is
likely to be, and thus the need for more items reflecting the different types
of listening behaviour that it will attempt to measure. The purpose of the
test (achievement versus proficiency) and the degree of stakes involved
(classroom test versus university entrance test) should also be taken into
account. Based on a wide range of test development projects, experience
has shown that at the higher end of the learners' ability spectrum, 25 to
30 well-constructed test items should provide a reasonable idea of a test
takers listening ability. At the lower end, where the test construct is less
diverse, 10 to 15 items may be sufficient.
On the issue of how many items there should be in a task, many test
development teams feel that there should be a minimum of five items in
order to make efficient use of the time available in the listening test. This
would mean that in order to assess listeners' ability to identify the gist,
a number of snippets would need to be included in one task in order to
have a sufficient number of items (see Into Europe Assessing Listening
Task 44 for an example of this kind of task).
The number of tasks, like the number of items, will depend on whether
you are aiming to develop a uni-level, a bi-level or a multi-level test. It
will also be linked to the level of difficulty: the higher levels of abil-
ity will require more tasks due to the complexity of the construct being
targeted. For example, if you wish to develop 25 to 30 items, four tasks
with approximately seven to eight items in each would be optimal (see
also 4.3).
The final part of the test specifications focuses on the criteria of assess-
ment that raters employ when marking the test takers' responses. In
listening this is generally much less complex than it is for speaking or
writing as no rating scale per se is needed. The key should, however, be
as complete as possible. Field trials (see Chapter 6) help enormously in
terms of providing alternative answers to the key for short answer items;
trials can also be useful in putting together a list of the most common
unacceptable answers. This should speed up the time needed to rate the
answers and should also increase marker reliability.
2.7 Summary
Many of the issues raised in this chapter will be revisited in Chapter 3,
which looks at a procedure that can be used to exploit sound files, and
Chapter 4, which takes the results of those procedures and explores how
they can be transformed into tasks.
To complete this chapter on the issue of how test specifications can
help, Figure 2.6 provides a summary of the type of information you
should have answers to before beginning any work on task development:
[Figure 2.6: summary of the information the test specifications should provide, e.g. overall purpose (to assess the test takers' ability at level X, in accordance with the construct), input characteristics (including background noise), the test methods to be used (e.g. multiple choice), and the instructions (clarity, inclusion of an example).]
DLT Bibliography
Alderson, J.C. (2000). Assessing reading. Cambridge, UK: Cambridge University
Press.
Alderson, J.C., Clapham, C., & Wall, D. (1995). Language test construction and
evaluation. Cambridge: CUP.
Bachman, L.F. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Buck, G. (1991). The testing of second language listening comprehension.
Unpublished PhD thesis, University of Lancaster, Lancaster, UK.
Buck, G. (2001). Assessing listening. Cambridge Language Assessment Series.
Eds. J.C. Alderson and L.F. Bachman. Cambridge: CUP.
Davidson, F., & Lynch, B.K. (2002). Testcraft: A teacher's guide to writing and
using language test specifications. New Haven: Yale University Press.
Ebel, R.L. (1979). Essentials of educational measurement (3rd ed.). Englewood
Cliffs, NJ: Prentice-Hall.
Ebel, R.L., & Frisbie, D.A. (1991). Essentials of educational measurement (5th
ed.). Englewood Cliffs, NJ: Prentice-Hall.
Fehérváryné, H. K., & Pižorn, K. Alderson, J. C. (Series Ed.). (2005). Into
Europe. Prepare for modern English exams. The listening handbook. Budapest:
Teleki László Foundation. See also http://www.lancaster.ac.uk/fass/projects/
examreform/Media/GL_Listening.pdf
Field, J. (2008). Listening in the language classroom. Cambridge: Cambridge
University Press.
Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.),
Examining listening. Research and practice in assessing second language listening
(pp.77-151). Cambridge: CUP.
Fortune, A. (2004). Testing listening comprehension in a foreign language: Does
the number of times a text is heard affect performance? MA Thesis, Lancaster
University.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment. NewYork:
Routledge.
Geranpayeh, A., & Taylor, L. (Eds.) (2013). Examining listening. Research and
practice in assessing second language listening. Cambridge: CUP.
Griffiths, R. (1992). Speech rate and listening comprehension: Further evidence
of the relationship. TESOL Quarterly, 26, 283-391.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test
items. Oxon: Routledge.
Harding, L. (2011). Accent and Listening Assessment. Peter Lang.
Harding, L. (2012). Accent, listening assessment and the potential for a shared-
L1 advantage: A DIF perspective. Language Testing, 29, 163.
Henning, G. (1987). A guide to language testing: Development, evaluation,
research. Cambridge, MA: Newbury House.
Lewkowicz, J.A. (1996). Authentic for whom? Does authenticity really matter?
In A. Huhta, V. Kohonen, L. Kurki-Suonio & S. Luoma (Eds.), Current
developments and alternatives in language assessment. Proceedings of LTRC,
pp.165-184.
Révész, A., & Brunfaut, T. (2013). Text characteristics of task input and diffi-
culty in second language listening comprehension. Studies in Second Language
Acquisition, 35 (1), 31-65.
Tauroza, S., & Allison, D. (1990). Speech rates in British English. Applied
Linguistics, 11, 90-195.
White, G. (1998). Listening. Oxford: Oxford University Press.
3
How do we exploit sound files?
decisions about the sound files they want to use based on their own indi-
vidual teaching needs and interests, or the perceived needs of their students.
My second question focuses on whether as test developers they have
ever faced any problems with the procedure(s) they have followed. Their
answers are usually in the positive and are associated with their students not
being able to answer the questions for one reason or another; or producing
totally different responses from those which had been expected. My third
question is aimed at finding out whether their colleagues would target the
same part(s) of the sound file if they wanted to use the same sound file
to develop a task. The responses on this occasion are often rather vague
and unsure, possibly because, for practical reasons, many test developers
and teachers tend to create their own tasks and rarely work in teams. My
fourth question then asks them to consider whether listeners in general
would target or rather take away the same information. Responses sug-
gest that the test developers are not sure that everyone would take away
the same information and/or details when listening to a sound file.
In light of the last response, my final question to the test developers
focuses on whether different listeners taking away something different
from a sound file is a problem. The test developers usually confirm that
if this happened in a teaching situation it would be seen as productive,
as it could lead to discussion among the students. They add, however,
that in a testing scenario it could be problematic in terms of determining
which interpretations should be considered right and which should be
considered wrong.
Research in the 1980s into how the meaning of a written text was
constructed by the reader suggested a continuum ranging from a passive
acceptance of the ideas in the text to an assertive rewriting of the author's
message (Sarig 1989). This differing approach to texts, and by extension
to sound files, has obvious implications for test development in terms
of deciding which interpretations made by a reader or a listener can be
accepted as being correct and which incorrect.
While the argument put forward by Sarig (1989: 81) that More leeway
should be left for interpretations which never occurred to test developers
seems a reasonable one, it should perhaps take into account Alderson
and Short's (1981) belief that although individual readers may interpret
a text in slightly different ways, a consensus among readers would help
to define the limits on what a given text actually means. This position
is also supported by Urquhart and Weir (1998: 117) who argue that,
'When constructing test tasks, testers need to operate with a consensus as to
what information readers may be expected to extract from a text'. Nuttall
(1996: 226) suggests that 'a useful technique for deciding what mean-
ing to test in a text is to ask competent readers to say what the text means'.
Experience has indeed shown that involving students in such a process is
highly informative for the teacher and/or test developer as well as enjoy-
able for the students.
3.2 A procedure for exploiting sound files: Textmapping
So what is textmapping? Textmapping is a systematic procedure which
involves the co-construction of the meaning of a sound file (or text). It is
based on a consensus of opinion as opposed to an individual interpreta-
tion of a sound file (or text). It uses the sound file and not the transcript
as the basis for deciding what to focus on, since working from the transcript
encourages less attention being paid to what a listener, as opposed to a
reader, might actually take away. In addition, as there are no time indicators in a tran-
script, the reader has no real idea of the speech rate of the speaker(s)
or the amount of redundancy present, and is completely unaware of the
extent to which words may have been swallowed or not stressed by the
speaker(s). As Lynch (2010: 23) states:
a transcript and the speech it represents are not the same thing, the original is a
richer, contextualized communicative event.
Further support for this approach comes from Field (2013: 150):
It is also important that the physical recording rather than the script alone
should form the basis for the items that are set, enabling due account to be
taken of the relative saliency of idea units within the text and of aspects of
the speaker's style of delivery that may cause difficulty in accessing
information.
It is crucial that test writers map a text whilst listening to it in advance of writ-
ing the questions in order to ensure they do not miss out on testing any of the
explicit or implicit main ideas or important details, where this is the purpose of
the listening exercise.
an item. Nor is it their job to decide whether something in the sound file is
so obvious that it can never be tested, and thus choose not to write it down.
Such decisions come later. The textmapper's job is simply to document what
they take away from a sound file while employing the type of listening behav-
iour they have been asked to use by the person who provided the sound file.
So how does it all work? Sections 3.3, 3.4, and 3.5 describe the procedures
that should be followed when textmapping for gist, specific information
and important details, and main ideas and supporting details respectively.
Identifying the gist of a sound file basically requires the listener to synthe-
sise the main ideas or arguments being put forward in order to come up
with the overall idea the speaker is attempting to get across. For example,
the listener might be asked to identify the gist of a report on a recent
natural disaster, or that of a short speech made by the principal at the
beginning of the academic year, or someone's overall opinion of a newly
introduced agricultural policy. Inviting a small group of textmappers to
do this helps to minimise any individual idiosyncrasies that might have
been taken away by a single test developer.
Before starting the textmapping process, however, it is first of all essential
to check everyone's understanding of the term gist, as this is very often
confused with the terms topic and/or summary. The most practical way
to do this is to focus on the number of words that are likely to be involved.
For example, it could be argued that the topic is often summed up in just
two or three words; a summary, on the other hand, usually requires a num-
ber of sentences; while the overall idea often needs something in between in
terms of length. Asking textmappers to use between 14 and 20 words (10
words minimum) often helps to guide them towards identifying the gist,
rather than the summary or the topic. (The number of words will of course
depend to some extent on the length and density of the sound file used.)
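Expressed as a rough rule of thumb, this word-count heuristic can be sketched in a few lines of Python (purely illustrative; the function name and the exact cut-off points are assumptions based on the figures above, not part of the textmapping procedure itself):

def classify_by_length(textmap_response):
    # Rough heuristic based on the word counts discussed above:
    # a topic is usually two or three words, a gist roughly 10-20 words,
    # and anything much longer starts to look like a summary.
    words = len(textmap_response.split())
    if words <= 3:
        return "probably a topic"
    elif 10 <= words <= 20:
        return "probably a gist"
    elif words > 20:
        return "probably drifting towards a summary"
    else:
        return "borderline - discuss with the group"

print(classify_by_length("Earthquake in Peru"))  # probably a topic
print(classify_by_length("A major earthquake in Peru destroyed buildings "
                         "and killed people; the Red Cross organised help quickly"))  # probably a gist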
Secondly, in order to encourage a focus on the gist of the sound file
rather than the details, the textmappers should be instructed that they
are not allowed to write anything down during the exercise. Thirdly, it is
important that they be made to understand the importance of remaining
quiet, not only while listening to the sound file, but immediately after-
wards when they write down the overall idea. This silence is crucial in the
textmapping procedure due to the amount of information the working
memory can retain at any one time. This content can easily be dislodged
by an inadvertent comment from one of the textmappers. Another reason
for remaining silent at this stage is to minimise any possible influence on
what an individual textmapper might write down.
Finally, just before beginning the textmapping session, it should be
made clear that there is no such thing as a right or wrong textmap;
it is more than possible that an individual textmapper could take away
something quite different from another due, for example, to their own
personal interpretation or reaction to the sound file. This does not make
it wrong, just different.
Once the textmappers are clear as to what they have to listen for, and
how they are going to do this, provide them with the context of the
sound file so that they can activate any relevant schema and not go into
the listening event cold. Then remind them of the key points (put them
on screen if possible) (see Figure 3.1).¹
The sound file should then be played once only regardless of how
many times it will be played in a future task. This is due to the fact that
repeated exposure would give the textmappers a far richer picture of the sound
file than any test taker could reasonably be expected to take away.
¹ The sound file for this example is Track 6, CD2 (Task 30) Into Europe Listening. For textmapping purposes, the sound file was started at the end of the instructions (at 30 seconds). The sound file can be found at: http://www.lancaster.ac.uk/fass/projects/examreform/Pages/IE_Listening_recordings.htm.
The next stage involves comparing what each listener has taken away from
the sound file to see whether there is a consensus. In textmapping, high
consensus is defined as n-1, so if there are six textmappers, five of them
(83 per cent) should have written down approximately the same thing.
Low but still substantial consensus (Sarig 1989) would constitute approxi-
mately 57-71 per cent agreement. Checking for consensus will obviously
involve some negotiation, as the textmappers will have used different
words in phrasing the gist due to the transient nature of the input as well
as influence from their own personal lexicons and background knowledge.
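To make these thresholds concrete, the following short Python sketch (illustrative only; the function name and the way the bands are coded are assumptions, not part of the procedure itself) classifies the level of agreement for a given number of textmappers:

def consensus_level(agreeing, total):
    # High consensus: at least n-1 of the textmappers agree.
    # Low but still substantial consensus: roughly 57-71 per cent agreement (Sarig 1989).
    proportion = agreeing / total
    if agreeing >= total - 1:
        return "high consensus"
    elif 0.57 <= proportion <= 0.71:
        return "low but substantial consensus"
    else:
        return "no consensus"

print(consensus_level(5, 6))  # five of six textmappers (83 per cent) -> high consensus
print(consensus_level(4, 6))  # four of six (67 per cent) -> low but substantial consensus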
The person who originally identified the sound file should collate the
results by asking each textmapper in turn what they have written down
and recording this verbatim. (The collator should remain silent about
his/her own textmap results until the very end of this process so as not
to influence the proceedings.) Textmappers should not change what they
have written in light of someone else's textmap. When the list of text-
maps is complete it might look something like the following:
[Figure 3.2: the gist textmaps collected from the group, for example: 'The Red Cross helps after a heavy earthquake caused major destruction …'; 'There was a major earthquake in South America lasting for about 2 minutes; buildings were destroyed, people were killed and injured; help was organised quickly.']
The textmappers should take a general look at these results and decide
whether or not there seems to be a consensus of opinion. Remember,
high consensus in textmapping constitutes n-1 so if there are six text-
mappers and only five have similar overall ideas this would still equate
to a consensus. Where textmappers feel that there is a consensus, they
should then be asked to look in more detail at the answers given in order
to identify communalities. For example, the highlighting in Figure 3.3
below shows a number of similarities across the textmaps.
The results reveal that where the textmappers have identified key words
(important details) as an essential part of the gist, for example, earthquake
or buildings, their answers are less varied as we would expect. However,
when it comes to describing what has happened (damage/destruction),
how strong the earthquake was (strong/massive/major/severe), or the aid
which was involved (Red Cross/help/emergency operations/rescue), there
[Figure 3.3: the same gist textmaps reproduced with the common elements highlighted.]
is more variation. This is partly due to the fact that as there is no written
word to rely on, listeners will employ different words based on their
personal schema and internal lexicons. Figure 3.4 shows the list of com-
munalities which suggests that there is consensus on the overall idea.
o earthquake
o damage / destroyed
o buildings
Where consensus has been achieved, the final steps in the textmap-
ping procedure involve deciding on an appropriate test method and
the development of the task itself. These issues will be dealt with in
Chapter 4.
1. Identify a suitable gist sound file. Make your own textmap the second time you listen to it.
2. Find at least three other people who have not been exposed to the sound file.
3. Explain that you want them to textmap the sound file for gist. Check their understanding of the term gist.
   o They should try to use between 14 and 20 words (depending on the length / density of the sound file).
5. Provide a general context to the sound file. Be careful not to give too much information away.
6. Play the sound file once only and then allow the textmappers time to write the gist.
7. Ask the textmappers to count the number of words they have written. This is useful in determining whether the textmappers have identified the gist or the topic.
8. The person who originally identified the sound file should then record what each textmapper has written. If this can be projected onto a screen so all can see, this helps; if not, gathering around the computer screen may also work.
9. The group should carry out an initial general review of the gists to see if there is a consensus.
10. Where this is not the case, it would suggest that the sound file does not lend itself either to gist or to one interpretation of the sound file. It may, however, be possible to use it for something else (see 3.5 Textmapping for Main Ideas below).
12. Identify the communalities across the textmaps (including any optional words if these occur). This list should form the basis of the targeted answer.
13. The textmap results should be added to a textmap table (see Figure 3.5 above).
14. A suitable test method should be identified and task development work should begin.
Where a task is to be based on a number of related short sound files (snippets), each snippet should be textmapped separately. Once all
the sound files have been textmapped, discuss the results in the same
way as in the 'Natural disaster' example above. If there is too much
overlap in the gist textmaps regarding two of the snippets, one of
them may have to be dropped. This procedure should not be used for
a continuous piece of spoken discourse where there is no logical reason
for segmenting it.
3.4 Textmapping for specific information and important details (SIID)
3.4.1 Defining the listening behaviour
[Figure 3.7: examples of specific information (for example, prices) and of important details.]
² The sound file for this example is taken from the VerA6 project, Germany and can be found on the Palgrave Macmillan website.
For the same reasons as mentioned above in the gist exercise, the textmap-
pers should be reminded of the importance of remaining quiet, not only
throughout the playing of the sound file but also immediately afterwards
when the textmappers write down the SIID they have taken away from
the sound file.
The sound file should be played once only regardless of how many
times it will be played in a future task. This is because overexposure to
the sound file is likely to result in more SIID being captured than any test
taker might fairly be asked to identify. Once everyone has completed his/
her list of SIID, it is useful to ask the textmappers to do two additional
things. Firstly, they should be asked to look through their lists and make
sure that the entries can be classified as specific information or important
details; ask them to refer to the information in Figure3.7 above or a simi-
lar list that you might have compiled. Anything not in the list needs to
be discussed (see 3.4.2) and if it is not SIID should be deleted. Secondly,
the textmappers should be asked how many entries they have managed
to write down. A smaller than expected number might be interpreted as
suggesting that the sound file does not really lend itself to SIID (or that
the textmapper has not textmapped for the right type of information).
A larger than expected number might mean that the list still contains
entries that are perhaps not what would be classified as SIID.For exam-
ple, there might be verbs or partial ideas in the list of entries that have
been written down.
As with gist, the next stage in the SIID textmap procedure involves com-
paring what each listener has written down to see whether a consensus
has been reached. This is likely to involve much less negotiation than gist,
as SIID tends to be more concrete. Textmappers sometimes have prob-
lems with remembering numbers accurately unless they can write them
down as they listen (see 3.4.5 for an alternative SIID procedure below)
and the test developer must use his/her discretion to decide whether to
accept very similar numerical combinations, given that in a real-life listening
situation the listener would rarely be expected to recall such figures exactly.
SIID Consensus
1. Dad 11/14
2. John 12/14
3. Airport 13/14
4. 30 minutes 13/14
5. Taxi 12/14
Once the results have been collated, the textmappers must decide whether
there are sufficient items to make it feasible to turn them into a task. In
order to do this, the distribution of the SIID within the sound file needs
to be taken into account. The easiest way to do this is shown in Figure 3.9
below:
Only those parts of the textmap being targeted in the example and
actual items should have information in the Target column. Thus
above, 0 (representing the example) is opposite John, and Q1 and
Q2 are opposite Airport and Taxi respectively. Six seconds is a rela-
tively short time between items but if the test method is a multiple
choice picture task where, for example, the test takers simply have to
recognise the correct venue and mode of transport, then it may prove
sufficient.
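By way of illustration, the kind of distribution check described above could be sketched as follows (the entries, timings and item labels are hypothetical, not those of the actual sound file):

# Hypothetical textmap entries: (entry, time in seconds from the start, target label or None)
textmap = [
    ("Dad", 4, None),
    ("John", 10, "0"),       # the example item
    ("Airport", 16, "Q1"),
    ("30 minutes", 19, None),
    ("Taxi", 22, "Q2"),
]

# Keep only the targeted points and report the time between consecutive ones
targeted = [(label, t) for entry, t, label in textmap if label]
for (prev_label, prev_t), (label, t) in zip(targeted, targeted[1:]):
    print(prev_label, "->", label, ":", t - prev_t, "seconds between items")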
The final step in the use of the SIID textmap results is deciding on
an appropriate test method and the development of the task itself (see
Chapter4).
1. Identify a suitable SIID sound file and produce your own textmap.
2. Find at least three other people who have not been exposed to the sound file.
3. Explain that you want them to textmap the sound file for SIID and check their understanding of what specific information and important details mean.
5. Provide a general context about the sound file. Be careful not to give too much information away.
6. Play the sound file once only and then allow the textmappers time to write a list of SIID.
7. Ask the textmappers to count the number of SIID they have written. This is useful in determining whether the sound file works for SIID and/or whether the textmappers have mapped for the appropriate type of information. They should also check that their entries can be classified as specific information or important details.
8. The first textmapper should be asked to read out an entry s/he has written down and the other textmappers asked if they have it. The total number should be written next to the entry, for example, Dad 11/14, so that a consensus can be verified or not. The second textmapper should then be asked for his/her next entry and the same process repeated.
9. The list of SIID and their degree of consensus should be discussed and a decision made as to whether the sound file provides a sufficient number of SIID to warrant making a task.
10. Where this is not the case, it would suggest that the sound file does not lend itself to SIID. It may, however, be possible to textmap it for something else (see 3.6 below).
11. The textmap results should be added to a textmap table (see Figure 3.10 above) and the time added in order to check for sufficient redundancy between potential items.
12. A suitable test method should be identified and task development work should begin.
As with gist, where there are a number of related short sound files, for
example, different messages on an answer machine, each sound file should
be textmapped separately, the SIID written down at the end of each one
and then the findings discussed sound file by sound file.
3.5 Textmapping for main ideas and supporting details (MISD)
3.5.1 Defining the listening behaviour
[Example illustrating a main idea and a supporting detail: 'Ferguson was a very skilful player in his youth. He was a top goal …']
The person responsible for the sound file should collate the textmaps by ask-
ing the first person in the group for the first main idea/supporting detail they
have written down. Once recorded, the others should be asked if they have
the same point and then the number of people, for example, 5/6, should be
added. It should be noted that this procedure involves some negotiation due
to the paraphrasing the various textmappers will have used. Those options
which have the same meaning should be accepted. The next textmapper
should then be asked for his/her next main idea/supporting detail and the
above process repeated. This method should be followed for all the MISD
that the textmappers have written down. Once again, it is possible that the
order in which the MISD are discussed will differ slightly among the text-
mappers; this can be rectified once the ideas are moved to the textmap table.

³ The sound file for this example is Track 4, CD1 (Task 21) Into Europe Listening. For textmapping purposes, the sound file was started at the end of the instructions (at 34 seconds). The sound file can be found at: http://www.lancaster.ac.uk/fass/projects/examreform/Pages/IE_Listening_recordings.htm.
While collating the results of the textmap, you may find a split in the
consensus (for example, 2:2) between those who have written down the
main idea and others who have identified the related supporting detail.
For example, in this particular sound file, some textmappers might have
written: 'She doesn't come from a rich family background' (= main idea)
while others might have identified: 'She saved her money for flying lessons'
(= supporting detail). Such a result would mean that there is no consen-
sus on either the main idea or the supporting detail. However, it seems
reasonable to argue that it was simply a personal choice as to which part
was written down and that where this happens the test developer could
combine the textmapping results and then decide which aspect to focus
on in the item.
Once all the MISD have been discussed, the textmappers again need to
review the total number of points on which consensus has been reached
in order to decide whether these are sufficient to make developing a task
worthwhile, taking into consideration the length of the sound file. If the
answer is in the positive, the next thing that needs to be checked is the
distribution of the textmapped points. Again, putting these into a table
helps. Unlike SIID, a main idea is likely to take more than one second
to be put into words. The complete amount of time
taken should appear in the table so as to provide as accurate a picture
as possible regarding the amount of time occurring between each of the
textmapped points:
[Figure 3.13: textmap table for the MISD example, listing the textmapped points with their timings, including fragments such as '… the moon', '… commander', '… the astronauts', '… a part-time job' and 'G. (At this age) the child doesn't realise the risk involved' (02.53-03.08).]
You will note that the above table includes points on which the text-
mappers did not have a consensus; some test developers find this useful
information to record so that they can avoid tapping into it when they
are developing items located nearby in the sound file. It also acts as a
reminder as to why a certain part of the sound file has not been targeted.
Figure 3.13 reveals that, in some cases, as one idea finishes, another
begins. This brings to light the issue of how much time test takers
need between items in order to complete them. The answer to this
is dependent on a number of factors. Firstly, the test method; for
example, if the test taker is confronted with a multiple choice item,
s/he may need more time due to the amount of reading involved, as
opposed to an item which simply requires a one-word answer, for
example, taxi in the SIID example above. Secondly, the type of lis-
tening behaviour; in general, items focusing on main ideas are likely
to require more redundancy than those focusing on SIID as more
processing time will be needed, especially if the task requires the test
takers to infer propositional meaning. Thirdly, the difficulty level of
the sound file and task, the type of content (concrete versus abstract)
and the topic will also impact on the amount of time needed. With so
many variables involved, it is very difficult to recommend an appro-
priate amount of time needed between items, and this is one of the
many reasons why peer review (see 4.5.1) and field trialling are so
important (see 6.1).
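Although, for the reasons just given, no single figure can be recommended, a team that has agreed a working minimum gap for a particular test method could run a rough check of the following kind over the textmap table (the timings, item labels and the 20-second threshold are all hypothetical):

# Hypothetical textmapped points: (item label, start time in seconds, end time in seconds)
points = [
    ("Q1", 35, 48),
    ("Q2", 50, 66),    # starts only 2 seconds after Q1 ends
    ("Q3", 95, 110),
]

MIN_GAP_SECONDS = 20   # a working minimum agreed by the team, not a recommendation

for (label_a, _, end_a), (label_b, start_b, _) in zip(points, points[1:]):
    gap = start_b - end_a
    if gap < MIN_GAP_SECONDS:
        print("Check", label_a, "->", label_b, ": only", gap, "seconds of redundancy")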
1. Identify a suitable MISD sound file and carry out your own textmap.
2. Find at least three other people who have not been exposed to the sound file.
3. Explain that you want them to textmap the sound file for MISD and check their understanding of what main ideas and supporting details mean.
   o Textmappers can write while listening and should try to use the words of the speaker(s) where possible.
6. Play the sound file once only and then allow the textmappers time to finish writing.
7. Ask them to read through what they have written, to finalise any notes and to confirm that what they have written down is MISD and not SIID or the gist.
8. Ask the textmappers to count the number of MISD they have written down. This is useful in determining whether the sound file has sufficient ideas on which to develop a task.
9. The first textmapper should be asked to read out the first MISD s/he has written and the other textmappers asked if they have it. The total number should be written next to the MISD, for example, 5/6, in order to confirm whether there is a consensus or not. The second textmapper should then be asked for the next point, and the results recorded in the same way. This procedure should be repeated for all the MISD that have been written down.
10. The list of points and the degree of consensus should then be discussed and a decision made as to whether the sound file provides sufficient points to warrant developing a task.
11. Where this is not the case, it would suggest that the sound file does not lend itself to MISD.
12. The textmap results should be transferred to a textmap table and the time added in order to check for sufficient redundancy between potential items.
13. A suitable test method should be identified and the task development work should begin.
3.6 Re-textmapping
Sometimes the initial textmap does not work for one reason or another,
for example because of disparate or insufficient entries. If the sound file
was textmapped for SIID, based on memory only, it is possible to textmap
it again to see if it would work for careful listening, that is, MISD.This
is particularly useful given the amount of time it takes to find a suitable
sound file. The important issue to remember here is the order in which
the textmapping procedures take place; that is, it should move from selec-
tive to careful. Once a sound file has been textmapped for MISD, it
cannot be re-textmapped for gist as the file is too well known and the
textmapped gists would reflect this. Thus any re-textmapping should only ever move in one direction: from selective towards careful listening, never back towards gist.
1. The difficulty level of the sound file in terms of its density, speed of
delivery, lexis, structures, content (abstract versus concrete), back-
ground noise and so on.
2. The topic of the sound file in terms of its appropriateness for the tar-
get test population (the level of interest, its accessibility, gender/age/
L1 bias).
3. The length of the sound file in terms of its appropriateness to the test
specifications and to the construct being targeted.
If the sound file is inappropriate for whatever reason, the test developer
who found the sound file must be told. Not doing so will waste everyone's
time and energy as the sound file will be deemed appropriate for task
development and more people than just the test developer will spend
time on it as the task moves from draft to peer review to trial.
3.8 Summary
Textmapping is not a foolproof system; involving human judgements as
it does, it cannot be. Having said that, it does provide a more systematic
approach to deciding how best to exploit a sound file and, if the pro-
cedure is followed carefully, goes some way to minimising some of the
idiosyncrasies that test developers may unwittingly introduce into the
assessment context. It certainly makes those involved much more aware
of what they are testing in terms of the construct and why. It also argues
for a fairer test, taking into account as it does the necessary redundancy
required when asking test takers to complete a task at the same time as
listening to a sound file. Using the sound file to carry out textmapping as
opposed to a transcript also acknowledges the true nature of the spoken
word. As Helgesen (quoted in Wilson 2008: 24) so succinctly puts it:
DLT Bibliography
Alderson, J. C., & Short, M. (1981). Reading literature. Paper read at the B.A.D. Conference, University of Lancaster, September.
Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining listening: Research and practice in assessing second language listening (pp. 77-151). Cambridge: Cambridge University Press.
Lynch, T. (2010). Teaching second language listening: A guide to evaluating, adapting, and creating tasks for listening in the language classroom. Oxford: Oxford University Press.
Nuttall, C. (1996). Teaching reading skills in a foreign language. London: Heinemann.
Sarig, G. (1989). Testing meaning construction: Can we do it fairly? Language Testing, 6(1), 77-94.
Urquhart, A., & Weir, C. J. (1998). Reading in a second language. Harlow: Longman.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. New York: Palgrave Macmillan.
Wilson, J. J. (2008). How to teach listening. Harlow: Pearson.
4
How do we develop a listening task?
This chapter focuses on the next set of stages that a task needs to go
through once a sound file has been successfully textmapped. These
include:
The TI should include, among other things, a record of the textmap
results (see Chapter 3). It is also useful for the task reviewer(s) later on in
the test development cycle (see 4.5). Based on the sound file Earthquake
in Peru, which was discussed in 3.3, the TI would appear as shown in
Figure 4.1 below.
[Figure 4.1: the completed TI for the 'Earthquake in Peru' gist task, with fields including CEFR Focus (B1.4 Gist), Source (URL: http://www.lancs.ac.uk/fass/projects/examreform/), date when downloaded, length of sound file (2.43), speed of delivery (approximately 180 words per minute) and the standard* being targeted (*name as appropriate, for example, STANAG, ICAO, National Standards inter alia).]
Test developer: to save time, use the test developer's initials, for example, HF.
CEFR Focus: select the appropriate descriptor(s) from the test specifications that describe the listening behaviour(s) your task is attempting to measure. For example, in Figure 4.1 the CEFR descriptor B1.4 is indicated. This is the fourth CEFR descriptor in this particular version of the B1 test specifications (hence B1 point 4) and the one that relates to the testing of gist. If there is more than one relevant descriptor, list them in terms of priority. This part of the TI is very important as it concerns the construct.
General Focus: complete this with the listening behaviour(s) your
task is attempting to measure (see Figure 2.4), for example here you
can see Gist. This part is also very important. (See 2.4 for a discussion
as to why both the CEFR Focus and the General Focus are included in
the TI.)
Levels: should include information about the perceived levels of both the sound file and the items. If you feel that the sound file and/or the items might cover more than one level, include both, for example, B1/B1+. It is expected that these levels will be the same or very close. Remember that where there is a marked difference (for example, the use of a more difficult sound file), even easy items will not help (see 2.5.1.4).
Test method: state which one you hope to use in the task. Again for
quick and easy completion, use sets of initials, for example, SAQ (short
answer questions), MCQ (multiple choice questions), MM (multiple
matching) and so on.
Topic: select an appropriate topic from the list which appears in the
test specifications (see 2.5.1.5).
Title of the sound file/task: this should be the same for both the
sound file and the task to make matching the two easier, especially
during the peer review stages.
Source: the copyright of sound files, video clips (if used) and/or any
pictures has to be obtained (unless you are using copyright free
sources). This box should provide full details of the sound file source/
video clip, the date it was downloaded (in case it is withdrawn and you
need to cite it when asking for copyright permission) and similar
information about any pictures that may be included in the task. These
links also help the reviewer to check the source if questions arise
regarding the suitability of the materials (language issues, picture qual-
ity and so on).
Length of the sound file: this should be completed and be within the
parameters cited in the test specifications.
Speed of delivery: make sure this is in line with the parameters pro-
vided in the test specifications (see 2.5.1.12).
Date: the date this version of the task was completed. This should be
updated each time the task is revised.
Version: this number should be updated each timethe taskis revised.
This way the test developer and the task reviewer can keep note of any
changes which have been made.
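To show how the fields fit together, a TI could be represented in a simple structured form, as in the sketch below (the values echo the Earthquake in Peru example where possible; the dictionary format itself, the topic entry and the dates are illustrative assumptions rather than part of the book's template):

task_information = {
    "test_developer": "HF",                    # initials only, to save time
    "cefr_focus": ["B1.4"],                    # descriptor(s) from the test specifications, in priority order
    "general_focus": "Gist",                   # the listening behaviour being targeted
    "levels": {"sound_file": "B1", "items": "B1"},
    "test_method": "MCQ",                      # e.g. SAQ, MCQ or MM
    "topic": "Natural disasters",              # hypothetical entry from the test specification topic list
    "title": "Earthquake in Peru",             # the same title is used for the sound file and the task
    "source": "http://www.lancs.ac.uk/fass/projects/examreform/",
    "date_downloaded": "2016-05-10",           # hypothetical date
    "length_of_sound_file": "2.43",
    "speed_of_delivery_wpm": 180,
    "date": "2016-06-01",                      # hypothetical date of this version of the task
    "version": 1,
}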
Listen to two girls talking about their holiday in Mexico. Choose the correct
answer (A, B, C or D) for questions 1-7. The first one (0) has been done as an
example.
The instructions that are heard at the beginning of the sound file
should be the same as those that appear in the task in the test booklet.
This helps the test taker to engage in a non-threatening act of listening
before being faced with having to understand what is being said and
needing to respond to questions based on the sound file. The instructions
should also include information about how long the test takers have to
read the questions prior to the beginning of the actual recording and how
long they will have to review and complete their answers once the record-
ing has finished. For example,
You are going to listen to a programme about lead mining in north Yorkshire.
First, you will have 45 seconds to study the task below, and then you will hear
the recording twice. While listening, choose the correct answer (A, B, C or D)
for questions 1-8. Put a cross (X) in the correct box. The first one (0) has been
done for you. At the end of the task you will have 15 seconds to complete your
answers.
The amount of time which should be provided for reading and com-
pleting the items depends on a number of factors, such as the type of
test method, the test takers' level of familiarity with it and the num-
ber of items in the task. Multiple choice questions, for example, usually
take longer to read than sentence completion items. The amount of time
required at the end of the sound file depends to some extent on whether
the test takers hear the sound file twice. If it is only played once, they will
definitely need some time to review and complete their answers. When
in doubt, provide more time rather than less; this can be confirmed after
the trial (see 6.1.2).
Certain research (Field 2013; Buck 1991) suggests that test tak-
ers perform better when they are allowed to preview certain types of
items as they gain insights into what to listen out for in the sound file.
Wagner (2013), on the other hand, feels that further research in this
area is needed to confirm that item preview does help. (The possible
conflict that item preview may have with cognitive validity was dis-
cussed in 1.5.1.1.)
There are a number of things to bear in mind when selecting which test
method should be used in a listening task. First of all, and most importantly,
the test method should lend itself to the construct which is being targeted in
the task (see Haladyna and Rodriguez 2013: 43). Field (2013: 141) advises
caution in those situations where the test format is driving the thinking of test
designers and item writers rather than the nature of the construct to be tested.
In other words, the construct should come first, the test method second.
Secondly, the test developer must always be aware of the amount of
reading the test method requires the test taker to undertake in order to
answer the questions. To this end, the stems and options should be as
short as possible though not so short that they become cryptic. Thirdly,
the wording must be carefully crafted so that the test taker does not waste
precious seconds trying to understand what it means while simultane-
ously listening to the sound file and trying to identify the answer.
Choosing the most appropriate test method to measure the targeted
construct is not always obvious and experience shows that some tasks
need to go through two test methods before the task works. The reason
for this could be related to the nature of the sound file (lack of sufficient
detail for MCQ items), to the construct (difficult to develop items which
sufficiently target it) or to the test developer's own ability to work with
a particular method especially early on in their training. To some extent,
choosing the best test method is a matter of experience which becomes
easier with practice.
Developing items at higher levels, for example at CEFR C1 and above,
can lead test developers into using linguistically and propositionally com-
plex wording in their items in an attempt to match the perceived dif-
ficulty level. This has obvious consequences for the processing demands
faced by the listener. Field (2013: 150) reminds us that with construct and
cognitive validity at stake, it is vitally important to limit the extent to which
difficulty is loaded onto items particularly given that those items are in a
different modality from the target construct.
Each test method has its strengths and weaknesses; these are discussed
in turn below.
One test method that appears to work well in listening tasks is multi-
ple matching (MM). There are a number of different formats, includ-
ing: matching answers with questions, for example, in an interview (see
Chapter 5, Task 5.1); matching sentence beginnings with sentence endings
(see Chapter 5, Task 5.3); matching topics with a series of short sound files
(see Into Europe Assessing Listening: Task 44); or matching what is being
said to a range of pictures (see Into Europe Assessing Listening: Task 43).
MM tasks can be used to target different types of listening behaviour
(Field 2013: 132, 137). For example, if you want to target the test takers'
ability to infer propositional meaning, you could develop a task which
requires them to match the speaker's mood or opinion about a particular
subject to one of the options. If you want to assess main ideas compre-
hension, you can paraphrase the textmap results (see 3.5) and then split
them into two parts (sentence beginnings and endings). Testing impor-
tant details can also be targeted through matching (see Into Europe
Assessing Listening: Task 41).
MM tasks are compact with little redundancy and require much less
reading than MCQ items (Haladyna and Rodriguez 2013; Field 2013).
Another advantage of MM tasks is that they involve no writing and there-
fore reduce the chance of any construct irrelevant variance that writing
may bring to the task. Post-trial feedback in a number of countries has
shown that test takers appear to enjoy this particular method. This is con-
firmed by Haladyna and Rodriguez (2013: 74) who state that the format
is very popular and widely accepted.
Care must, however, be taken to ensure that, where sentence beginnings
and endings are used, the task cannot be completed simply through
the use of grammatical, syntactical or semantic knowledge without
listening to the sound file. This is an argument that is often raised
against using this type of MM task. (See Task 5.3 for an example of this.)
However, this can be minimised by careful wording of the sentence
beginnings and endings.
SAQ items require the test taker to produce an answer, rather than to
select one from a range of options. They are often referred to as con-
structed response items. When using this method, the test developer needs
to define what short means in their particular test situation. If you
have a look at SAQ tasks in general, you will probably find that they
require a maximum of five words. This means the item can be answered
in between one to five words depending on what is being targeted. This
limit is imposed in an attempt to minimise any construct irrelevant vari-
ance, deriving from the test taker's ability to write, from affecting his/her
performance on the listening task (see Weir 2005: 137). When targeting
SIID, the answer can often be written using one or two words, but with
MISD and gist, it is more likely that a minimum of three words will be
needed for the test taker to show that they have understood the idea.
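A quick way of checking a draft key against whichever word limit has been chosen is sketched below (the five-word limit and the sample answers are illustrative only):

MAX_WORDS = 5   # however "short" has been defined in the test specifications

draft_key = {
    "Q1": "airport",
    "Q2": "because she saved her own money for flying lessons",   # too long for an SAQ key
}

for item, answer in draft_key.items():
    length = len(answer.split())
    if length > MAX_WORDS:
        print(item, ": key answer is", length, "words - over the limit of", MAX_WORDS)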
There are two main types of SAQs: those that consist of closed ques-
tions, for example, 'When was John Smith born?', and those that require
completion (often referred to as sentence completion tasks), for example,
'John Smith was born in ____'. It is strongly recommended that the com-
pletion part be placed at the end of the sentence rather than in the mid-
dle (see sample Task 5.6 in Chapter 5). This is because there is a strong
possibility that test takers will engage in guessing strategies (Field 2013:
131), in other words attempt to apply their syntactical, grammatical and
semantic knowledge to complete a gap when it appears in the middle of
an item, rather than one that appears at the end. Table completion tasks
are a further option (see Into Europe Assessing Listening: Task 25).
MCQ tasks can also be used in listening, and like MM tasks, are useful
in targeting different processing levels (Field 2013: 128). In terms of dif-
ficulty, Innami and Koizumi (2009) found that MCQ items are easier
than SAQ items in L2 listening. Careful thought, however, must be given
to MCQ item construction due to the amount of reading that may be
involved and the impact this can have on the test taker who is trying to
process the input and confirm or eliminate distracters at the same time.
In light of this, it is recommended that MCQ options should be as short
as possible, preferably only half a line at most (see Chapter 5, Task 5.8).
A decision also needs to be taken as to whether the item should have
three or four options. Recent research (Harding 2011; Shizuka et al. 2006;
Lee and Winke 2013) suggests that, given the demands upon the listener
and the minimal differences in discrimination, a three-option item
(ABC) is optimal in MCQ tasks. Haladyna and Rodriguez (2013: 66)
add that for average and stronger test takers, the three-option MCQ is
more efficient but for the weaker test takers four or five should be used
on the grounds that they are more likely to employ guessing strategies.
From a practical point of view, three-option MCQ items also take less
time to construct and can save time during the test administration (Lee
and Winke 2013) thus possibly allowing for other items to be added,
depending on the overall amount of time allocated to the listening test,
and thereby providing more evidence of the test takers listening ability
(see Haladyna and Rodriguez 2013: 66).
Whether you choose to use three or four options, they all need to
be as attractive and plausible as possible to limit successful test-taking
strategies. This is particularly true, however, where only three options are
used, as being able to easily dismiss one of these options will provide the
test taker with a 50:50 chance of answering the item correctly through
guessing. Options that are ridiculous in content, making them easy to
eliminate, and those which are in any way tricky, must be avoided
(Haladyna and Rodriguez 2013: 62).
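A quick back-of-the-envelope calculation (illustrative only) shows how the odds of answering correctly by guessing change once one option can be dismissed out of hand:

def guess_probability(options, eliminated):
    # Chance of guessing the key once the implausible options have been dismissed
    return 1 / (options - eliminated)

print(guess_probability(3, 1))   # three options, one dismissed -> 0.5 (the 50:50 chance noted above)
print(guess_probability(4, 1))   # four options, one dismissed -> about 0.33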
Where MCQs are used to measure MISD, the input of the sound file
needs to be detailed enough to produce a sufficient number of viable
options. Sound files of a discursive nature, such as those where two or
three people are putting forward different arguments, where someone is
being interviewed or where one person is explaining different opinions
held by a number of other people, lend themselves to MCQ items.
As with MM tasks, pictures are particularly useful at the lower end of
the ability range; for example, the Cambridge ESOL suite uses MCQ tasks
with pictures at KET and PET. Using a set of four pictures, test takers
could be asked to match the correct picture to the content of the sound
file (see Into Europe Assessing Listening: Task 13 for an example of this
type); or there could be multiple sets of related pictures, based on what the
speaker is talking about or describing, and the test taker must choose the
correct answer to each question in turn. Field (2013: 134-5) points out
that this approach might be particularly useful where test takers are from
L1 contexts which do not use the Western European alphabet.
The decision regarding the optimal number of items in a task (or test) should
have been made at the test specifications stage (see 2.5.2.3). During task development,
it is important that the test developer complies with the minimum and
maximum number of items per task unless there are good reasons for
reviewing this decision before the task goes into the trial. For example,
where the textmap results allow for one or two extra items in a task, it
may be useful to include these at the trial stage; any items with weak sta-
tistics can then be dropped after checking for any newly created gaps in
the sound file content.
Given the number of demands placed upon a test taker during a listening
test (listening, reading and sometimes writing), it is crucial that the task
layout be as clear and as listener-friendly as possible. Where a task needs
two pages, these must be placed opposite each other in the test booklet
to avoid page turning. In addition, there should be ample space for the
test taker to write his/her answer in a SAQ task and the MCQ options
should be spread out sufficiently well for the test taker to be able to see
them clearly. In MM tasks where test takers are required to match sen-
tence beginnings and endings, it is strongly recommended that the two
tables are in the same position on opposing pages so that the test taker
simply needs to read across from one to the other (see Chapter 5, Task 5.3
'A Diplomat Speaks' for an example of this).
should be awarded. Where doubt exists and the test is a high stakes one,
another colleague should be asked for their opinion. If no one is avail-
able, look through the rest of the test takers answers to see if this can help
you to determine whether the test taker should be given the benefit of the
doubt or not.
It is strongly recommended that half-marks are not used; experience
shows that these tend to be used in an inconsistent (and therefore unre-
liable) way across different markers. In addition, items that carry more
than one mark often only serve to artificially inflate the gap between the
stronger and the weaker test takers. Where a particular aspect of listen-
ing is felt to be more important (for whatever reason), then it is better
to include more items targeting that type of listening behaviour than
to award more than one mark to an item (Ebel 1979). However, you
should also be aware of redundancy and construct over-representation if
too many items target the same construct.
4.4 Guidelines for developing listening items
Developing a set of item writing guidelines which test developers can use
as the basis for task development work is crucial for a number of reasons.
Firstly, they help to ensure that the items conform to the test specifica-
tions. Secondly, guidelines should help to minimise any reliability issues
that might creep in due to the inclusion of inappropriate wording in
the instructions. Thirdly, they should encourage all members of the test
development team to work in the same way. Fourthly, they act as a check-
list to refer to during peer review (see 4.5).
Guidelines need to address issues related to the sound file, the instruc-
tions including the use of the example and picture (if used), task devel-
opment, the test method and the grading procedure. Based on past
experience of working with task development teams, recommendations
regarding how each of these issues can best be dealt with are presented
below.
1. Use authentic sound files. These could be ones which have been
downloaded from the internet (check copyright permission) or ones
which you have created yourself. For example, an interview of some-
one talking about the kind of books they like to read (see Task 5.1,
Chapter 5).
2. The length of the sound file must be within the test specification
parameters.
3. The topic should be accessible in terms of cognitive maturity, age and
gender and should be something the target test population can relate
to.
4. The sound file should exhibit normally occurring oral features (see
1.4) in keeping with the input type (for example, speech versus
conversation).
5. The speed of delivery must be commensurate with the targeted level
of difficulty and conform to the test specifications.
6. Accents should be appropriate in terms of range, gender and age.
7. The number of voices should be in keeping with the difficulty level
being targeted. (The more voices there are, the more difficult a sound
file usually becomes.) (See Field 2013: 116.)
8. At least some sound files should have background noise to replicate
what listeners have to deal with in many real-life listening contexts.
Such background noise should be supportive and not disruptive (see
Task 5.8in Chapter5).
9. Sound files must be of good quality that will replicate well in the
target test situation (acoustics).
10. Where phone-ins form part of the sound file, ensure that the audibil-
ity level is sufficiently clear as the volume can often differ at those
points.
11. Check that the sound file does not finish abruptly, for example in the
middle of a sentence, as test takers might think there is something
wrong with the recording. Instead edit the last few words of the
sound file so that they fade out naturally.
You are going to listen to … While listening, match the beginnings of the
sentences (1-7) with the sentence endings (A-J). There are two sentence end-
ings that you should not use. Write …
HF_Earthquake_in_Peru_MCQ_v1
18. The answers to the items must be in the order in which they appear
in the sound file otherwise they are likely to impose a heavy load on
the test takers' memory (see Buck 2001: 138; Field 2013: 133-4).
19. Make sure there is sufficient redundancy in the sound file between
two consecutive items so the test taker has time to process the input
and complete his/her answer (ibid.). According to Field (2013: 89)
much listening is retroactive, with many words not being accurately
identified until as late as three words after their offset.
20. Avoid using referents (personal pronouns, demonstratives) in test
items. For example, Where did he go on Monday? should be writ-
ten as Where did John go on Monday? If John appears throughout
the sound file and is the only male voice/male person referred to in
the sound file, he can be used after the initial question.
21. Make sure the content of the options does not overlap.
22. Word the stem positively; avoid the use of negatives in both the stem
and the options as this has a negative effect on students (Haladyna
and Rodriguez 2013: 26, 103).
23. Avoid humour in items as it detracts from the purpose of the test
(ibid.: 107).
24. All tasks should include a key, which should appear on the final page
of the task separated from the rest of the task so as not to influence
those involved in peer review (see 4.5 below). It should not appear
within the task.
25. Check that the key is correct and that any final changes made to the
task (distracter order, for example) are reflected in the final version of
the key.
This is confusing for the test taker and would require two sets of
instructions.
8. Where the item is targeting a main idea, test takers should be required
to write more than just one word. (One word is not usually sufficient
to test a main idea though it occasionally can do at a higher level of
difficulty and/or where the targeted answer is based on an abstract
concept.)
9. Ensure the items do not require the test takers to use the same
answer more than once as this might lead to confusion (good test tak-
ers are likely to reject this possibility) and may result in a lack of face
validity.
1. Check that there is only one correct answer unless the task allows test
takers to use the same option more than once in the task.
2. In order to minimise the use of syntactical, grammatical and semantic
knowledge in putting sentence beginnings and endings together, start
all sentence endings with the same part of speech. Where this is not
possible, use two parts of speech.
3. Make sure that the combination(s) of sentence beginnings and end-
ings can be processed while listening, in other words, they are not too
long.
4. At the trial stage it is useful to include two distracters just in case one
of them does not work (seeChapter5, Task 5.3). One of these can be
subsequently dropped if necessary. Where a task contains only a few
items (under five), one distracter may be sufficient.
5. Make sure that the wording of the options has been paraphrased so
that the test takers cannot simply match the words with those on the
sound file.
6. The distracters should reflect the same construct as the real options.
In the above case, the word Different could be moved into the MCQ
stem.
8. Check that there is only one correct answer.
9. Where figures, times, dates and so on are used, put them in logical or
numerical order. For example:
A 1978
B 1983
C 1987
D 1993
Task 5.3 A Diplomat Speaks does this for the MM test method, and
Task 5.6 Oxfam Walk for the SAQ one.
In order to be able to give constructive feedback, the reviewer must wear the reviewer hat and no
other.
Wherever possible (and admittedly this is not always the case), the
feedback is likely to be even more useful if the reviewer is someone who
has not taken part in the textmapping procedure. Where the latter is
the case, unless there has been some time between the two events, the
reviewer may well remember certain aspects of the sound file and this
can influence his/her feedback on the task. For example, the items might
seem easier, the answers more obvious, because s/he remembers parts of
the sound file.
In addition to peer reviewers being able to provide constructive feed-
back, test developers have to be able to accept it and to acknowledge that
sometimes their task is not going to work and that it needs to be dropped.
For the sake of everyone involved in test development, it is important
that this aspect of task development is aired and embraced from the very
beginning.
d. Is there more than one answer? If the task is SAQ and there is
more than one answer, check whether the answers relate to the
same concept or to two separate ones. If the latter, add a note, if
the former, ask the test developer whether your alternative sugges-
tion would be acceptable.
e. Can all the questions be answered based on the sound file?
f. Do the distracters work? That is, does your eye engage with them
or not even grace them with a blink? If the latter, you need to
leave a comment.
g. Is there any overlap in terms of content between the items? For
example, do two of the items have the same answer?
h. Does the answer to one item help with the answer to another
item?
i. Do any of the items target something other than the construct
defined in the TI? If so, check the textmap table to see what the
test developer meant to target.
j. Do any of the items require the test takers to understand vocabu-
lary or expressions above the targeted level in order to answer the
item correctly?
k. Can the answer be written in the number of words allowed by the
task (SAQ)?
l. Is the test method the most appropriate one given the contents of
the sound file and the targeted construct?
12. Now do the task under the same test conditions as the test taker as
far as possible. If the instructions say the sound file will be played
twice, then play it twice even if you do not need to hear it twice. Give
yourself the same amount of time as the test takers will have to read
and then complete the questions. If the recording should be listened
to twice, mark the items in such a way that the test developer can see
which ones you answered on the first listening and which on the
second. By doing this you provide useful insights to the test devel-
oper on the differing difficulty levels of the items or the related part
of the sound file.
13. Do not stop the sound file while doing 1-10 above; simply make
quick notes on the task that you can later complete. (After a while
this will become second nature and you will do it much more
quickly.)
14. Once you have finished completing the items and your comments,
check the answers you have given against the key the test developer
has provided. (This should be on a separate page so you are not influ-
enced while completing the task. The answers must not be marked
in the task.)
15. Where any differences are found between the key and what you have
written/chosen, add a note. If your answer is not in the list (SAQ
tasks), or you have chosen another option (MCQ/MM), ask the test
developer whether s/he would accept it or not.
16. If you could not answer an item, tell the test developer, including the
reason if known.
17. Where you find that there is insufficient time to complete an item,
check the Time column in the textmap table, which should be
located at the end of the task. If the time appears to be sufficient, try
to deduce why the item was problematic and mention this in your
feedback.
18. Look through the textmap table results to ensure that what is there
has been targeted in the items and that all points relate to the con-
struct defined in the TI. Add comments as necessary.
19. Finally, taking all the feedback into consideration, decide whether
the test developer should be encouraged to move on to the next ver-
sion of the task or not. If not, summarise your reasoning so as to help
the test developer as much as possible with his/her future task
development.
20. Once your comments are complete, add your initials to the file
name, for example, HF_Earthquake_in_Peru_MCQ_v1_RG,
and return the task to the test developer.
21. If you feel that in light of doing the task any of your comments
might impact on the test specifications or the item writing guide-
lines, make sure this information is passed to the person responsible
for this aspect of task development so that the documents can be
reviewed and/or updated as necessary.
4.5.2 Revision
On receiving feedback, the test developer should read through all the
comments to get a general idea of what issues have been raised. Then if
the task has been recommended to move forward to the next version, the
test developer should work through each comment, making changes as
necessary. To help the reviewer, it is better if the test developer puts any
new wording or comments in a different colour. This should help speed
up the review process.
Where a test developer disagrees with something the reviewer has said,
a reason must be provided. For example, if the test developer feels that
an answer suggested by the reviewer in an SAQ item is not acceptable, a
reason must be given. If something the reviewer has written is not clear,
the test developer should ask for further explanation or clarification.
Comments should not be left unanswered; this only leads to lost time, as
the reviewer will need to post the comment again on the next version of
the task if s/he sees it has not been responded to.
Once the revisions are complete, the version number and date in the
TI should be changed and the reviewers initials removed from the file
name so that it appears as follows: HF_Earthquake_in_Peru_MCQ_v2.
The task should then be re-posted to the same reviewer for further
feedback.
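The naming convention above also lends itself to a simple check or light automation; the sketch below (an illustration of the convention, not a tool described in the book) strips the reviewer's initials and moves the file name on to the next version:

import re

def next_version(filename):
    # e.g. 'HF_Earthquake_in_Peru_MCQ_v1_RG' -> 'HF_Earthquake_in_Peru_MCQ_v2'
    match = re.match(r"(?P<stem>.+_v)(?P<version>\d+)(?:_[A-Z]+)?$", filename)
    if not match:
        raise ValueError("Unexpected task file name: " + filename)
    return match["stem"] + str(int(match["version"]) + 1)

print(next_version("HF_Earthquake_in_Peru_MCQ_v1_RG"))   # HF_Earthquake_in_Peru_MCQ_v2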
4.6 Summary
Developing good tasks takes time, but it is time well spent if it results in
tasks that provide a reliable and valid means of measuring the test takers'
ability. In addition, the procedures outlined above should increase the
test developer's own expertise and ability to produce good listening tasks.
DLT Bibliography
Buck, G. (1991). The testing of listening comprehension: An introspective study. Language Testing, 8(1), 67-91.
5
What makes a good listening task?

Introduction
In choosing the tasks that are discussed in this chapter, I had a number of
objectives in mind. Firstly, I wanted to include tasks that focused on dif-
ferent types of listening behaviour; secondly, I looked for tasks that could
exemplify different test methods (multiple matching, short answer ques-
tions and multiple choice questions); and thirdly, I selected tasks which
targeted a range of different ability levels. In addition to these consider-
ations about the tasks themselves, I wanted to include a range of sound
files that reflected different discourse types, topics, target audiences and
purposes. The final selection will hopefully provide some useful examples
of what works well and what can be improved upon.
In the case of each task the test population, instructions and sound file
are described and then the task presented. This is followed by a discussion
of each task in terms of the type of listening behaviour the test developer
is hoping to measure, the suitability of the sound file in terms of reflect-
ing a real-world context, the test method and the layout of the task in
terms of facilitating the listeners responses.
The keys for all the tasks are located at the end of this chapter and the
relevant sound files can be found on the Palgrave Macmillan website. It
should be noted that sometimes the instructions are present at the begin-
ning of the sound file and sometimes they are not.
To receive the maximum benefit from this chapter, I strongly recommend
you actually do the tasks as a test taker under the same conditions, that is,
if the instructions say the recording will be played twice, then listen twice.
Read the task instructions carefully to see what you should do and study
the example and the items in the time provided. I find it very helpful to use
different colours for the answers I give during the first and second times that
I listen to the sound file as they provide an indicator of those items which
might be more difficult or which might be working in a different way than
had been anticipated by the test developer. Above all, you should remember,
firstly, that there is no such thing as a perfect task and, secondly, that what you
as a reader and/or teacher may feel is problematic quite often goes happily
unnoticed by the test taker and is not an issue in the resulting statistics!
This first multiple matching task was part of a battery of tasks which were
developed for adult university students who required a pass at either B1
or B2 in order to graduate from university. Time was provided before and
after the task for the test takers to familiarise themselves with what was
required and to complete their answers. The instructions and the task
itself appear in Figure 5.1.
Listen to Jane answering questions about her reading habits. First you have
45 seconds to study the questions. Then you will hear the recording twice.
Choose the correct answer (1-7) for each question (A-I). There is one extra
question that you do not need to use. There is an example (0) at the beginning.
At the end of the second recording, you will have 45 seconds to finalise your
answers. Start studying the questions now.
5 What makes agood listening task? 117
[Figure 5.1: 'Question / Answer' matching table linking the questions (A-I) to Jane's recorded answers (1-7)]
5.1.2 Task
The items were aimed at measuring the test taker's ability to synthesise
the ideas presented in each response that Jane gave in order to determine
the overall idea and then link this with the relevant question. For exam-
ple, in attempting to find the answer to question 1, the test taker needs
5.1.2.3 Layout
The two parts of the table are opposite each other so the test taker simply
has to look across to the options, select one and fill in the appropriate box.
This task was developed for use with 11-12 year old schoolchildren. The
test takers were provided with time to study the task before being asked
to listen twice to the sound file. Further time was allowed at the end of
the second listening for the test takers to complete their answers. The
instructions and the task itself appear in Figure 5.2.
Listen to the description of a school class. While listening match the chil-
dren (B-K) with their names (1-7) . There are more letters than you need.
There is an example at the beginning (0). You will hear the recording
twice.
At the end of the first recording you will have a pause of 10 seconds.
At the end of the second recording you will have 10 seconds to complete
your answers. You now have 10 seconds to look at the task.
[Figure 5.2: picture of a school class with the children labelled A-K]
0 Miss Sparks A
Q1 Ben
Q2 Mary
Q3 Judy
Q4 Linda
Q5 Susan
Q6 Michael
Q7 Sam
The speaker in this sound file has obviously been asked to describe the
students in the picture, which does not reflect real-life listening in the
same way as the previous task and therefore lacks authenticity. In terms
of the content, however, it is something that the target test population
would be able to relate to. The sound file is approximately 50 seconds
long and consists of just one female voice talking in a reasonably natural
and measured way. The test developer put the combined sound file and
items at CEFR A2.
5.2.2 Task
According to the test developer, the items were aimed at measuring the
test taker's ability to identify specific information (the names of the chil-
dren) and important details (things which help to differentiate the chil-
dren from one another such as descriptions of their hair, their clothes and
so on).
Let's take a look to see how well this works. The first child to be
described is Susan. The speaker mentions that she has long dark hair
and a striped pullover. The next child to be described is Ben; however,
in order to answer this item correctly the test taker has to rely on the
child's location (Ben is next to Susan). The item is, therefore, arguably
interdependent: that is, if the test taker did not identify Susan cor-
rectly, s/he may be in danger of not identifying Ben correctly either.
Understanding important details helps with the next child, Linda, who
is described as wearing glasses. A further piece of information (though
an idea) helps to confirm her identity (she knows the answer, her hand is
up). The next child, Sam, can be identified through a series of important
details such as black curly hair and black jacket. (Further information
is also provided regarding his location, though again like Ben, the extent
to which this helps depends on whether the test taker has managed to
identify Linda correctly.)
It seems that although some of the items can be answered by under-
standing important details, others involve understanding ideas and there
is a degree of interdependency between some of the items. The intended
focus of the task could easily be tightened by focusing on the important
details of the children rather than on their location or what they are
doing. On the positive side, seven items are likely to provide a reason-
able picture of the test takers ability to identify specific information and
important details (once the task has been revised). Although the sound
file is relatively dense and some test takers might miss one of the names,
being able to hear the sound file again provides them with a second
chance. The sound file could also be made more authentic by building
in other natural oral features such as hesitation, repetition, pauses and
so on.
5.2.2.3 Layout
This task was used as part of a suite of tasks to assess the listening abil-
ity of career diplomats. The test takers were provided with time to study
the task before they heard the sound file, which was then played twice.
Further time was allowed at the end of the second listening for the test
takers to complete their answers. The task instructions are below while
the task itself appears in Figure 5.3.
You are going to listen to part of an interview with a diplomat. First you
will have one minute to study the task below, and then you will hear the
recording twice. While listening, match the beginnings of the sentences (1-7)
with the sentence endings (A-J). There are two sentence endings that you
should not use. Write your answers in the spaces provided. The first one (0)
has been done for you.
After the second listening, you will have one minute to check your
answers.
The sound file is an extract from an interview with the then Australian
Ambassador to Thailand and, as such, had content validity for the test tak-
ers as the topic covered issues related to their profession. It was approxi-
mately four minutes in length and consisted of two voices: the female
interviewer and the male ambassador. Both of the speakers have Australian
accents and talk in a rather measured way; the test developer estimated the
speed of delivery at approximately 170 words per minute. The lack of any
background noise was probably due to the fact that the interview took place
in a studio. The test developer put the sound file at around CEFR B2/B2+.
5.3.2 Task
The test developer asked colleagues to textmap the sound file for main ideas
and supporting details (MISD). The results were paraphrased to minimise
the possibility of simple recognition, and then the textmapped MISD were
split into two parts, beginnings and endings, as shown in Figure 5.3.
Let us look at a couple of items to see the extent to which the test
developer was successful in requiring test takers to understand MISD,
starting with the example, which was also based on a main idea that came
out of the textmap.
Its purpose, as discussed in 4.2, is not only to show the test takers what
they have to do in order to complete the other items, but also to provide
them with an idea of the type of listening behaviour they should employ
and the level of difficulty they should expect to find in the rest of the task.
The test taker needs to find some information in the sound file which
means something similar to the sentence beginning 0 The relationship
with Thailand and then match what comes next in the sound file with
one of the options, in this case F, which is marked as the answer to the
example. The ambassador says:
I think the best way to describe whats happened over that period in Australia-
Thailand relations is a relationship of quiet achievements, that we've actu-
ally seen that relationship grow in a steady way over that entire period
Student numbers have grown from just a few thousand students in the 1990s
to over 20,000 students these days.
Having identified the appropriate part of the sound file, the test taker
must then find a suitable ending from within the options A to J. Part
of sentence ending I refers to growth: have increased hugely; moreover
the time frame mentioned by the ambassador matches the second part
of sentence ending I: over the past two decades. Therefore the correct
answer is I.
Question 4 states: Thai businesses are now putting money ____,
indicating to the test takers that they need to identify some reference in
the sound file which relates to Thai business and putting money. In the
interview, the ambassador says:
for the last couple of years investment has been the story, especially Thai
investment in Australia, which has gone from a very low base to be really sub-
stantial, where you have major Thai investments in our energy sector, in our
agri-business and in our tourism industries as well.
5.3.2.3 Layout
Experience has shown that placing the two tables containing the sentence
beginnings and endings opposite each other minimises the amount of
work the test takers' eyes have to undertake in order to complete the task.
This is important given the various constraints of the listening task. Test
takers were asked to enter their answers directly into the table to reduce
any errors that might occur in transferring them to a separate answer
sheet. (It is acknowledged that this is not always a practical option in
large-scale testing.)
This first short answer question task was part of a bank of tasks given on an
annual basis to 11 to 12 year old schoolchildren to determine what CEFR
level they had reached. Time was provided before and after the task for
the test takers to familiarise themselves with what the task required and to
complete their answers. The instructions and the task appear in Figure 5.4.
Listen to a girl talking about her holidays. While listening answer the ques-
tions below in 1 to 5 words or numbers. There is an example at the beginning
(0). You will hear the recording twice. You will have 10 seconds at the end
of the first recording and 10 seconds at the end of the task to complete your
answers. You now have 20 seconds to look at the task.
0 When did the girl go on holiday? winter
The sound file lasts just under one minute and is based on an 11 year
old girls description of her winter holidays. The delivery sounds rather
studied, suggesting it was based on either a written text or a set of scripted
bullet points. The language itself, however, seems reasonably natural and
appropriate for an 11 year old. The test developer felt the sound file was
suitable for assessing CEFR A2 listening ability and that the test takers
would find the topic accessible.
5.4.2 Task
According to the test developer, the items were aimed at measuring the
test taker's ability to identify specific information and important details
(SIID) based on the results of the textmapping exercise. For example, in
question 1, the test taker has to focus on who else went with the speaker
and her parents (answer: her brother, an important detail); in question 2,
the test taker must listen out for a kind of transport (answer: car, an impor-
tant detail); in question 3, the test taker must identify the length of time
the journey took (answer: eight hours, specific information) and so on.
The SAQ format lends itself well to items that target SIID, as the num-
ber of possible answers is limited (unlike MISD items; see Task 6
below). In general, this makes it easier to mark and usually easier for
the test taker to know what type of answer is required. Another advantage
of using the SAQ format here is that the answers require little manipula-
tion of language (limited construct irrelevant variance).
The example indicates to the test taker how much language s/he needs
to produce and the type of information being targeted. This should help
them to have a clear picture of what they need to do in the rest of the task.
However, the answer to the example does appear in the first sentence of
the sound file, giving the test taker little time to become accustomed to
the speakers voice and topic. This is not ideal. It is recommended that the
first utterance in a sound file be left intact and that the example be based
on the second or third one, depending on the results of the textmapping
exercise. With short sound files, however, this sometimes proves difficult
and arguably it is better to have an example based on the first utterance
than to have no example at all.
The wording of the items is not difficult and appears to match the test
developer's aim of targeting A2. Six items (each answer to question 4 was
awarded 1 mark) provide a reasonable idea of the test taker's ability to
identify SIID (mainly important details here). The pace of the speaker
and the distribution of the items throughout the sound file allow suf-
ficient time for the test taker to complete each answer.
5.4.2.3 Layout
The layout of the task encourages the use of short answers, as there is
insufficient room for a sentence to be inserted. (Experience shows that
even when a maximum of 4 words is mentioned in the instructions, some
test takers still feel they should write a complete sentence.) The test takers
are required to write their answers directly opposite the questions; this
should help when simultaneously processing the sound file.
This SAQ task comes from a range of tasks aimed at assessing the English
language ability of 14 to 15 year old students. Test takers simply had to
complete one question based on the sound file following the instructions
given in Figure 5.5 below:
5.5.2 Task
The item requires the test takers to determine the reason why the man,
Jim, is making the call. In order to do this, the test takers need to syn-
thesise a number of ideas: firstly, that the caller wants to speak to Mike,
who is out at the time of the call; secondly, that he is speaking to Mike's
sister who is willing to take a message; thirdly, that Jim and Mike were
scheduled to meet at 8 p.m.; and fourthly, that Jim is not feeling well so
he will not be able to make the appointment. (We also learn that Mike's
sister will pass the message on, although this is not needed to complete
the task.) The test taker needs to combine the information these ideas
represent and produce the overall idea in order to answer the question.
The answer should reflect something along the lines of 'he can't come
tonight', 'he's not feeling well', or 'he can't meet Mike'.
The short answer question format works well in this type of task, in
which the gist or overall idea is being targeted, as it requires the test
taker to synthesise the ideas him/herself rather than simply being able
to select one from three or four options. The number of words required
is limited (they are told they can use up to seven words, but it can be
done within four or five) so it should not be too taxing; nor are the words
particularly difficult to produce which should minimise any construct
irrelevant variance which writing might bring to the task.
There is no example as there is only one item; this is usually the case
with single gist items (as opposed to a multiple matching gist task such as
that discussed in Task 1). Where there is any doubt as to the test takers'
level of familiarity with such items, a sample task should be made available.
5.5.2.3 Layout
The layout of the task is very simple and should cause no particular
problems.
This SAQ task comes from a bank of tasks aimed at assessing the ability
of final year school students. The instructions and task can be found in
Figure 5.6 below.
Oxfam Walk
0 Rosie works for the charity Oxfam as the ___. marketing coordinator
Q8 To find out about the job offer get in touch with ___.
5.6.2 Task
The task requires the test takers to identify some of the specific infor-
mation and important details in the sound file. The sample question
provides the test takers with the kind of important detail they should be
listening out for (in this case, the role Rosie fulfils at Oxfam) in order to
complete the statements in questions 1-9. Other items, such as 4, 7 and
8, also target important details, while the rest focus on specific informa-
tion. Question 8 could be answered with either the name Simon Watkins
(specific information) or his role, current chairman.
Although the test developer successfully identified SIID in the sound
file, the fact that the test takers are allowed to listen twice suggests that
they will employ careful listening as opposed to selective listening, and
that the level of difficulty (despite the speed of delivery) may be lower
than B2.
At first sight, the short answer question format seems well suited to this
task in that the test taker simply needs to complete the statements with
numbers, names, figures and so on. In reality, the trial showed that the
test takers came up with myriad ways of completing the statements,
making the final key of acceptable answers (not all produced in this chap-
ter's key for reasons of space) incredibly long. This was surprising as it
was expected that the answers the test takers would produce for the SIID
items would avoid the multiple answer situation often faced by MISD
questions.
5.6.2.3 Layout
As with Task 4 above, the layout of the task encourages the use of short
answers as there is insufficient room for a sentence to be written in the
space provided. The need for a short answer is also stressed in the instruc-
tions (a maximum of four words) and helps to minimise any construct
irrelevant variance.
The instructions provide a clear context for the sound file, which is based
on a young man explaining to someone how to find the hospital. The
directions given last just under 20 seconds. The test developer felt the
sound file was suitable for assessing CEFR A2 listening ability and the
topic was felt to be something 14-15 year olds would be able to relate to.
Listen to a man describing the way to the hospital. While listening, tick
the correct map (a, b, c or d). You will hear the recording twice.
You will have 10 seconds at the end of the recording to complete your
answer.
You now have 20 seconds to look at the maps.
[Four maps (a-d), each showing Kings Road and the hospital in a different position]
5.7.2 Task
The item was aimed at measuring the test taker's ability to grasp the
overall meaning of the directions based on identifying and understanding
the relevant SIID. For example, the test taker needed to understand such
details as 'straight on', 'turn right', 'roundabout', 'left', 'second building',
'on right' and specific information such as 'Kings Road'.
The multiple choice question format, in the shape of a map, lends itself
well to instructions such as these, as the maps display the necessary informa-
tion in a non-verbal way. The test taker simply has to match what s/he
is hearing to the visual display. The task is very simple to mark. There is
obviously no example as there is only one item; where there is any doubt
about test takers' familiarity with this type of task, a sample exercise
should be made available to them prior to the live test administration.
5.7.2.3 Layout
The layout of the task is compact and it is possible to look at all four
options simultaneously, although a little more space between the four
maps might have helped. The box, which the test taker needs to tick,
is quite small and may take a few seconds to locate. Putting the boxes
outside the maps might have made them easier to see.
Tourism in Paris
[Figure: multiple choice task with an example (0) and Questions 1-7, each with options A-D; answer row: 0 Q1 Q2 Q3 Q4 Q5 Q6 Q7]
The sound file is an authentic interview with Elliott, who works for the
Paris tourist office. It takes place outside, which is indicated by appropri-
ate supportive background noise. Both the questions and the responses
in the interview are delivered quite naturally and in an engaging way.
The length of the sound file is just under three minutes and the speed of
delivery was estimated to be approximately 150 wpm. The test developer
put the task as a whole at CEFR B1.
5.8.2 Task
The task requires the test taker to understand the main ideas and sup-
porting details presented in the sound file. For example, at the beginning
of the sound file Elliott is asked what there is to do in Paris. He answers
that this depends on how many days the tourist is going to spend in the
city. This idea has been transformed and paraphrased into question 1
(When choosing activities in Paris you should think about _____. The cor-
rect answer is A, the duration of your visit).
The second question attempts to target the first venue that Elliott rec-
ommends and also his reason for doing so; in other words, Montmartre,
so as to get a nice view of the city. The test developer manages to avoid
using the name of the place, which would cue the answer, but the stem
does presuppose that the test taker is aware that this is the first place
Elliott mentions. This also happens in question 5 (second area). This is
one of the challenges that test developers meet when trying to test the
main idea without signalling too precisely where the answer is located,
which could lead to test takers answering an item correctly through rec-
ognition rather than comprehension. Sometimes slips occur, as in the
example and question 6, where the words 'tourist office' appear in both
the sound file and the items.
Having said that, it is sometimes practically impossible to paraphrase
certain words without the results appearing engineered or being more
difficult than the original wording. Where a word occurs many times
in a sound file, it seems reasonable to use the original word if it proves
too difficult to paraphrase as arguably the test taker still has to be able to
identify the correct occurrence of the word(s) and use this to answer the
item concerned.
There are a total of seven items plus the example in the task which,
with a three-minute sound file, would suggest sufficient redundancy for
the test takers to complete and confirm their answers by the end of the
second listening. (The actual distribution of the items should of course be
checked at the textmapping stage see 3.5.)
The sound file is quite detailed in terms of ideas and information about
what people should do when visiting Paris and therefore lends itself to a
multiple-choice task. The options are reasonably short, thereby minimis-
ing the burden placed on the test takers as they listen to the sound file
and try to determine the correct answer. The distracters are not easily
dismissible and it is unlikely that the test taker will be able to eliminate
any before listening to the sound file.
5.8.2.3 Layout
The layout is neat and concise, and the space for writing the answers is
clearly indicated by the example in the table at the bottom of the task.
Boxes at the side of each item might have helped, rather than requiring
the test takers to transfer their answers to the table at the bottom of the task.
5.9 Summary
In this chapter you have worked through eight listening tasks reflecting
different behaviours, test methods, topics and types of sound file and read
the discussion concerning their advantages and disadvantages. Based on
6
How do we know if the listening task works?
Introduction
If you have followed the steps outlined in Developing Listening Tests so
far, your listening tasks should have gone through a number of carefully
applied stages from defining the construct and the performance condi-
tions in the test specifications (Chapter 2), to textmapping (Chapter 3),
and task development, peer review and revision (Chapter 4). Even so, it
cannot be guaranteed that the final product will be without error. To be
certain that an item/task is likely to contribute positively to a valid and
reliable test score, it is necessary to subject it to a trial on a representative
test population (Green 2013; Buck 2009). The resulting data should then
be analysed to determine whether they have good psychometric proper-
ties. In addition, where high-stakes tests are involved, the task(s) should
then be subjected to an external review (see Chapter 7).
Some test development teams believe that it is impossible to trial tasks
because of security concerns. While this is indeed an issue that must be
considered very carefully, particularly in high-stakes tests, a decision not to
trial can have major negative effects on a test takers performance and on
the confidence level which stakeholders should have in the validity and reli-
ability of the resulting test scores. Experience shows that trialling ahead of
when the tasks will actually be needed (see 6.2 below) helps to minimise any
perceived security threats, as does trialling multiple tasks simultaneously,
so that it is unclear as to which tasks will finally be presented in any live
administration. Of course, the latter presupposes that there are a number of
test developers working together and that resources are available for a large-
scale trial. In the school context, by contrast, it is recommended that tasks
be trialled on parallel classes or in other schools in order to gather informa-
tion about how the tasks perform. Without trials, it is impossible to know
whether or not an item or task will add to the validity and reliability of the
test score. This is something that all decision makers should be aware of.
To summarise, trialling in general allows us to ascertain if the tasks
perform as expected and whether they are likely to contribute to a valid
and reliable test score. Many things can impact on the success of an item
or task and each of these can be examined through field trials.
First of all, we need to check that the task instructions (sometimes referred
to as rubrics) are doing their job. If these have not been carefully written
using language that is at the appropriate level (equal to or lower than that
which is being targeted in the task) and avoiding metalanguage, the test
takers might not understand what is expected of them. So although a
test taker might understand the contents of the sound file, s/he might be
unable to complete the task.
Instructions, like the task itself, need to be trialled and standardised
so that they do not influence a test takers performance. Some examina-
tion boards use the test takers mother tongue in the instructions. This is
particularly appropriate when developing tests for children, on the basis
that the instructions should not be part of the test. However, care must
obviously be taken in multilingual societies that using the mother tongue
does not disadvantage any test takers.
One way of finding out whether the instructions, including the exam-
ple, have fulfilled their role is by administering a feedback questionnaire
6 How do weknow if thelistening task works? 147
(see 6.1.9) to the test takers as soon as they have finished the trial and
including a question on this issue. Remember that test taker anxiety is likely
to be reduced if the instructions on the sound file match those which
appear at the beginning of the task, as the listener will be able to follow
what is being said with the aid of the written words.
The amount of time test takers need to study the task prior to listening to
the recording, and the amount of time they should have at the end of each
task to complete their answers, is usually included in the task instructions.
When a new test is developed, it is necessary to trial the amount of time
provided to make sure it is neither too short nor too long. Where the for-
mer is the case, the reliability of the test scores can be affected if test takers
simply have insufficient time to read through the items or to complete
their answers; in the latter scenario, it is likely to lead to increased test
anxiety or may encourage the test takers to talk to each other.
Useful evidence can be gathered by the test administrators during the
trial as to whether the test takers appear to be ready when the recording
starts and whether they had sufficient time to complete the questions.
Further information can also be gathered by means of a test taker feed-
back questionnaire. Test developers should not be reluctant to change the
amount of time provided during the trial based on the evidence gathered;
this is one of the reasons for field trialling.
Trial data also reveal insights into how different test methods work. For
example, they provide evidence of which test type the test takers appear to
perform better on and which they find more challenging. They also reveal
which methods are discriminating more strongly (see 6.3.2.2). Where
a test method is unfamiliar to the test takers, this may be reflected in
lower scores and/or in an increased number of no responses. Hence the
importance of including a range of different test methods in the test so as
to minimise any test method effect which might influence the test takers
performance.
The key to short answer (SAQ) tasks, in particular, benefits from being
trialled, as it is impossible for the test developer to predict all valid answers
in this type of task. This is especially the case when main ideas and sup-
porting details are being targeted, as there will be a number of ways that
test takers can respond. The field trial also allows us to witness the extent
to which the test takers' responses reflect what the test developer was
hoping to target in the item; experience has shown that sometimes test
takers produce a totally different answer from that expected, which may
cast doubt on the construct validity of the items. For example, if the
item were designed to test a main idea but some test takers managed to
produce a valid answer using specific information or important details,
it would suggest that the wording of the item had not been successful in
targeting the right type of listening behaviour, possibly due to a lack of
control in the wording. Fortunately, this is one of the advantages of the
trial situation; the test developer has the chance to review and revise the
item, and then to trial it once more.
Where a live test administration involves a large number of markers
working in separate locations, it is useful to include not only an extended
key based on the results of the field trial but also a list of unacceptable
answers. This helps to save valuable time as well as reducing possible
threats to reliability. Deciding on new answers to the key is often a prob-
lem when central marking is not possible. Where the test is a high-stakes
one, thought might be given to the use of a hotline where advice can
be given by a small panel of experts (see Green and Spoettl 2009) who
have been involved in the task development cycle and who have access
to appropriate databases, such as thesauruses, dictionaries, and language
corpora. Where possible such a panel should include a native speaker.
Based on the data collected from the field trial, it is also possible to check
for any type of bias which the items might have in terms of gender, test
taker location, first language and so on. For example, the data resulting
from a task based on a topic which might advantage female students over
male ones can be checked to ascertain whether this is indeed the case.
Items that are found to suffer from any kind of bias should be dropped
from the task as they suggest an unfair playing field and bring into ques-
tion the validity and reliability of the test score (see also 7.5).
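As a first, informal check of this kind, before any formal DIF procedure, facility values can simply be compared across the groups of interest. The sketch below is a minimal illustration in Python using pandas; the data, column names and the gender variable are invented for the example, and this is not the specific bias analysis the author has in mind.

import pandas as pd

# Invented trial data: dichotomous item scores plus a background variable.
trial = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M", "F", "M"],
    "Q1":     [1,   1,   0,   1,   1,   0,   1,   0],
    "Q2":     [1,   0,   1,   1,   0,   1,   0,   1],
})

# Facility value (% correct) per item, broken down by group; large gaps
# between groups flag items for closer inspection (for example, with a
# formal DIF analysis) before any decision to revise or drop them.
by_group = trial.groupby("gender")[["Q1", "Q2"]].mean() * 100
print(by_group.round(1))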
All test takers (as well as other stakeholders) benefit from access to sample
tasks. Such tasks should provide as accurate a picture as possible of the
type of tasks the test takers will meet in the live test in terms of what is
being tested (construct), how it is being tested (method, layout, sound
file characteristics and so on) and how their performance will be assessed
(grading criteria). The tasks that become sample tasks must comply in all
these respects and have good psychometric properties (see6.3.3 below).
In order to ensure this is the case, they need to be trialled.
Sample tasks should not be selected from those tasks which have been
discarded for some reason; on the contrary, they should be good tasks
that will stand up to close scrutiny. This is sometimes seen as a sacrifice,
but it is one which is well worth it in terms of the likely increase in stake-
holders' confidence in the test. In addition to the sound file and the task
itself, a detailed key with justifications should be provided for each test
method as well as information about the targeted listening behaviour (see
also 7.4). It is important to publish as wide a range of sample tasks as pos-
sible so as to avoid any negative washback on the teaching situation, in
other words, to minimise any teaching to the test. It should also help to
prevent the test from becoming too predictable, for example, the test being
made up of an X + Y + Z task every year.
Another reason why field trials are useful is that they make it much easier
for the test developer to select tasks with good statistics, as well as posi-
tive test taker feedback, which can then be put forward to standard-setting
sessions (see 7.2). This qualitative and quantitative evidence reduces the
possibility of the tasks being rejected by the judges.
In order to ensure that the trial takes place under optimal conditions, it
is important to develop administration guidelines. This becomes even
more important when the trial takes place in a number of different ven-
ues. If tasks are administered in different ways (for example, if the time
provided to complete the tasks is inconsistent between locations, if the
instructions are changed, or if the recording is paused in one test venue
and not another), these differences will obviously impact on the confi-
dence one can have in the trial data. Therefore, even before the trial takes
place it is important to develop such guidelines and hold a test adminis-
tration workshop with the people who are going to deliver the trial to
make sure that the guidelines are clearly understood.
In developing test administration guidelines, a number of issues need
to be decided upon:
(See Dörnyei 2003; Haladyna and Rodriguez 2013 for further exam-
ples of feedback questionnaires.)
Even more useful insights into how test takers perceive the test can be
obtained if it is possible to link their opinions with their test performance.
This is not always possible due to anonymity. (See Green 2013, Chapter 5
for more details on how to analyse feedback questionnaire data.)
In a high-stakes test there are many stakeholders who would welcome fur-
ther insights into how the trialled listening tasks are perceived by the test
takers. These stakeholders include students, teachers, parents, school inspec-
tors, school heads, ministry officials, moderators, curriculum developers,
university teachers, teacher trainers, textbook writers and external judges
(standard setters), among others. Trialling makes it possible to share test
takers' perceptions with these interested parties through stakeholder meet-
ings which can, in turn, provide other useful insights for the test developers.
The analysis of the qualitative and quantitative data resulting from the tri-
alled tasks can help the test developers to reassess the test specifications in
an informed way and make changes where necessary. For example, the tri-
als may show that the amount of time allocated for reading the questions
prior to listening to the sound file was insufficient or that a particular test
method was less familiar than expected. In light of this feedback, these
time parameters can be reassessed and changes made to the test specifica-
tions and the decision regarding the use of the test method re-visited.
6.1.12 Summary
It should be clear from all the arguments given above that field trials are
immensely useful to the test developer. Without them s/he is, to a certain
extent, working blind, as s/he has no evidence that the tasks will work appro-
priately. Given the possible consequences of using test scores from untrialled
tasks, there is really no argument for not putting test tasks through field trials.
It is crucial that the test takers used in the trial be representative of the
test population to whom the tasks will ultimately be administered. For
obvious reasons, the test population which is used cannot be drawn from
the pool of actual test takers themselves, but the population should be
as close as possible in terms of factors such as ability level, age, regional
representation, L1(s), gender and so on. How can this be done? Let us
take, for example, a final school leaving examination situation. The best
way to obtain valid and reliable test data is to administer the field trial
in such a way that the test takers see it as a useful mock examination. In
such a scenario, the school leavers would be at approximately the same
stage in their schooling as the target test population. Having field trialled
the tasks on these school leavers, the successful tasks can then be kept and
used after two or three years when the test takers have already left school.
In order for this to happen, test development teams need to trial their
tasks at least one year in advance of the date they are actually needed and
preferably more on a range of school types, regions and locations.
As mentioned above, the trial should take place at roughly the same time of
year as the live test is to be administered so as to simulate similar conditions
in terms of knowledge gained. This is not always possible, of course, as the
period when the live tests are administered will be a very busy time for all
involved (test takers, teachers and schools). However, if there is too large a
gap between the date when the field trial is administered and that when the
live test is normally sat, this can have the effect of depressing the item results.
In other words, the tasks may seem more difficult than they actually are. In
such circumstances, the test developers would need to take this factor into
account when deciding on the suitability of the tasks' difficulty level, which
is obviously likely to be less reliable as it will involve second-guessing as to
how the tasks would have worked if the trial dates had been more optimal.
How large does the trial population need to be? The answer to this question
depends on how high-stakes the test is and how the test scores are going
to be used. If the test results are likely to have high consequential validity
(Messick 1989), for example, the loss of an air traffic controller's licence,
then clearly the larger and more representative the test population, the bet-
ter, as the test developer is likely to have more confidence in the results.
For many test developers, however, and especially for those who work with
second or foreign languages, large numbers are not always easy to find. The
minimum number of cases that might usefully be analysed is 30 but with so
small a number it is very difficult to generalise in a reliable way to a larger
test population. Having said that, it is better to trial a listening task on 30
test takers than none at all, and for many schoolteachers this is likely to be
the most they are able to find. At least with 30 test takers it will be possible
to see whether they have understood the instructions and the teacher should
be able to gain some feedback about the task itself. Where large test pop-
ulations and/or high-stakes tests are involved it is strongly recommended
that data from a minimum of 200 test takers be collected, and if the data
are to be analysed using modern test theory through such programmes as
Winsteps or Facets (Linacre 2016), then 300 test takers would be better as
the results are likely to be more stable and thus more generalisable.
of listening behaviour. This approach is also better for the test takers as a
way of minimising fatigue and possible boredom. Secondly, a selection of
test methods should be included so as to gather information on the dif-
ferent methods, to encourage interest as well as to minimise any possible
test method effect. Thirdly, the total number of tasks has to be carefully
thought through: too many and performance on the last one(s) may be
affected by test fatigue; too few and the trial becomes less economical.
The age and cognitive maturity of the test takers need to be factored into
this decision as well.
Fourthly, once the tasks have been identified, the order they appear in
the test booklet must be agreed upon. The convention is to start with the
(perceived) easier tasks and work towards the (perceived) more difficult
ones. This is also true with regards to the test methods. Those thought to
be more familiar and more accessible should come first, followed by those
which may be more challenging. For example, SAQ tasks are generally
seen as more challenging because the test takers are required to produce
language rather than just selecting one of the options on offer. Ideally,
putting tasks with the same test methods next to each other helps the test
taker save time, but this may not always be possible if the difficulty level
varies to a great extent. It is also important to take the topics into consid-
eration; having two or three tasks all focusing on one particular subject
area could have a negative washback effect on the test takers' interest
level.
Fifthly, the layout of the test booklet itself needs careful consider-
ation. As already mentioned in 4.2, it is good testing practice to use
standardised instructions; where a task requires two pages these should
face each other so that the test taker does not need to turn pages back and
forth while listening. The size and type of font also needs to be agreed
upon so that these can be standardised. Although colour would be attrac-
tive, few teams can afford this and so black and white tends to be the
norm. If pictures are used, then care must be taken that they are repro-
duced clearly.
Sixthly, as part of the test booklet preparation, it may be necessary to
produce multiple CDs or other types of media. The quality of these CDs
must be checked before being used.
One of the first issues which needs to be resolved when holding a field
trial is the actual location (for example, school, university, ministry) and
how suitable it is likely to be in terms of layout, acoustics, light, noise,
heat and so on. These aspects need to be checked by a responsible person
well in advance of the trial itself and changes made as necessary.
Secondly, administrators need to be clear about their responsibilities dur-
ing the trial. Ideally, they should be trained and provided with a set of
procedures to follow regarding invigilation well before the trial takes
place so that any issues can be resolved in advance.
Thirdly, if the test materials have to be sent to the testing venue, this
needs to be organised in a secure way: the materials need to be checked
by someone on arrival and then locked away until the day of the trial in
order to ensure the highest level of security. The equipment used for play-
ing the sound files must be checked and a back-up machine (and batteries
if necessary) made readily available just in case.
Fourthly, in high-stakes trials, the use of a seating plan showing test
taker numbers is to be recommended. This enables the test developer
to check the location of the test taker(s) in question if anything strange
emerges (for example, a number of tasks left completely blank) during
data analysis. Desks should be set at appropriate distances from each
other to discourage cheating; where two test takers have to sit at the same
desk (and this is the case in a number of countries), different versions of
the test paper must be used.
Fifthly, great care must be taken to ensure that no copies of the test
booklet or feedback questionnaire leave the testing room, and that no
notes have been made on any loose pieces of paper. Inevitably, there is
some risk that test takers will remember the topic of a particular sound
file. The risk should be minimal, however, provided the trial takes place
well in advance of the live test so that the test takers who took part in the
trial have already left the school, and also if a large number of tasks can be
trialled (particularly with high-stakes examinations) so that nobody can
predict which tasks will be selected for a future live test.
Finally, all mobile phones should be left outside the testing room. This
is obviously crucial during listening tests.
6.2.6 Marking
Great care must be taken in marking the trialled tasks, particularly those
which might involve subjective judgement such as short answer ques-
tions. For large-scale test administrations, it is recommended that an
optical scanner should be used for the selected response type items and
markers should grade only the constructed response items. However, this
is not practical in the case of small-scale testing. Where a number of
markers are involved in grading the trial results, the following procedure
has been shown to be useful:
Correct answer = 1
Incorrect answer = 0
No answer = 9
4. Selected response items can also be marked this way (0, 1 and 9) but
the actual letter chosen by the test taker (A, B, C or D in MCQ
items, for example), should be entered into the data spreadsheet so
that a distracter analysis can take place (see 6.3.2.1 below).
5. It is recommended that the group as a whole works together on one
task at the beginning; an SAQ task is probably the most useful in
terms of learning how to deal with unexpected answers/anomalies.
6. The markers may have to listen to the sound files to determine
whether a particular answer (not in the key) is correct. Therefore,
copies of the sound file must be made available together with an
appropriate playing device.
7. Where an alternative answer to those appearing in the key occurs, the
marker must call this to the attention of the group leader and a con-
sensus should be reached as to whether it is acceptable or not. Where
it is accepted, all groups should add the new answer to their key.
8. If there is any chance that such an answer has come up before but has
not been mentioned, back papers must be checked and corrected
accordingly in all groups.
9. It is recommended that the group as a whole work as much as pos-
sible on the same task so that any queries can be dealt with while still
fresh. However, markers will inevitably work at different rates so this
will lead to different tasks being marked by people in the same group.
10. When all the listening tasks in the test booklet have been marked, it
is useful if the raters can calculate the total score for each test taker
and place this on the front of the test booklet, for example, Listening
Total = 17. This will help when checking data entry (see 6.3.2 below)
and the markers' calculations can later be corroborated by the statis-
tical programme used.
11. From time to time, it is useful for the person(s) running the marking
workshop to check a random sample of marked test booklets for
consistency. Any anomalies found should be discussed with the
group as a whole.
12. Where there is clear evidence of an insincere (test taker) response pat-
tern, for example, a long string of nonsense answers unrelated to the
task, the test booklet should be set aside in a separate box for the ses-
sion's overall administrator to judge whether or not it should be marked.
13. Once all the listening tasks have been marked, and a random sample
of test booklets has been checked, data entry can begin.
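To make the coding conventions above concrete, the following is a minimal sketch of how trial data might be entered and totalled, written in Python with pandas. The column names, data and file layout are invented for illustration and do not represent the author's own spreadsheet template.

import pandas as pd

# Invented trial data: one row per test taker, raw responses per item.
# Constructed response (SAQ) items are entered as 1 (correct), 0 (incorrect)
# or 9 (no answer); MCQ items keep the letter chosen so that a distracter
# analysis remains possible (an empty string stands for no answer).
raw = pd.DataFrame({
    "test_taker": ["T001", "T002", "T003", "T004"],
    "saq_q1":     [1, 0, 9, 1],
    "mcq_q1":     ["B", "C", "", "B"],
    "mcq_q2":     ["A", "A", "D", "A"],
})

KEY = {"mcq_q1": "B", "mcq_q2": "A"}   # answer key for the MCQ items

# Recode the MCQ letters to 1/0/9 for scoring while keeping the raw
# letters untouched in their original columns.
scored = raw.copy()
for item, key in KEY.items():
    scored[item + "_score"] = raw[item].map(
        lambda x, key=key: 9 if x == "" else int(x == key)
    )

# Listening total per test taker (the no-answer code 9 earns no marks).
score_cols = ["saq_q1"] + [item + "_score" for item in KEY]
scored["listening_total"] = scored[score_cols].replace(9, 0).sum(axis=1)
print(scored[["test_taker", "listening_total"]])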
A number of people reading this book will probably quail at the idea of
getting involved in any kind of statistical analysis however simple it may
be. As mentioned in Green (2013), the most important thing to remember
is that the results of the analyses you carry out can be directly applied to
the tasks you have painstakingly developed. This makes understanding the
numbers so much easier. By spending copious amounts of time on devel-
oping and trialling tasks, but then leaving the data analyses to others who
have not been involved in the test development cycle, you will lose immea-
surably in terms of what you can learn about your tasks, your test develop-
ment skills and subsequent decision making. Conversely, you will gain so
much more by taking on the challenge that data analyses can offer you.
Item analysis is one of the first statistical procedures that you as a
test developer should carry out on your trialled tasks once data entry
is complete and the data file has been checked for errors. (See Green
2013, Chapters1 and 2 for more details regarding these procedures.)
This is because it provides information on how well the items and the
tasks have performed in the trial. It does this, firstly, by telling us which
items the test population found easy and which they found difficult.
This information should be compared to your expectations; where dis-
crepancies are found (for example, where a task which you expected
to be easy turned out to be one of the more difficult ones, or vice versa),
the findings need to be investigated and a reason for any differences
found.
Secondly, item analysis enables us to see how particular test methods
are working. For example, we can see how many items are left blank across
the various test methods. Thirdly, the data can also show us the extent
to which the distracters in the multiple choice and multiple matching
tasks are working. Fourthly, item analysis can tell us which kind of test
takers (stronger/weaker) are answering the items correctly and which are
not. In other words, it will tell us whether the items are discriminating
appropriately between the test takers, with the stronger ones answering
the items correctly, and the weaker ones not. Fifthly, item analysis can tell
us to what extent the items are working together, that is, whether all the
items seem to be tapping into the same construct (for example, listening
for specific information and important details) or whether some appear
to be tapping into something else (for example, the test taker's knowledge
of geography, mathematics and so on) and thereby introducing construct
irrelevant variance into the test.
All of the above helps the test developer immensely in determining
whether their items are performing as they had hoped and to what extent
they are providing an accurate picture of the test takers ability in the
targeted domain.
attracted more than 7 per cent of the test population. Interestingly, how-
ever, 14.7 per cent of the test population have selected no answer at all.
This relatively high (more than 10 per cent) proportion of no answers needs
investigating. There is a similar pattern in question 4, though the item is
slightly easier (facility value = 48.4 per cent).
In question 5, the item has a facility value of 50 per cent, but one of
the distracters (A) is not working. Only 2.7 per cent of the test takers
failed to answer this question. Question 6 follows a similar pattern with
a slightly easier facility value (58.7 per cent) and only 3.3 per cent no
answers.
The test takers found question 7 much easier (facility value = 74.5 per
cent), but again two of the distracters (C and D) were quite weak (3.3
and 2.7 per cent, respectively). In question 8, the facility value was 36.4
per cent, but more test takers chose B (37 per cent), suggesting that the
distracter was working too well and needs investigating. Distracter A was
also weak (2.2 per cent) in this item.
Summary
The facility values in the task range from 82.1 per cent to 35.9 per cent. If
this task is supposed to be targeting one ability level, say CEFR B1, these
findings would suggest that some items are not at the appropriate level. A
number of the items have weak distracters (attracting less than 7 per cent
of the test takers) and there are two items that have more no answers
than one might expect. One distracter was stronger than the key (item 8),
though at this stage we do not know who chose B and whether these were
the weaker or the stronger test takers. All of the above needs to be inves-
tigated but first let us turn to stage two of the item analysis to see what
else can be learnt before making any final decisions regarding these items.
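For readers who want to reproduce this kind of stage-one analysis themselves, the sketch below shows one way of computing facility values, no-answer rates and distracter percentages from letter-coded MCQ responses. It is written in Python with pandas; the data and item names are invented, so the figures it prints are purely illustrative and are not those discussed above.

import pandas as pd

# Invented letter-coded MCQ responses ("" = no answer), one column per item.
responses = pd.DataFrame({
    "Q5": ["B", "B", "C", "D", "", "B", "C", "B", "D", "B"],
    "Q8": ["B", "C", "B", "B", "A", "C", "B", "", "B", "C"],
})
KEY = {"Q5": "B", "Q8": "C"}   # answer key

for item, key in KEY.items():
    col = responses[item]
    facility = (col == key).mean() * 100    # facility value (% correct)
    no_answer = (col == "").mean() * 100    # % of the population giving no answer
    print(f"{item}: facility = {facility:.1f}%, no answer = {no_answer:.1f}%")
    # Distracter analysis: percentage of the population choosing each option.
    for option in sorted(set(col) - {""}):
        pct = (col == option).mean() * 100
        label = " (key)" if option == key else ""
        print(f"    {option}{label}: {pct:.1f}%")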
Discrimination tells us about the extent to which the items in a task are able
to separate the stronger test takers from the weaker ones. What we are hoping
to see is that the better test takers answer more items correctly than the weaker
ones; this is what is referred to as positive discrimination. Discrimination is
calculated by looking at how well a test taker performs on the test as a whole
compared with how s/he performs on a particular item. For example, if a test
taker does well on the test as a whole, one would expect such a test taker to
answer an easy or average item correctly and probably get only some of the
most difficult ones wrong. When this does not happen, when good test takers
answer easy items incorrectly (perhaps due to a flaw in the item or through
simple carelessness), we might find a weak discrimination index on those
particular items. On the other hand, if a test taker does poorly on the test as
a whole, it is more likely that such a test taker will answer a difficult or an
average item incorrectly and probably get only the easier ones correct. Again
where this is not the case, we might find weak discrimination on the particu-
lar items concerned. (Obviously, in either of the above scenarios, where this
happens with only one or two test takers in a large test population, there is
likely to be little impact on the discrimination index of the items involved.)
Discrimination is measured on a scale from -1 to +1. A discrimina-
tion figure of +0.3 is generally accepted as indicating that an item is dis-
criminating positively between the stronger and the weaker test takers.
Depending on how the scores are to be used (high stakes versus low stakes
tests) a discrimination index of 0.25 may also be seen as acceptable (see
Henning 1987). Where the discrimination figure is below 0.3 (or 0.25),
the item should be reviewed carefully as it might be flawed. For example,
the item may have more than one answer (MCQ), no answer, be guessable
by the weaker test takers or have ambiguous instructions. Alternatively,
the item may be tapping into something other than linguistic ability. In
this case the item should be checked for construct irrelevant variance.
It should be remembered that in an achievement test, the discrimina-
tion figures may be low simply because all the test takers have under-
stood what has been taught and have performed well on the test. In other
words the items cannot separate the test takers into different groups, as
the amount of variability between them is too small. Popham (2000)
offers this useful table regarding levels of discrimination:
.40 and above  Very good items
.30 to .39     Reasonably good but possibly subject to improvement
.20 to .29     Marginal items, usually needing and being subject to improvement
.19 and below  Poor items, to be rejected or improved by revision
Let us have a look at the same eight MCQ listening items as in 6.3.2.1
and see what this stage of item analysis can tell us. In IBM-SPSS, dis-
crimination is referred to as corrected item-total correlation (or CITC):
Corrected Item-Total Correlation
Q1 .314
Q2 .340
Q3 .223
Q4 .312
Q5 .280
Q6 .249
Q7 .251
Q8 .203
What can we learn from Figure 6.7? If we use the lower parameter of
0.25 (Henning 1987), we can see that there are two items that fail to
reach this level: items 3 and 8 (item 6, when rounded up, would result
in 0.25). You will remember from Stage 1 that item 3 was the item that
nearly 15 per cent of the trial population failed to answer. This suggests
that perhaps the item and/or that part of the sound file was problematic
in some way for the test population. This finding again suggests that the
item needs to be investigated. In item 8, more test takers chose distracter
B than the key C, and the weak CITC in Figure 6.6 suggests that at
least some of these were the better test takers. Again this finding needs
exploring.
Summary
All but two of the items have satisfactory discrimination values (above
0.25). Items 3 and 8 need examining to reveal the reasons behind their
weak statistics.
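The corrected item-total correlation reported by IBM-SPSS can also be reproduced with a short script: each item is correlated with the total score of the remaining items, so that the item is not correlated with itself. The sketch below uses Python with pandas and invented 0/1 item scores; the 0.25 threshold follows Henning (1987) as discussed above, and the output is illustrative only.

import pandas as pd

# Invented dichotomous (0/1) item scores: rows = test takers, columns = items.
scores = pd.DataFrame({
    "Q1": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "Q2": [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
    "Q3": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
    "Q4": [1, 1, 1, 1, 0, 1, 0, 0, 1, 0],
})

total = scores.sum(axis=1)

# Corrected item-total correlation (CITC): correlate each item with the
# total score minus that item. With 0/1 items this is a point-biserial
# correlation, the figure IBM-SPSS labels 'Corrected Item-Total Correlation'.
for item in scores.columns:
    rest = total - scores[item]
    citc = scores[item].corr(rest)
    verdict = "acceptable" if citc >= 0.25 else "review"
    print(f"{item}: CITC = {citc:.3f} ({verdict})")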
will not be so closely related in terms of what is being targeted (the con-
struct). In other words, a test taker may do well when his/her linguistic
knowledge is being targeted but when s/he also has to use mathematical
knowledge, s/he may respond in a different way to the item. This will be
reflected in the Cronbach Alpha value for that item if a significant pro-
portion of the population has experienced this problem (see Green 2013,
Chapter 3 for more on this issue).
Figure 6.8 shows us the Cronbach Alpha values for the task as a whole
(top table) and for the eight individual MCQ items (bottom table). In
order to understand the figures in the second table we need to look at the
two Cronbach Alpha values together.
Figure 6.8  Reliability statistics for the eight-item task

Reliability Statistics
Cronbach's Alpha    N of Items
.561                8

Item    Cronbach's Alpha if Item Deleted
Q1      .518
Q2      .503
Q3      .543
Q4      .513
Q5      .524
Q6      .535
Q7      .534
Q8      .550
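As with the discrimination figures, these values can be checked outside SPSS. The sketch below, which reuses the hypothetical trial_scores.csv data from the earlier sketch, computes Cronbach's Alpha from the item variances and the variance of the total score, and then recomputes it with each item removed in turn to reproduce the 'Alpha if Item Deleted' column.

import pandas as pd

responses = pd.read_csv("trial_scores.csv")  # hypothetical 0/1 item data, as before

def cronbach_alpha(items):
    """Cronbach's Alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_variances = items.var(ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

print(round(cronbach_alpha(responses), 3))  # Alpha for the task as a whole

for item in responses.columns:
    # dropping an item and recomputing gives 'Alpha if Item Deleted'
    print(item, round(cronbach_alpha(responses.drop(columns=item)), 3))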
Summary
We have now analysed how the items perform in terms of their facility
values, discrimination indices and internal consistency. What conclusions
have we come to? At the facility value stage, item 3 appeared to be more
difficult, which might be interpreted as suggesting that it does not belong
to the same level of difficulty as the other items. Its discrimination power
was also a little weak (0.223) and it contributed little to the overall alpha.
This suggests that the item should be reviewed. Item 8 was also seen to
be problematic at the facility value stage where one of the distracters was
selected by more test takers than the key. In terms of discrimination it
was the weakest (0.203) of all the items and contributed least to the task's
internal consistency. It should also be reviewed.
One final statistic which provides useful insights into how your task is
performing is the average score that the test takers achieved; in other
words, the mean. IBM-SPSS provides this information as part of the reli-
ability analysis and the figure is shown in Figure 6.9 below:
Figure 6.9  Mean score for the task

Mean    N of Items
4.46    8
This table tells us that the average score among the 184 test takers who
took the task was 4.46 out of a possible 8, or, in percentage terms, 55.7
per cent, suggesting that the task was neither very easy nor very difficult
for this test population. This statistic should be matched against your
expectations of how difficult or easy you expected the test takers to find
the task.
In light of the outcomes of the item analysis, there are usually three pos-
sible routes the task can take: it can be banked for future test purposes;
it can be revised; or it can be dropped. Quantitative and qualitative data
from test taker feedback questionnaires (see 6.1.9) should also be taken
into account when making this decision. Where it is felt that an indi-
vidual item should be dropped due to weak statistics, care must be taken
to ensure that this does not impact on the other items by, for example,
creating a lengthy unexploited gap in the sound file which could, in turn, cause confusion or anxiety that affects the test takers' performance. Any revisions which are made to the task will need to be re-trialled, as solving one issue could result in creating another unforeseen problem.
It goes without saying that item analysis should take place not only at
the field trial stage but also after the live test administration to confirm
the decisions taken about the items and tasks, and to provide further use-
ful feedback to all stakeholders including the test developers.
6.4 Conclusions
The wealth of insights that trialling and data analyses offer to the test
developer is immeasurable. In your own test development situation, you
might not be able to do everything that has been discussed in this chap-
ter, but the more you can do, the more confidence you will have in the
tasks that you and your colleagues create and the test scores that they
produce.
DLT Bibliography
Bachman, L. F. (2004). Statistical analyses for language assessment. Language
Assessment Series. Eds. J.C. Alderson & L.F. Bachman. Cambridge: CUP.
Buck, G. (2009). Challenges and constraints in language test development. In
J.Charles Alderson (Ed.), The politics of language education: Individuals and
institutions (pp.166-184). Bristol: Multilingual Matters.
Carr, N.A. (2011). Designing and analysing language tests: A hands-on introduc-
tion to language testing theory and practice. Oxford Handbooks for Language
Teachers. Oxford: Oxford University Press.
Dörnyei, Z. (2003). Questionnaires in second language research. Mahwah, NJ:
Lawrence Erlbaum Associates.
Green, R. (2013). Statistical analyses for language testers. New York: Palgrave
Macmillan.
Green, R., & Spoettl, C. (2009). Going national, standardised and live in Austria:
Challenges and tensions. EALTA Conference, Turku Finland. Retrieved from
http://www.ealta.eu.org/conference/2009/docs/saturday/Green_Spoettl.pdf
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test
items. Oxon: Routledge.
Henning, G. (1987). A guide to language testing: Development, evaluation,
research. Cambridge, MA: Newbury House.
Linacre, J.M. (2016). WINSTEPS Rasch measurement computer program version
3.92.1. Chicago, IL: Winsteps.com.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd
ed., pp.13-103). NewYork: Macmillan.
Popham, W. J. (2000). Modern educational measurement (3rd ed.). Boston:
Allyn & Bacon.
7
How do we report scores and set pass marks?
The first decision you need to make when considering how scores should
be reported is whether your listening test results will be reported as an
individual skill, or as part of a total test score including other skills such
as reading, language in use, writing and speaking. Your answer needs to
take into account such factors as the purpose of the test and how the test
results are to be used. For example, if the purpose of the test is diagnos-
tic, placement or achievement, there are good reasons for the skills to
be reported separately. In a diagnostic test, the more information you
can obtain about a test taker's strengths and weaknesses the better; collapsing the scores will result in a lot of useful information being hidden.
The results of a placement test are generally used as the basis for deter-
mining which class is appropriate for a test taker. Clearly, having more details will help, particularly if the classes are subdivided for the teaching
of different skills. The results of an achievement test are usually fed back
into the teaching and learning cycle. Receiving information on individual
skills would help the teacher to decide which particular skills need further
attention.
If the test has been designed to assess a test taker's proficiency, however,
a global score might be more useful. This is especially true if it is to be
sent to end-users such as tertiary level institutions or prospective employ-
ers. Having said that, if a particular course is linguistically demanding,
the receiving department might well be more interested in the profile of
the test taker's abilities so they can more easily judge whether the student
will be able to cope with various aspects of the course.
Having access to both types of results (separate and overall) seems
to be the most practical option and is the approach which some inter-
national examinations take. For example, in IELTS (the International
English Language Testing System) the test taker is awarded a band from
1 to 9 for each part of the test: listening, reading, writing and speaking.
The bands are then averaged to produce the overall band score. All five
scores (four individual and one overall) appear on the certificate the test
takers receive. Some examination boards also report scores at the sub-skill
level. For example, the Slovenian Primary School National Assessment
in English reports performance on listening for main ideas, listening for
details and so on.
Some professions also prefer a breakdown of results and go so far as to
advertise job openings citing the specific linguistic requirements neces-
sary in each skill. For example, to qualify for posts within SHAPE (the
Supreme Headquarters Allied Powers Europe), candidates need to show
that they have the required SLP (Standardized Language Profile) for that
particular post. If the necessary SLP were 3332, for instance, this would
mean that the candidate would need a Level 3 in Listening, a Level 3 in Speaking, a Level 3 in Reading and a Level 2 in Writing. (STANAG Level 2 = Fair: limited working; STANAG Level 3 = Good: minimum professional; see Green and Wall 2005: 380.)
Whether you choose to report both sets of scores, or just the global
result, you will also need to decide whether a compensatory approach
should be allowed. This is where a test taker's weak performance in one
skill can be helped by a stronger performance in another skill. Let us take,
for example, a test taker whose performance across the four skills, based
on the CEFR language descriptors, was as follows: C1 in reading, B1 in
into the final results. Some educational systems provide an online cal-
culator into which schoolteachers can feed the raw numbers for each of
the various skills being tested. The calculator then takes those figures and
produces the final result, having factored in any necessary weighting.
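The arithmetic behind such a calculator is straightforward. The sketch below shows one possible implementation; the skill names, raw scores, maximum scores and weights are all invented for illustration and would need to reflect your own test and weighting policy.

def weighted_total(raw_scores, max_scores, weights):
    """Convert each skill's raw score to a percentage and combine them using the given weights.

    All three arguments are dictionaries keyed by skill name; the weights should sum to 1.0.
    """
    return sum(
        weights[skill] * 100 * raw_scores[skill] / max_scores[skill]
        for skill in raw_scores
    )

# Hypothetical weighting: listening and reading 30 per cent each, writing and speaking 20 per cent each
final_result = weighted_total(
    raw_scores={"listening": 19, "reading": 32, "writing": 14, "speaking": 16},
    max_scores={"listening": 30, "reading": 40, "writing": 20, "speaking": 20},
    weights={"listening": 0.3, "reading": 0.3, "writing": 0.2, "speaking": 0.2},
)
print(round(final_result, 1))  # the combined, weighted percentage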
Numbers alone will have no meaning unless they are accompanied by some
informed expert judgement about what the numbers actually mean given a
typical population and bearing on different aspects of the testing process.
Other stakeholders advocate the use of letters when reporting test takers' scores, but are these really any better? For example, what does an A
mean? Is the difference between A and B the same as the difference
between C and D? And the perennial question: is a performance which
is awarded a grade A on X test the same as a grade A awarded on Y test?
In other words, we seem to be in a similar predicament to that of scores
being reported as numbers above. Without some accompanying state-
ment as to what A means in the context of a given examination, we are
really none the wiser. What about scores which are reported as percent-
ages? Do they provide a clearer picture? Unfortunately, if a test taker gets
75 per cent on a test, you still need to know what the 75 per cent relates
to in terms of content in order to allocate some meaning to that figure.
Which leaves us with the crucial question: who determines the stan-
dard? Having been involved in the development of the test items, the test developer will find it quite difficult to do this in an objective way. This
means that ideally the decision makers need to come from outside the
task development team and yet they also need to have a clear understand-
ing of the context in which the standard is to be applied. No single per-
son can do this reliably; this is where procedures such as standard setting
can help enormously (see 7.2).
The pass mark in a test is perhaps a more traditional way of talking about
the standard. It is no easier to set, however. A decision still needs to be
made regarding what constitutes sufficient evidence to state with confi-
dence that a test taker has reached the required level, and therefore can
be awarded a pass. The actual pass mark in many school examinations
seems to be somewhat arbitrary; personal experience has shown that this
can range from as low as 32 per cent up to 65 per cent. As Alderson et al. (1995: 155) remark, the pass mark is usually simply a matter of
historical tradition.
Depending on the type of examination you are involved with, you may
have to identify not just one pass mark or cut score, but several within
one test. For example, if you have developed a multi-level test, target-
ing CEFR A1-B2, you will need to decide on the cut scores between A1
and A2, A2 and B1, and B1 and B2 as well as what is considered to be a
performance which is below A1 and which thus cannot be awarded that
CEFR level.
The above scenario would entail making decisions about four cut
scores. This is not an easy task. Some examinations leave such decisions
to the end-users, and simply report the raw score. For example, the most
prestigious universities in a given country may set a very high thresh-
old on a university entrance test for students wishing to study there.
In the Slovenian Primary National Assessment Tests, by contrast, there
is no pass mark; the students receive a report telling them their score
and how well they have done in comparison with the whole population.
Some international English language tests also leave the decision to the
end-user. For example, IELTS reports the results of a test taker's performance, but it is left to the receiving department at a university to decide whether the bands are sufficient for the particular course for which s/he is applying.
For many people working in the assessment field, leaving the deci-
sion to the end-user is not an option. Stakeholders expect informed deci-
sions to be made regarding whether test takers should pass or fail, and/
or whether they have reached the required standard(s). One possible
solution to this dilemma is to carry out a standard setting procedure as
described in 7.2 below. This procedure is of particular relevance to those
who are involved in high-stakes testing but hopefully will be of interest
to all involved in setting standards in their tests.
Standard setting refers to the process of establishing one or more cut scores on a test (Cizek and Bunch 2006: 5). It is a procedure that enables those who are involved to make decisions about which test takers' performances
There are a number of reasons why test development teams should put
their tasks through standard setting. Firstly, the decisions made by the
external judges (see 7.2.4) concerning the appropriateness of the tasks for
measuring the targeted criteria are invaluable in helping the facilitators,
who are in charge of the standard setting session, to determine the stan-
dard required by the test takers. In other words, the procedure makes it
possible for the facilitators to identify the minimum cut score which a test
taker needs to reach in order to be at the required standard or level in a
particular examination (see 7.2.9). (Unfortunately, these minimum cut
scores are not always put into practice by the relevant educational systems.)
A second reason for putting the tasks through this procedure is that the
judges can provide informed feedback on the quality of the tasks. This
can include insights into the appropriateness of the sound files in terms
of the accents used, the speed of delivery and the topics. Information
about the suitability of the task methods with respect to the test tak-
ers level of familiarity, and the relationship between the tasks and the
targeted construct, can also be obtained. In addition, feedback on the level of difficulty of both the sound file and the task, and on how well they reflect the targeted standard, is a further useful benefit such sessions can produce. All of these insights can be channelled back into the task devel-
opment cycle (see 1.7.1) by the sessions facilitators after the standard
setting procedure is complete.
If you are thinking of carrying out a standard setting session, you should
be aware that there is a substantial amount of preliminary work to be
done before it can take place. First of all, identifying experts who can ful-
fil the requirements of being a standard setting judge is time-consuming,
and this work must be carried out well before the session takes place.
(See 7.2.4 for a discussion regarding the pre-requisites of being a judge.)
Putting this phase into effect a year in advance is really not too soon as
the people you will probably want to invite as judges are likely to be busy.
As mentioned in 7.1.5, it is not recommended that test developers be
called upon as judges due to the difficulties they would face in remaining
objective during the standard setting procedure.
Once the judges have been identified, they need to be contacted and
their availability for the whole of the standard setting session must be
confirmed. A judge who wants to leave halfway through the sessions, or
dip in and dip out, causes mayhem for the final decision-making pro-
cess. Moreover, such judges leave with only a partial picture of not only
their own role in the process, but of the purpose of standard setting as a
whole.
Second, it helps to appoint an administrator who will be in charge of
such issues as the venue where the standard setting sessions will be held,
hotel accommodation, travel, per diem and so on.
Third, members of the testing team need to decide which tasks should
be presented at the standard setting session. These tasks should have
good qualitative and quantitative statistics, have been banked after field
trialling (see Figure 1.2) and reflect the targeted standard. Including tasks
which fail on any of these criteria would be an extremely inefficient use
of resources (the tasks are likely to be rejected by the judges) and lead to
reliability issues in terms of cut score decisions (see 7.2.9).
Once appropriate tasks have been identified, a judgement needs to be
made regarding how the task will appear in the test booklets. This will depend,
of course, on which standard setting method is to be used (see 7.2.6). For
example, if the Bookmark Method is to be followed, the tasks need to be
placed in order of difficulty; if a modified Angoff method is selected, it is
usually more practical to organise the tasks by test method to save time.
In addition to creating the test booklets, the testing team will need to
prepare the following documents:
- Copies of the sound files in the order in which the tasks appear in the judges' test booklets. These should include the task instructions. The amount of time provided should replicate the conditions under which the test takers completed the tasks.
- The key for each of the tasks in the test booklets.
- The language descriptors and global scale tables against which the tasks are to be standard set.
- The rating sheets which the judges will use to record their judgements, including those which contain the field trial statistics (see 7.2.7).
- Copies of the familiarisation exercise (see 7.2.5).
- Copies of the evaluation sheets for judges to provide feedback to the facilitators on the session, including their confidence in the ratings they have given.
- Copies of a confidentiality agreement (high-stakes situations).
It is crucial that those judges who are selected to attend the standard set-
ting session have the necessary qualities to carry out that role. They should
be regarded as stakeholders and be as representative as possible in the
given context. For example, in a school leaving examination, the judges
are likely to include some or all of the following: school and university
teachers, teacher trainers, school inspectors, headmasters and ministry
officials. Where the test is a national one, selecting judges from various
parts of the country is also recommended so as to avoid any question of
possible bias. Finally, if resources permit, it is useful to invite an external
participant, that is, someone from outside the immediate context (pos-
sibly from another country) who can bring an external perspective to the
session.
Finding such a range of judges is not easy as they need to have not
only a certain level of ability in the targeted language (at least one level
higher than that being targeted and preferably more), but also a sound
knowledge of the relevant system within which the tasks they are to judge
are situated. For example, the judges mentioned above would need to
be familiar with the educational context the tasks will be used in. The
judges also need to be familiar with the language descriptors against
which the test items are to be measured, for example, the CEFR, ICAO,
or STANAG among others.
In addition to the above prerequisites, judges must also be able to fill
the role of a judge. To do this, they must have the capacity to set aside
their own ability in the language being targeted. In other words, they
must ignore what they personally find easy or difficult, as well as what
their own students might, and focus purely on the scale against which the
tasks are to be measured.
Finally, as mentioned in 7.2.3.1, they must be able to devote sufficient
time to the procedure. Standard setting sessions can last up to five days
depending on the number of skills and tasks being tabled, and judges
who cannot commit to the whole period of standard setting should not
be invited (see Cizek and Bunch 2006, Chapter 13 for more insights on
the participant selection process).
tators to confirm that the judges are indeed familiar with the language
descriptors that are to be used in the session as it is their judgements
which will be factored into the cut-score decisions after the standard set-
ting procedure is complete (see 7.2.9). This confirmation is normally
achieved by asking the judges to complete a familiarisation exercise on
the first morning of the procedure. The exercise can take various forms,
but one of the most popular ones involves the judges being given a list
of randomised descriptors taken from the scales they are to set the tasks
against. Equipped with rater numbers to protect their anonymity, judges
are then asked to put one scale level against each of the descriptors.
Figure 7.1 below shows an extract from such an exercise based on the
CEFR.
[Figure 7.1 (extract): a list of CEFR descriptors, one per row, with columns headed 'Your Answer' and 'Key' for the level assigned by the judge and the intended level]
Once the judges have completed the column with their responses, the
papers should be collected in. The judges' responses are then entered into
a spreadsheet and projected onto a screen so that all participants can
see how the descriptors have been rated. Discussion of the various rat-
ings, as well as clarification regarding any perceived ambiguities in the
descriptors, then follows with the key being revealed at the end. Where
any of the judges are shown to have an unacceptable lack of familiarity
with the descriptors, the facilitators must decide whether they should
remain in the pool of raters. (Where a number of skills are being standard
set within one session, this familiarisation procedure should be repeated
with descriptors from each skill.)
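One simple way for the facilitators to see how closely each judge's familiarisation answers match the key is sketched below; the judge numbers, descriptor labels and levels are invented purely for illustration, and a low agreement figure would only be a prompt for discussion, not an automatic exclusion.

import pandas as pd

# Hypothetical familiarisation responses: one row per judge, one column per descriptor
ratings = pd.DataFrame(
    {"d1": ["B2", "B2", "B1"], "d2": ["B1", "B1", "B1"], "d3": ["C1", "B2", "C1"]},
    index=["judge_01", "judge_02", "judge_03"],
)
key = pd.Series({"d1": "B2", "d2": "B1", "d3": "C1"})  # the intended levels

# Proportion of descriptors each judge placed at the same level as the key
agreement = (ratings == key).mean(axis=1)
print(agreement.sort_values())  # unusually low values may signal a lack of familiarity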
Where a pool of standard setting judges can be established, and can
be called upon on an annual basis, this is obviously of great benefit to
the facilitators as it cuts down on the amount of time needed for training and familiarisation in the standard setting session. It also makes it
possible to compare the difficulty level of tasks year on year, and even
across languages where there are a sufficient number of multilingual
judges available (see Green and Spoettl 2011). Ideally, all tasks which
are used in high-stakes tests should go through some form of external
review which ultimately means holding a standard setting session every
year. For practical reasons, unfortunately, this does not happen in many
countries.
Trial statistics provide a useful measure against which the judges can
compare the ratings they have assigned to each item once their judge-
ments have been completed (see step 16 below in 7.2.8). Although it is
the language descriptors which should be the final arbiter in deciding the
difficulty level of an item, judges are sometimes unwittingly influenced
by some characteristic of the task and/or the sound file. The field trial
statistics provide empirical evidence of how the tasks performed which,
in turn, should help highlight any personal reaction to an item or task
and prompt the judge to review their rating(s).
When revealing the statistics, the judges are usually supplied with
information about how many test takers answered the item correctly
(facility values), how the test methods performed and, where feedback
questionnaire data are available, how the test takers perceived the tasks.
Details about the test takers are also supplied including the numbers
involved, their representativeness of the target test population as a whole,
their appropriateness in terms of targeted ability level and the time of
year the field trial was administered in case this has had any impact on
the difficulty level of the items (see 6.2.2).
10. The judges are reminded that the purpose of standard setting is not
to discuss the quality of the items they are going to judge, but simply
to place each of the items at a particular CEFR level. (At the discre-
tion of the facilitators, time may be set aside for task discussion once
the ratings are complete and have been submitted so as not to disrupt
the procedure.)
11. The judges are provided with the first test booklet and asked to apply
a level to each test item in each task based on the sound files they will
hear and using the language descriptors and global scales. This is
known as Round 1.
12. The keys to the items are distributed. The judges check their answers
and, where necessary, review the CEFR levels they have assigned.
13. The judges' ratings from Round 1 are entered into a spreadsheet.
14. The levels awarded by the judges are looked at globally (and anony-
mously) on screen.
15. The average ratings per item across the judges are discussed, as well as any outliers (those who have assigned extreme levels in comparison with the rest of the judges); one way of computing these averages is sketched after this list. During the discussion, individual judges can provide their rationale for assigning a particular level if they so desire, but this is not compulsory.
16. The statistics from the field trial are provided and discussed in rela-
tion to the judges' ratings.
17. The judges are given an opportunity to make adjustments to their
Round 1 judgements if they so wish in light of the discussion and the
field statistics. There is no obligation to do so. These become the
Round 2 ratings.
18. The Round 2 ratings are entered into a spreadsheet for use in the cut
score deliberations after the standard setting procedure is complete.
19. The judges repeat the above process with further test booklets as
necessary.
20. The judges complete a final evaluation form providing feedback on
their level of confidence in, and agreement with, the final recom-
mended level of the items.
21. The standard setting facilitators review the judges' decisions regard-
ing the difficulty level of the items and their feedback on the
session.
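The spreadsheet work described in steps 13 to 15 can be done in any package; the sketch below shows one possible way of averaging the Round 1 ratings per item and flagging ratings that sit well away from the item average. The scale mapping, judges, items and levels are all invented for illustration.

import pandas as pd

# Hypothetical Round 1 ratings: one row per judge, one column per item, CEFR levels as text
round1 = pd.DataFrame(
    {"item_1": ["B2", "B2", "A2", "B2"], "item_2": ["B1", "B2", "B2", "C1"]},
    index=["judge_01", "judge_02", "judge_03", "judge_04"],
)

scale = {"A2": 2, "B1": 3, "B2": 4, "C1": 5}  # numeric codes so the levels can be averaged
numeric = round1.replace(scale)

print(numeric.mean())  # average rating per item across the judges

# Flag ratings more than one level away from the item average as possible outliers
outliers = (numeric - numeric.mean()).abs() > 1
print(outliers)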
Once standard setting is complete, the data entry from Round 2 should
be checked and analysed to ascertain the overall level of each task. Once
this has been done, those tasks which have been judged to be above or
below the targeted level should be set aside. The facilitators then need to
make an initial selection from the remaining tasks as to which ones might
be the most appropriate for use in the live test.
In making this selection the facilitators need to factor in the field sta-
tistics in light of their suitability: the time of year when the trial took
place and hence the test takers' motivation, as well as how well they rep-
resent the target test population. The facilitators also need to take into
consideration the degree of confidence they have in the judges' ratings. For example, they should take into account the judges' knowledge of the language descriptors used, their previous exposure to standard setting procedures, the judges' own confidence in the levels they have awarded,
and the relationship between their judgements and the available empiri-
cal data.
The above procedure should result in identifying the most eligible
tasks. Sometimes, however, even these tasks might contain one or two
items on which the judges did not completely agree. For example, some
judges may have given an item a B2 rating, while others gave it a B1
rating. As mentioned in 2.5.1.4, it is not unusual for a task to include
an item which is either slightly easier or slightly more difficult than the
others. However, when such items are to be included in a live test, further
deliberation is necessary to decide how these might affect the cut score.
Let's look at an example. In a B2 listening test made up of four standard set tasks, the judges' ratings have indicated that there are five B1
items, and 25 B2 items. If we work on the hypothetical basis that a test
taker who is at B2 should be able to get 60 per cent of the B2 items cor-
rect, as well as 80 per cent of the B1 items, this would mean that the test
taker would need to answer 19 items correctly (15 at B2 plus 4 at B1) in
order to be classified as a B2 listener. A score of 19 out of 30, or 63.3 per
cent, would therefore be the cut score which would divide the B2 listen-
ers from the B1 listeners on these four particular tasks.
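Expressed as a quick calculation (using the hypothetical proportions from the example above, not a fixed rule):

# Hypothetical cut score calculation for the example above
b2_items, b1_items = 25, 5
required_correct = 0.60 * b2_items + 0.80 * b1_items  # 15 B2 items plus 4 B1 items = 19
total_items = b2_items + b1_items
print(required_correct, total_items, round(100 * required_correct / total_items, 1))  # 19.0 30 63.3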
Website materials
2. Test specifications
d. Assessment criteria
The more students know about the content and aims of a test, the more likely
they are to be able to do themselves justice in the examination hall.
Even though all the listening tasks which appear in the live test book-
lets should have gone through field trials, statistical analyses, and ideally
some form of standard setting prior to being selected, it is still impor-
tant to analyse their live test performance. This is because the field trials
will necessarily have been carried out on test takers whose motivation may have differed from that of the live test population, and therefore it is possible that the facility values might have changed.
It is recommended that the same analyses be carried out on the live test
results as those described in 6.3.2, that is frequencies, discrimination and
reliability analyses. Since the test population is likely to be much larger
than at the field trial stage, it should prove both useful and insightful to
with a score of 58, for example, could have a real score of between 56 and
60; a test taker with a score of 60 could have a real score of between 58
and 62, and so on. In order to be fair, all such borderline cases need to be
reviewed and their results confirmed before the final test scores are released.
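Flagging such borderline cases can easily be automated. The sketch below assumes, purely for illustration, a cut score of 60 and a band of plus or minus two score points to match the example above; in practice the width of the band would normally be based on the measurement error of the test.

# Hypothetical borderline check: flag any result whose band straddles the cut score
cut_score, band = 60, 2  # illustrative values only
scores = {"cand_01": 58, "cand_02": 64, "cand_03": 61, "cand_04": 55}

borderline = {cand: score for cand, score in scores.items() if abs(score - cut_score) <= band}
print(borderline)  # these results should be reviewed before the final scores are released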
7.5.2 Recommendations
In addition to providing insights into how the tasks have performed, the
post-test report should provide a list of recommendations. These might
include observations about the tasks themselves in terms of the test meth-
ods used, the topics, the amount of time provided to read and complete the
task, the level of difficulty inter alia. Although such issues will have been
analysed and reported on after the field trials, it is still useful to revisit these
aspects of the test if only to confirm that they are all working as expected.
The report might usefully include details about any test administration
issues which have come to light. For example, concerns regarding the acous-
tics at the test venue(s), the delivery of the test material, timing issues, and,
where possible, feedback from the test administrators and test takers. The
marking of the live test might also result in further recommendations regarding grading issues, including online support (for example, a hotline or email).
Final thoughts
The main objective behind developing good listening tasks is to produce
valid and reliable test scores. As Buck reminds us (2009: 176):
There is no such thing as a perfect test, but in following all the stages outlined in this book, I would argue that we have a much better chance of getting it right than if we had not done so.
DLT Bibliography
Alderson, J.C., Clapham, C., & Wall, D. (1995). Language test construction and
evaluation. Cambridge: CUP.
Bhumichitr, D., Gardner, D., & Green, R. (2013). Developing a test for diplo-
mats: Challenges, impact and accountability. LTRC Seoul, Korea: Broadening
Horizons: Language Assessment, Diagnosis, and Accountability.
Buck, G. (2009). Challenges and constraints in language test development. In
J.Charles Alderson (Ed.), The politics of language education: Individuals and
institutions (pp.166-184). Bristol: Multilingual Matters.
Cizek, G. J., & Bunch, M. B. (2006). Standard setting: A guide to establishing and
evaluating performance standards on tests. Thousand Oaks, CA: Sage
Publications, Inc.
Council of Europe. (2009). Relating language examinations to the common
European framework of reference for languages: Learning, teaching, assessment. A
Manual.
Figueras, N., & Noijons, J. (Eds.) (2009). Linking to the CEFR levels: Research
perspectives. Arnhem: CITO.
Fulcher, G. (2016). Standard and frameworks. In D. Tsagari & J. Banerjee
(Eds.), Handbook of second language assessment (pp. 29-44). Boston: De
Gruyter Mouton.
Geranpayeh, A. (2013). Scoring validity. In A.Geranpayeh & L.Taylor (Eds.),
Examining listening. Research and practice in assessing second language listening
(pp.242-272). Cambridge: CUP.
Green, R. (2013). Statistical analyses for language testers. New York: Palgrave
Macmillan.
Green, R., & Spoettl, C. (2011). Building up a pool of standard setting judges: Problems, solutions and insights. EALTA Conference, Siena, Italy.
Green, R., & Wall, D. (2005). Language testing in the military: Problems, poli-
tics and progress. Language Testing, 22, 379-398.
Martyniuk, W. (Ed.) (2010). Relating language examinations to the Common European framework of reference for languages: Case studies and reflections on the use of the Council of Europe's Draft Manual. Cambridge: CUP.
DLT Bibliography
Alderson, J.C. (2009). The politics of language education: Individuals and institu-
tions. Bristol: Multilingual Matters.
Brunfaut, T., & Révész, A. (2013). The role of listener- and task-characteristics
in second language listening. TESOL Quarterly, 49(1), 141-168.
Buck, G. (2009). Challenges and constraints in language test development. In J.
Charles Alderson (Ed.), The politics of language education: Individuals and
institutions (pp. 166-184). Bristol: Multilingual Matters.
Council of Europe. (2001). Common European framework of reference for lan-
guages: Learning, teaching, assessment. Cambridge, UK: Cambridge University
Press.
Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs,
NJ: Prentice-Hall.
Green, R., & Wall, D. (2005). Language testing in the military: Problems, poli-
tics and progress. Language Testing, 22, 379-398.
Harding, L. (2015, July). Testing listening. Language testing at Lancaster summer
school. Lancaster, UK: Lancaster University.
Hinkel, E. (Ed.) (2011). Handbook of research in second language teaching and
learning. NewYork: Routledge.
Linn, R. L. (Ed.) (1989). Educational measurement (3rd ed.). New York:
Macmillan.
Pallant, J. (2007). SPSS survival manual (6th ed.). Maidenhead: Open University
Press.
Tsagari, D., & Banerjee, J. (2016). Handbook of second language assessment.
Boston: De Gruyter Mouton.