RITA GREEN
Designing Listening Tests: A Practical Approach
Rita Green has spent many years at the coalface of language test develop-
ment and training in a variety of international contexts; this book is the
sum of this experience. This book is a fantastic resource for anyone look-
ing to develop listening tests: a highly practical, theoretically-grounded
guide for teachers and practitioners everywhere. Green covers a range of
important principles and approaches; one highlight is the introduction
to the textmapping approach to working with sound files. This book is
highly recommended for anyone involved in the development of listen-
ing tests.
Luke Harding, Senior Lecturer, Lancaster University, UK
In other words, the more time that listeners can spend in auto-
matic mode, the less demand there will be on their working memories
(Baddeley 2003; Field 2013). This, in turn, means that in the assess-
ment context, the listener will have more working capacity for dealing
with other issues, such as applying what s/he has understood to the
task. Test developers therefore need to think carefully about the degree
of cognitive strain they are placing on test takers when asking them
to process a sound file. Not only do test takers need to cope with the
listening processes discussed above but they also need to manage such
factors as language density, speaker articulation, speed of delivery,
number of voices, accessibility of the topic inter alia, all of which are
likely to contribute to the burden of listening for the second language
listener (see 2.5.1).
people often listen at only 25 per cent of their potential and ignore, forget, dis-
tort, or misunderstand the other 75 per cent. Concentration rises above 25 per
cent if they think that what they are hearing is important and/or they are
interested in it, but it never reaches 100 per cent.
– follow speech which is very slow and carefully articulated, with long pauses for him/her to assimilate meaning. (Overall Listening Comprehension)
– understand instructions addressed carefully and slowly to him/her and follow short simple directions. (Listening to Announcements and Instructions)
the listeners are, the wider the range of different listening behaviours the
tasks should measure in order to avoid construct under-representation.
Secondly, the test developer needs to decide whether the test takers'
listening ability should be measured by means of collaborative tasks,
non-collaborative tasks (Buck 2001) or both. At the collaborative (or
interactional) end of such a continuum, both listening and speaking
abilities would be involved, possibly through some kind of role-play,
problem-solving exercise, conversation, negotiation (for example, busi-
ness or diplomatic context) or transmission (aeronautical context). At
the non-collaborative (non-interactional) end, the listening event might
involve listening to a lecture, an interview or a phone-in. According to
Banerjee and Papageorgiou (2016: 8) large-scale and standardised listen-
ing tests use non-collaborative tasks.
Let's look at some concrete examples. Air traffic controllers (ATC)
need to be able to demonstrate not only good listening skills but also
the ability to interact when communicating with pilots or fellow ATC
colleagues (see ELPAC: English Language Proficiency for Aeronautical
Communication Test). Therefore, an interactional listening task is likely
to have much more validity. In occupational tests, such as those aimed
at civil servants or embassy support staff, where an ability to communi-
cate on the telephone is considered an important skill, the test would
ideally include some interactional tasks (see INTAN's English Language
Proficiency Assessment Test). Although tertiary level students need to dem-
onstrate their ability to take notes during lectures, which would suggest
non-interactional tasks have more cognitive validity, they may also need
to function in small-group contexts involving speaking which would
indicate interactional tasks are also important. In the case of young learn-
ers, it is also likely to be both.
on. In other words, many of the words a speaker produces are redundant:
they simply form part of the packaging and can be ignored
by the listener (see 1.5.1.3). The writer, on the other hand, is often
instructed or feels obliged to make every word count. This has obvi-
ous consequences for the listener when a written text is used as the
basis for a sound file.
Fourthly, due to its temporary nature, the spoken form may contain
more dialect, slang and colloquialisms than the written form. On the
other hand, though, the speaker may well exhibit more personal and
emotional involvement which may aid the listener's comprehension, espe-
cially where there is also visual input.
Fifthly, the discourse structure and signposting used differs across
the two forms. The written form has punctuation, while the spoken
has prosodic cues such as intonation, stress, pauses, volume and speed.
Depending on the characteristics of the speaker's voice, these prosodic
cues can either aid comprehension or hinder it: take, for example, a
speaker who talks very fast or someone who exhibits a limited or unex-
pected intonation pattern.
To summarise, where a sound file contains many of the written charac-
teristics discussed above, this increases the degree of processing required
by the listener. This is because the resulting input is likely to be more
complex in terms of grammatical structures, content words, and length
of utterances; also because it will probably exhibit less redundancy.
While this does not mean that input based on speeches or radio news, for
example, is invalid, careful thought must be given to the purpose of the
test, the test takers' needs and the construct upon which the test is based.
In other words, the test developer needs to ask him/herself whether, in
a real-life listening context, the test population for whom s/he is devel-
oping a test would ever listen to such a rendition. To this end, the test
developer may find it useful to carry out a needs analysis in order to
identify appropriate listening events for the target test population while
developing the test specifications (see 2.5). (See Chafe 1985, and Chafe
and Danielewicz 1987 for a more in-depth discussion of the differences
between the spoken and written word.)
Hi everyone
Er, today we're going to talk about first language acquisition or, to put it more
simply, how children learn their first language. In the first part of the lecture, I
language acquisition. ..
1.5.2.1 Multi-tasking
By now, it should have become clear to the reader why listening is con-
sidered a complex process. In order to be successful, the listener must
identify what the speaker is saying by simultaneously using a proces-
sor (which decodes the incoming message), a lexicon (against which
the words/phrases are matched), and a parser (which produces a mental
idea of what has been said). In addition, the listener is likely to call on
their knowledge of the topic, the speaker and the context while continu-
ously checking how everything fits into the whole picture. Visual input
(see 1.5.3.4) adds yet another dimension.
Given the need for multi-tasking, it is therefore not at all surprising that,
even with native speakers, listening breaks down and the listener must ask
for repetition or clarification if the speaker is present. Indeed it is really
quite amazing that as listeners we manage to do this in our own L1, let
alone that our students can manage this in their second or third languages.
In 1.1 above it was pointed out that the amount of time a listener has to
spend in controlled as opposed to automatic processing mode is likely
to impact quite heavily on how successful their listening will be. If we
then add to this the requirements of a task which
involves reading, and sometimes also writing, we have yet another factor
that the test developer needs to take into account. Too often, the strain of
having to process the sound file in real-time as well as respond to a task
is not fully appreciated, particularly if the tasks have not been through all
the recommended stages of task development (see 1.7).
1.5.3 Input
Based on the discussion so far in this chapter, it will have become clear
that the type of input the listener needs to process plays a major role in
terms of difficulty, and impacts on whether successful comprehension
takes place or not. The degree of success may be influenced by a number
of variables which are discussed below.
1.5.3.1 Content
Research carried out by Révész and Brunfaut (2013) found that input
which contained a higher percentage of content words, as well as a broader
range of words in general, increased the difficulty level for listeners as it
required more cognitive processing. Field (2013: 87) notes that the way
a word sounds when used in context, as opposed to the word being used
in isolation, also impacts on its level of difficulty for second language
listeners. He adds that longer pieces of input place an added burden on
the listener, as s/he has to continually modify the overall picture of what
the speaker is trying to convey.
1.5.3.2 Topic
lead to reliability issues in terms of the resulting test scores (see Buck
2001; Banerjee and Papageorgiou 2016). This is also true of input that
entails a lot of cultural references, as listeners may need to understand
more than the actual language used.
Going into a listening event cold is liable to increase the difficulty
level. Where the topic can be contextualised, listeners are likely to acti-
vate their world knowledge or relevant experiences (schemata) and thus
reduce some of the pressure which their working memories will need
to deal with (Vandergrift 2011). It therefore seems reasonable to argue
that the topic of the sound file be signalled to the listener in the task
instructions (see 4.2). Where this does not happen, it is more than pos-
sible that the first utterance or two of the recording will be lost as the
listener attempts to grapple not only with the unknown topic but also
with the speaker's accent, intonation and speed of delivery as well as the
task itself. In such scenarios, items which are placed at the very beginning
of the sound file are likely to prove particularly difficult to answer.
However, sometimes a test taker's background knowledge of a topic can
have a negative effect (Rukthong 2016). Lynch (2010: 54) points out:
It hardly needs to be said that, all things being equal, a poor quality sound
file is going to be much more difficult to process than one with good
sound quality. While in real life there are occasions when we do have to
cope with the former, it would be unfair to assess a test taker's listening
ability on something that is of poor sound quality unless it can be argued
that this is something the listener would have to do in the real-life listen-
ing context. Even air traffic controllers and pilots, who may well be faced
with such conditions, are able to ask the speaker to repeat the message.
Many test developers (often with their teacher's hat on) feel that sound
files that include background noise are unfair. However, from a realistic point
of view, some type of background noise is nearly always present, be it the
humming of lights, the air conditioner or noise resulting from traffic. The
important issue to remember is that any background noise should be sup-
portive rather than disruptive; in other words, the noise should help the lis-
tener by providing clues as to the context in which the event is taking place.
1.5.4 Task
There are a number of ways in which the task can contribute to the difficulty
experienced by listeners. These include the test method (how much does
the listener need to read and/or write in order to complete the items? Is the
method familiar? Is it appropriate to the type of listening being targeted?);
the wording of the instructions (Do these prepare the test taker for the task
they are to encounter? Do they introduce the topic in a helpful way?); the
example (Has this been included? Does it fulfil its role?); the total number
of items (Is there sufficient redundancy between the items for the listener to
process the input and complete the task before the next item needs answer-
ing?) amongst others. These issues are discussed in more detail in Chapter 4.
The actual physical location where the test takes place can also impact on
the difficulty level of the listening event. Such aspects as the acoustics of
the testing room as well as other conditions such as heat, space, light and
so on, can impact on the test taker and by extension his/her performance
on the test. Venues should be checked the day before field trials and
live administrations to minimise any external factors which might influ-
ence test performance (see 6.2.5).
The speed at which the speaker talks is likely to contribute to the difficulty
level of the input (Lynch 2010; Field 2013). Brunfaut (2016: 102) writes:
Since faster speech gives listeners less time for real-time processing, it has been
proposed that it results in more comprehension difficulties, particularly for less
proficient second language listeners. A number of experimental as well as non-
experimental studies have confirmed this hypothesis.
Many test developers have little idea of how fast people speak on
the sound files they select, and yet this is of crucial importance when
attempting to link a sound file with the appropriate level of ability (see
2.5.1.12). This holds true for the listener's mother tongue as well as for
second languages. According to Wilson (1998), when a sympathetic
speaker talks to a second language listener, not only does s/he uncon-
sciously adapt the content, but the speed of delivery is also spontane-
ously adjusted until the speaker is sure of what the listener can cope
with. He states:
What could be more natural than a native speaker slowing down their rate of
speech and using simplified vocabulary to a foreigner? What could be less natu-
ral than a native speaker talking at full speed to a foreigner and not grading
their language?
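Where a transcript (even a rough one) and the clip's duration are available, the speech rate of a candidate sound file can be estimated in a few lines. The sketch below is purely illustrative and is not part of the procedures described in this book; the function name, the sample figures and the 90-second example are all invented.

```python
# Illustrative sketch: estimate the average speech rate of a sound file
# from a transcript word count and the clip duration in seconds.
# All names and figures here are invented for the example.

def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Return the average speech rate in words per minute."""
    word_count = len(transcript.split())
    return word_count / (duration_seconds / 60)

if __name__ == "__main__":
    sample_transcript = " ".join(["word"] * 240)   # stands in for a 240-word transcript
    print(f"{words_per_minute(sample_transcript, 90):.0f} wpm")  # 240 words in 90 seconds -> 160 wpm
```

A figure obtained in this way can then be noted alongside the other characteristics of the sound file in the test specifications (see 2.5.1), making it easier to compare candidate recordings aimed at the same target level.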
The more voices there are on a sound file, and the more overlap there
is between them, the more difficult it becomes for the second language
listener to discern who is saying what. This is particularly true if more
than one of the voices is female. Both these issues must be taken into
account when determining the difficulty level of a particular sound
file.
1.7 Summary
This chapter has attempted to outline the importance of having a clear
idea of what is involved in assessing listening before any attempt is made
to try to measure the skill. It has also investigated the different types of
listening that we engage in, how the spoken and written language differ
and the impact this can have in terms of successful listening. The issues
which contribute to making listening difficult were also explored as well
as the importance of assessing listening.
The subsequent chapters of this book investigate how we can move from
this rather abstract concept of what listening involves to the somewhat
more concrete manifestation of a listening task. Each chapter discusses
one or more of the various stages a task should go through before it
can be used in a live test administration. Figure 1.2 illustrates the stages
which occur within this task development cycle:
stage will go forward to the field trial (Stage 6a); those which do not must
be dropped (Stage 6b). Inevitably, not every task will be successful, particu-
larly in the early stages of test developer training; this is one of the lessons
that both reviewers and test developers have to learn to accept.
The next stage in the task development cycle is the field trial (Stage 6a, see
Chapter 6). Prior to the trial taking place, some test developers may also be
involved in task selection for the trial test booklets (see 6.2.4) while others may
have the opportunity to take part in administering the trial, perhaps within
their own school or workplace. Invaluable insights come from the experience
of watching test takers respond to their own and/or their colleagues' tasks.
Wherever possible, test developers should be encouraged to participate in
marking the field trial test papers (Stage 7) as again this will provide useful
feedback concerning how their tasks have performed (see 6.2.6).
Once all the trial papers have been marked, it is time for Stage 8:
statistical analyses. It is strongly recommended that all test developers be
involved in this procedure as it is extremely helpful in explaining how
their tasks have performed and why some have succeeded and others
have failed (see 6.3.1 and Green 2013). In addition, probably for the
first time in the task development cycle, this stage also provides external
perceptions of the tasks in the shape of the test takers' feedback on such
aspects as the sound files, instructions and tasks as well as how the test
was administered (see 6.1.9).
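The statistical analyses referred to at Stage 8 are, at their simplest, classical item statistics. The sketch below is a minimal illustration of how item facility, corrected item-total discrimination and Cronbach's alpha might be computed for a small, dichotomously scored task; it is not the author's procedure, the response data are invented, and operational analyses would normally be run in dedicated software.

```python
# Illustrative sketch only: classical item statistics for a dichotomously
# scored (0/1) listening task. The response matrix below is invented.

def item_statistics(responses):
    """responses: one list of 0/1 item scores per test taker."""
    n_items = len(responses[0])
    totals = [sum(person) for person in responses]

    def mean(xs):
        return sum(xs) / len(xs)

    def variance(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    per_item = []
    for i in range(n_items):
        scores = [person[i] for person in responses]
        facility = mean(scores)                                 # proportion answering correctly
        rest = [t - s for t, s in zip(totals, scores)]          # total score minus this item
        ms, mr = mean(scores), mean(rest)
        cov = sum((s - ms) * (r - mr) for s, r in zip(scores, rest)) / (len(rest) - 1)
        denom = (variance(scores) * variance(rest)) ** 0.5
        discrimination = cov / denom if denom else 0.0          # corrected item-total correlation
        per_item.append((facility, discrimination))

    # Cronbach's alpha as a rough index of internal consistency
    item_var_sum = sum(variance([p[i] for p in responses]) for i in range(n_items))
    alpha = (n_items / (n_items - 1)) * (1 - item_var_sum / variance(totals))
    return per_item, alpha

responses = [                      # five test takers, four items (invented data)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
stats, alpha = item_statistics(responses)
for i, (facility, discrimination) in enumerate(stats, start=1):
    print(f"Item {i}: facility={facility:.2f}, discrimination={discrimination:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")
```

In practice, items showing low or negative discrimination in such an analysis would be among those flagged for revision or dropping at Stage 9.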
Stage 9 entails making one of three decisions concerning each and
every task which has gone through the field trial, based on the outcome
of the statistical analyses (Stage 8). The first option is that the task should
be banked with no changes and go forward to standard setting (see 7.2
and Stage 13) if this procedure is part of the task development cycle. The
second option is that the task should be revised. This is usually due to
some weakness which has come to light during the data analysis stage (see
6.3.2). The third option is that the task should be dropped as it has been
found to be unsalvageable for some particular reason (weak statistics,
negative feedback, inappropriate topic, though the latter should have
been picked up long before the trial). For every task which is dropped, it
is important that the test developers learn something from the exercise;
not to do so would mean a waste of resources.
Stage 9b involves the revision of those tasks which were not banked or
dropped; this stage is similar to that of Stages 3 and 4, as it will involve
some peer review. Once the revised tasks are ready, they move to Stage 10,
which is Trial 2. (Other newly developed tasks can obviously be trialled
at the same time as the revised tasks.)
Stages 11 and 12 are a repeat of Stages 7 and 8, only this time there are
just two options available for those tasks which have already been revised.
These are bank or drop. The decision to drop a task which has been tri-
alled twice, and failed to meet requirements, is a practical one. Trialling,
marking and carrying out statistical analyses are time-consuming and
expensive. One exception some test development teams make is if there has
been a test method change after the first trial; that decision must depend on
the resources you have available. Experience, however, suggests that if a task
does not work after going through all of the above stages, including two
periods of peer review and two trials, it is probably not going to work. This
outcome has to be accepted, and lessons learnt for future task development.
Stage 13 involves submitting those listening tasks which have been
banked, to an external review process known as standard setting (see 7.2)
or to a stakeholder meeting (see 7.3). Not all test development teams
will be able to organise a standard setting session due to the resources
necessary to carry out this process (see 7.2.3-7.2.9), but for those test
developers who are involved in high-stakes testing or national tests, this
is a procedure you should at least be aware of, and preferably be involved
with. Those tasks which receive a green light from the judges in standard
setting are usually deemed eligible for consideration in a live test admin-
istration (Stage 14). Invaluable insights can be gained from the standard
setting procedure which can be fed back into test developer training.
The final stage of the task development cycle entails the writing of the
post-test report and statistical analyses of the live test results (Stage 15).
For reasons of accountability and transparency among others, it is impor-
tant that a post-test report be drawn up after the live test administration.
This should provide information about where and to whom the live test
was administered, as well as including the results of a post-test analysis
of the items and tasks. Although all the tasks which go into the live test
should already have good psychometric properties, it is still important to
analyse how they have performed in a real-test situation. Remember, no
matter how much care has been taken in selecting the trial test popula-
tion (see 6.2.1), the conditions can never be exactly the same. The test
takers who take part in the live test are much more highly motivated than
those who took part in the trial. It is important to verify that the statisti-
cal properties on which the tasks were chosen still hold true: in other
words, that the items still discriminate and contribute positively to the
internal consistency of the test (see 6.3.2.2 and 6.3.2.4). These post-test
insights will be of great benefit for the test developers and their future
task development work which, once the administration of the live test is
over, very often will start once more.
Not everyone reading this book will be able to carry out all of these
stages. In many cases, even where test developers would like to do this,
the challenges and constraints (Buck 2009) of their testing context will
make some stages very difficult to achieve. The important thing is to
attempt to do as many as possible.
DLT Bibliography
Alderson, J.C., Clapham, C., & Wall, D. (1995). Language test construction and
evaluation. Cambridge: CUP.
Baddeley, A. (2003). Working memory: Looking back and looking forward.
Nature Reviews Neuroscience, 4, 829-839.
Banerjee, J., & Papageorgiou, S. (2016). What's in a topic? Exploring the inter-
action between test-taker age and item content in high-stakes testing.
International Journal of Listening, 30 (1-2), 8-24.
Brown, G., & Yule, G. (1983). Teaching the spoken language. Cambridge:
Cambridge University Press.
Brunfaut, T. (2016). Assessing listening. In D.Tsagari & J.Banerjee (Eds.), Handbook
of second language assessment (pp.97-112). Boston: De Gruyter Mouton.
Buck, G. (2001). Assessing listening. Cambridge Language Assessment Series.
Eds. J.C. Alderson and L.F. Bachman. Cambridge: CUP.
Buck, G. (2009). Challenges and constraints in language test development. In
J.Charles Alderson (Ed.), The politics of language education: Individuals and
institutions (pp.166-184). Bristol: Multilingual Matters.
Bygate, M. (1998). Theoretical perspectives on speaking. Annual Review of
Applied Linguistics, 18, 20-42.
Chafe, W.L., & Danielewicz, J. (1987). Properties of spoken and written lan-
guage. In R.Horowitz and S.Jay Samuels (Eds.), pp.83-113.
Fehérváryné, H. K., & Pižorn, K. Alderson, J. C. (Series Ed.). (2005). Into
Europe. Prepare for modern English exams. The listening handbook. Budapest:
Teleki László Foundation.
a number of factors about the test takers. For example, their age, in terms
of the degree of cognitive processing the materials may require; compare
young learners with adult test takers, for instance. Age will also have some
bearing on the type of topics that are chosen. In addition, the test takers'
gender, first language and location should also be taken into account to
ensure that the materials chosen contain no potential sources of bias. For
example, those living in an urban environment may have an advantage
if some of the sound files are based on specific subjects which are not so
familiar to those who live in rural areas.
Put simply, the construct is the theory on which the test is based. To
expand on this a little, if you are designing a listening test, it is the defi-
nition of what listening is in your particular context: for example, an
achievement test for 11-year-olds, a proficiency test for career diplomats
and so on. Once defined, this construct (or theory) has to be transformed
into a test through the identification of appropriate input and the devel-
opment of a suitable task. Clearly, the definition of what listening is will
differ according to the purpose of the test and also the target test popula-
tion. The construct on which a listening test for air traffic controllers is
based, for example, will be quite different from one which would be used
in a test for young learners.
Defining the construct accurately and reliably is arguably one of the
most important responsibilities of test designers. This is because during
the development of the test specifications and tasks, they will need to
collect validity evidence to support their definition of the construct. This
evidence can be of two kinds: the non-empirical type (Henning 1987;
or interpretative argument, Haladyna and Rodriguez 2013); and the
empirical type based on quantitative and qualitative data (see Chapter 6).
The test designers also need to be aware of the two main threats to con-
struct validity: construct under-representation and construct irrelevant
variance. (These terms are discussed below.)
The construct can be based on a number of sources. For example, in
the case of an achievement test, insights can be gained from the curricu-
lum, the syllabus or the national standards. The construct could also be
based on a set of language descriptors such as those found in the Common
European Framework of Reference (CEFR), in the Standardization Agreement
(STANAG) used in the military field or on the descriptors developed by
the International Civil Aviation Organization (ICAO) for use with air
traffic controllers and pilots, to name just a few. A third source might be
the target language situation. In this case, the construct could be based
on a set of descriptors outlining the types of listening behaviour test tak-
ers would need to be able to exhibit in a given context. For example, the
listening skills perceived to be necessary to cope with tertiary level studies
or employment in an L2 context. Finally, the construct could be based
on a mixture of these sources, for example, the school curriculum, the
national standards and the CEFR descriptors.
Figures 2.1 to 2.3 below show extracts from different sets of language
descriptors. Figure 2.1 shows the descriptors for CEFR Listening B2.
(The acronyms at the end of the descriptors represent the names of the
tables from which they have been taken, for example, OLC = Overall
Listening Comprehension.)
Figure 2.2 shows the descriptors pertaining to STANAG Level 1
Elementary.
Figure 2.3 displays the descriptors relevant for assessing a test taker's
listening ability at ICAO Level 4 Operational.
These three sets of descriptors offer test developers useful insights into
the types of listening behaviour expected at those levels, as well as provid-
ing additional information about the conditions under which listening
takes place (part two of the test specifications; see 2.5). For example,
in terms of what the listener is expected to be able to comprehend, the
[Figure 2.1 CEFR Listening B2 (extracts): Can follow extended speech and complex lines of argument provided the topic is reasonably familiar and the direction of the talk is sign-posted by explicit markers (OLC); Can understand most radio documentaries and most other recorded or broadcast material delivered in standard dialect and can identify the speaker's mood, tone etc. (LAMR).]
[Figure 2.2 STANAG Level 1 Elementary (extracts): Can understand common familiar phrases and short simple sentences about everyday needs related to personal and survival areas such as minimum courtesy, travel, and workplace requirements when the communication situation is clear and supported by context; There are many misunderstandings of both the main idea and supporting facts.]
[Figure 2.3 ICAO Level 4 Operational: listening comprehension descriptors.]
descriptors help to define the level the test should measure, and, by extension, what is above and below that level in terms of the
expected construct, topic(s), speaker characteristics and discourse structure.
Unfortunately, language descriptors, as well as other sources such as the
curriculum and the national standards, do not always describe the various
types of listening behaviour in sufficient detail for them to assist in test
design. In such situations, it is useful to add a further set of definitions
which describe the different types of listening behaviour in more practical
terms. Field (2013: 149) supports this approach, saying 'even a simple
mention of listening types using listening for categories or the param-
eters local/global and high attention/low attention might provide useful
indicators'. Such additional descriptors could be added to the test speci-
fications under a separate heading as shown in Figure 2.4 (see also 4.1):
[Figure 2.4: additional listening-behaviour definitions for the test specifications, e.g. listening for gist (constructing a macro-proposition), listening for important details (listening selectively to identify words/phrases), search listening (SL) (listening for words in the same semantic field), and listening for main ideas and supporting details (listening carefully in order to understand explicitly stated ideas).]
2.5.1 Input
2.5.1.1 Source
even talking to bullet points. You should therefore always allow two
or three attempts for the speakers to warm up, so that the recording comes
across as naturally as possible.
Finding readily available listening input is particularly difficult at the
lower level of the ability spectrum. The development of talking points
as the basis for creating sound files, although detracting from cognitive
validity (Field 2013: 110), is one possible solution when simply no other
materials are available. Talking points provide speakers with some sort of
framework within which they can talk about topics which are appropriate
for lower ability levels while at the same time allowing for at least some
degree of spontaneity. The framework should be based on an imaginary
listening context in order to encourage appropriate linguistic features and
not on a written text.
The challenge in developing talking points is to provide just enough
key words for the speakers to produce naturally spoken language while
simultaneously avoiding either a scripted dialogue or a framework which
is too cryptic. Speakers who are asked to work on talking points may
need some initial practice; to help them, it is recommended that the
talking points appear in a table form so that it is clear who says what
when (see Figure 2.5). Once recorded, these can then be textmapped (see
Chapter 3), and a task developed.
[Figure 2.5: talking points laid out as a two-speaker table: 'John, shop?' / 'OK. Need?' / 'Bread, eggs' / 'Eggs …?' / 'Large, small?' / 'Large. Money …']
2.5.1.2 Authenticity
What makes a sound file authentic? This is not an easy question to answer
(see Lewkowicz 1996). A speech given by a high-ranking diplomat which
exhibits many written characteristics is no less authentic than a conversa-
tion which reflects more oral features, such as pauses, hesitations, back-
tracking and redundancies. They are both parts of the oral to written
continuum from which test developers might select their sound file mate-
rials. What makes it more or less authentic is its appropriateness to the
given testing context. For example, using the speech mentioned above as
part of a test for diplomats would carry a lot of cognitive (and face) valid-
ity (even more so if the speech maker is physically present) but this would
not be true if it were used in a test for air traffic controllers. So part of the
authenticity argument has to be the extent to which it relates to the target
test population as well as the purpose of the test.
Let us look at some more examples. Is a sound file exhibiting a range
of non-standard accents authentic? Answer: yes, you would definitely
come across this scenario in a university or joint military exercise context.
Could it be used in testing? Answer: yes, if that is what test takers would
be faced with in the real-life listening context. What about the relation-
ship between authenticity and the speed of delivery? Would a sound file
with two people talking at 180 words per minute be considered authen-
tic? Answer: yes for higher-level listeners, but arguably no for lower-
level ones, as we would not expect someone of that level to be able to
cope with it. All of these examples argue for not divorcing authenticity in
a sound file from the context in which it will be used.
The key question test developers need to ask themselves is whether
the language and its related characteristics (accent, speed of delivery,
degree of oral features and so on) reflect a real-life speaking and listening
event. Many of the recordings to be found on EFL websites do not meet
these criteria; this is because the materials have often been developed
with the purpose of language learning and as such the speed of delivery
has often been slowed down or the language simplified artificially. If your
aim in developing a listening test is to obtain an accurate picture of your
test takers' ability to understand real-life input, then it is strongly recom-
mended that these sources be avoided (see Fehérváryné and Pižorn 2005,
Appendix 1, 2.1.2).
When selecting sound files remember that it is not necessary that every
word be familiar to the target test population; provided that the unknown
words are not seminal to understanding the majority of the sound file
(if this is the case, it should be picked up during the textmapping
procedure; see Chapter 3), this should not be a problem. On the other
hand, where there are a significant number of new or unfamiliar words,
the listener is likely to be overwhelmed very quickly and processing is
likely to break down.
Although test takers (and some teachers) may initially react in a nega-
tive way to the use of authentic sound files in listening tests, by using
them we are not only likely to get a more reliable test result but also add
validity to the test scores. As Field (2008: 281) states:
A switch from scripted to unscripted has to take place at some point, and may,
in fact, prove to be more of a shock when a teacher postpones exposure to authen-
tic speech until later on. It may then prove more, not less, difficult for learners to
adjust, since they will have constructed well-practised listening routines for
dealing with scripted and/or graded materials, which may have become
entrenched.
2.5.1.3 Quality
In real life listening, we sometimes have to struggle with input that is not
at all clear; announcements, especially those on planes, are often indis-
tinct or distorted. We have to ask ourselves, though, whether it would
be fair to assess our test takers' listening ability under such conditions.
While this may be appropriate in some professions (those working in
the aviation field, for example, do have to be able to understand unclear
speech), for the majority of test takers this is not the case, and there
should be a clearly justifiable reason for including sound files that fall
into this category in a test.
Background noise, on the other hand, is ubiquitous and to avoid
including at least some sound files with background noise in a test would
not be reflecting reality. What the test developer has to determine is
Obviously, the sound file must be in line with the targeted level of the
test. Due to the difficulties involved in finding appropriate sound files,
some test developers resort to using a sound file which is easier and make
up for this by producing items which are more difficult. Thus when the
sound file and items are combined they represent the targeted level. This
procedure means, however, that it is the items that have become the focus
of the test rather than the sound file itself. In reality, it should be the
sound file that is the real test the task is merely a vehicle which allows
the test developer to determine whether the test takers have compre-
hended it. Field (2013: 141, 144) cautions test developers against using
this procedure:
The fact is that difficulty is being manipulated by means of the written input
that the test taker has to master rather than by means of the demands of the
auditory input which is the object of the exercise. … item writers always face
a temptation, particularly at the higher levels, to load difficulty onto the item
rather than onto the recording.
Similarly, if the sound file is, for example, B2 but the items are B1,
the construct is unlikely to be tested in a reliable way, as the items
are not targeting the listening behaviour at the appropriate level. Of
course, it must be acknowledged that it is very difficult to ensure that
all items in a B2 task are targeting B2; in fact, it is more than likely
that in a task consisting of eight items, at least one is likely to be either
a B1 or a C1 item. This is where procedures such as standard setting
and establishing cut scores are very useful (see 7.2) as these items can
then be identified.
2.5.1.5 Topics
starts with the speaker providing a clear overview of the areas s/he is going
to touch on, and which then proceeds to use clear discourse markers, is
felt to be easier than one where the speaker meanders through the talk
with apparently little direction and includes multiple asides. However,
Révész and Brunfaut (2013) report that the few research studies which
have explored the effect of cohesion on listening difficulty have produced
mixed findings.
There are a number of reasons for including more than one sound file
in a test. First of all, including several sound files means you can expose
test takers to different discourse structures, topics and speakers. Secondly,
each new sound file provides the test taker with a fresh opportunity to
exhibit his/her listening ability; thus, if for some reason a test taker reacts
poorly to one particular sound file, there will be another opportunity to
exhibit his/her listening ability. Thirdly, using more sound files in a test
makes it possible to use different sound files for different types of listen-
ing behaviour (see Chapter 3). Fourthly, the inclusion of a number of
sound files is likely to reduce the temptation to overexploit a single sound
file by basing all the listening items on one piece of input.
Test developers need to decide whether the test will use only sound files
or video clips as well, and whether these should be of the talking head
variety and/or content-based. These issues were discussed in 1.5.3.4. The
decision as stated there is often a practical one; to make it fair to all,
the test takers need to have equal access to the input, ideally provided
through individual screens at the desk where they are taking the test.
This, in many testing situations, is simply not a practical option.
A convincing case can be made for both approaches, depending upon factors
such as test purpose, cognitive demand, task consistency, sampling and practi-
cality, all of which reflect the need to balance competing considerations in test
design, construction and delivery.
Let's look in more detail at some of the issues involved. First of all, we
need to ask ourselves to what extent will listening once or twice impact
on the type of listening behaviour employed by the listener, and, by
extension, what effect will that have on the cognitive validity of the test?
Fortune (2004) suggests that listeners tend to listen more attentively if
they know they are only going to hear the input once. Reporting on
research carried out by Buck (1991) and Field (2009), Field (2013: 127)
suggests that test takers carry out different types of processing (lower-
and higher-level) when given the opportunity to listen twice. On the
first listening, they are establishing the approximate whereabouts of the
relevant evidence in the sound file and possibly making initial links with
one or more of the items. On the second listening, the actual position
of the information is confirmed and the initial answer(s) reviewed and
either confirmed or changed. Field also adds that, given the cognitive
demands on the test taker (processing the input and confirming/eliminat-
ing distracters) plus the lack of visual and paralinguistic clues, this
argues for being able to listen twice, as it goes way beyond the cognitive
demands of the real-life listening context.
On the other hand, where test takers simply need to identify specific
information or an important detail in a sound file, it seems reasonable to
argue that this should be achievable on the basis of listening once only.
The amount of content that needs to be processed in order to complete
an item is much less, and from a processing point of view should be less
demanding, than trying to infer propositional meaning. Where test takers
are allowed to listen twice, it becomes very difficult for the test developer to
create such selective listening items at higher levels of ability as the test tak-
ers know they will hear it all again if they miss the required information on
the first listening (see the discussion on Task 5.6, Chapter 5). This, in turn,
can result in the test developer making the items more difficult than they
should be by targeting more obscure (and possibly less important) details.
A second issue which should be considered is that playing every sound
file twice in a listening test takes up a lot of time, and consequently means
that there will be less time for other sound files. This could impact on the
construct coverage, as there may be insufficient time to play a range of
sound files targeting different types of listening behaviour and reflecting
different input types, topics and discourse styles.
Thirdly, and the oft-quoted argument, is that in real life we rarely listen
to the same sound file twice unless it is something we have downloaded
from the internet and/or been given for study purposes. Even in situ-
ations where we are able to ask for clarification from the speaker, s/he
generally reformulates what has been said in order to make the message
clearer. There are also many occasions where even if we do not hear the
input again, we can manage to complete any gaps by using our ability to
infer meaning.
Having said all of the above, there are, of course, counterarguments.
In real life listening, we are not usually asked to simultaneously com-
plete what can be a detailed and demanding task, potentially including a
2.5.2 Task
The test specifications should define how the instructions are written; for
example, clear, simple and short instructions. They should also indicate
the language in which the instructions should be presented, that is L1
or L2, and whether an example should be included (see 4.2 for argu-
ments regarding this issue as well as the importance of using standardised
instructions).
The test methods which are felt to be suitable for testing listening need to
be agreed upon and added to the test specifications. Due to the fact that
there is no written text for test takers to refer to, the role of memory must
be carefully considered:
The total number of listening items needed depends on the type of test
that is being developed. For example, if the test is a uni-level test, that
is, with just one difficulty level being targeted, the number is likely to be
fewer than if it is a bi-level (two levels, say B1-B2) or a multi-level test
such as might appear in a proficiency test which has been developed to
handle a heterogeneous test population.
The targeted level of difficulty will also impact on the number of items;
the higher the level of proficiency, the more complex the construct is
likely to be, and thus the need for more items reflecting the different types
of listening behaviour that it will attempt to measure. The purpose of the
test (achievement versus proficiency) and the degree of stakes involved
(classroom test versus university entrance test) should also be taken into
account. Based on a wide range of test development projects, experience
has shown that at the higher end of the learners' ability spectrum, 25 to
30 well-constructed test items should provide a reasonable idea of a test
takers listening ability. At the lower end, where the test construct is less
diverse, 10 to 15 items may be sufficient.
On the issue of how many items there should be in a task, many test
development teams feel that there should be a minimum of five items in
order to make efficient use of the time available in the listening test. This
would mean that in order to assess listeners' ability to identify the gist,
a number of snippets would need to be included in one task in order to
have a sufficient number of items (see Into Europe Assessing Listening
Task 44 for an example of this kind of task).
The number of tasks, like the number of items, will depend on whether
you are aiming to develop a uni-level, a bi-level or a multi-level test. It
will also be linked to the level of difficulty: the higher levels of abil-
ity will require more tasks due to the complexity of the construct being
targeted. For example, if you wish to develop 25 to 30 items, four tasks
with approximately seven to eight items in each would be optimal (see
also 4.3).
The final part of the test specifications focuses on the criteria of assess-
ment that raters employ when marking the test takers' responses. In
listening this is generally much less complex than it is for speaking or
writing as no rating scale per se is needed. The key should, however, be
as complete as possible. Field trials (see Chapter 6) help enormously in
terms of providing alternative answers to the key for short answer items;
trials can also be useful in putting together a list of the most common
unacceptable answers. This should speed up the time needed to rate the
answers and should also increase marker reliability.
2.7 Summary
Many of the issues raised in this chapter will be revisited in Chapter 3,
which looks at a procedure that can be used to exploit sound files, and
Chapter 4, which takes the results of those procedures and explores how
they can be transformed into tasks.
To complete this chapter on the issue of how test specifications can
help, Figure 2.6 provides a summary of the type of information you
should have answers to before beginning any work on task development:
[Figure 2.6: summary of the information the test specifications should provide, e.g. overall purpose (to assess the test takers' ability at level X, in accordance with the construct), input characteristics (including background noise), the test methods to be used (e.g. multiple choice), and the instructions (clarity, inclusion of an example).]
DLT Bibliography
Alderson, J.C. (2000). Assessing reading. Cambridge, UK: Cambridge University
Press.
Alderson, J.C., Clapham, C., & Wall, D. (1995). Language test construction and
evaluation. Cambridge: CUP.
Bachman, L.F. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Buck, G. (1991). The testing of second language listening comprehension.
Unpublished PhD thesis, University of Lancaster, Lancaster, UK.
Buck, G. (2001). Assessing listening. Cambridge Language Assessment Series.
Eds. J.C. Alderson and L.F. Bachman. Cambridge: CUP.
Davidson, F., & Lynch, B.K. (2002). Testcraft: A teacher's guide to writing and
using language test specifications. New Haven: Yale University Press.
Ebel, R.L. (1979). Essentials of educational measurement (3rd ed.). Englewood
Cliffs, NJ: Prentice-Hall.
Ebel, R.L., & Frisbie, D.A. (1991). Essentials of educational measurement (5th
ed.). Englewood Cliffs, NJ: Prentice-Hall.
Fehérváryné, H. K., & Pižorn, K. Alderson, J. C. (Series Ed.). (2005). Into
Europe. Prepare for modern English exams. The listening handbook. Budapest:
Teleki László Foundation. See also http://www.lancaster.ac.uk/fass/projects/
examreform/Media/GL_Listening.pdf
Field, J. (2008). Listening in the language classroom. Cambridge: Cambridge
University Press.
Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.),
Examining listening. Research and practice in assessing second language listening
(pp.77-151). Cambridge: CUP.
Fortune, A. (2004). Testing listening comprehension in a foreign language: Does
the number of times a text is heard affect performance? MA Thesis, Lancaster
University.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment. NewYork:
Routledge.
Geranpayeh, A., & Taylor, L. (Eds.) (2013). Examining listening. Research and
practice in assessing second language listening. Cambridge: CUP.
Griffiths, R. (1992). Speech rate and listening comprehension: Further evidence
of the relationship. TESOL Quarterly, 26, 283-391.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test
items. Oxon: Routledge.
Harding, L. (2011). Accent and Listening Assessment. Peter Lang.
Harding, L. (2012). Accent, listening assessment and the potential for a shared-
L1 advantage: A DIF perspective. Language Testing, 29, 163.
Henning, G. (1987). A guide to language testing: Development, evaluation,
research. Cambridge, MA: Newbury House.
Lewkowicz, J.A. (1996). Authentic for whom? Does authenticity really matter?
In A. Huhta, V. Kohonen, L. Kurki-Suonio & S. Luoma (Eds.), Current
developments and alternatives in language assessment. Proceedings of LTRC,
pp.165-184.
Révész, A., & Brunfaut, T. (2013). Text characteristics of task input and diffi-
culty in second language listening comprehension. Studies in Second Language
Acquisition, 35 (1), 31-65.
Tauroza, S., & Allison, D. (1990). Speech rates in British English. Applied
Linguistics, 11, 90-195.
White, G. (1998). Listening. Oxford: Oxford University Press.
3
How do we exploit sound files?
decisions about the sound files they want to use based on their own indi-
vidual teaching needs and interests, or the perceived needs of their students.
My second question focuses on whether as test developers they have
ever faced any problems with the procedure(s) they have followed. Their
answers are usually in the positive and are associated with their students not
being able to answer the questions for one reason or another; or producing
totally different responses from those which had been expected. My third
question is aimed at finding out whether their colleagues would target the
same part(s) of the sound file if they wanted to use the same sound file
to develop a task. The responses on this occasion are often rather vague
and unsure, possibly because, for practical reasons, many test developers
and teachers tend to create their own tasks and rarely work in teams. My
fourth question then asks them to consider whether listeners in general
would target or rather take away the same information. Responses sug-
gest that the test developers are not sure that everyone would take away
the same information and/or details when listening to a sound file.
In light of the last response, my final question to the test developers
focuses on whether different listeners taking away something different
from a sound file is a problem. The test developers usually confirm that
if this happened in a teaching situation it would be seen as productive,
as it could lead to discussion among the students. They add, however,
that in a testing scenario it could be problematic in terms of determining
which interpretations should be considered right and which should be
considered wrong.
Research in the 1980s into how the meaning of a written text was
constructed by the reader suggested a continuum ranging from a passive
acceptance of the ideas in the text to an assertive rewriting of the author's
message (Sarig 1989). This differing approach to texts, and by extension
to sound files, has obvious implications for test development in terms
of deciding which interpretations made by a reader or a listener can be
accepted as being correct and which incorrect.
While the argument put forward by Sarig (1989: 81) that More leeway
should be left for interpretations which never occurred to test developers
seems a reasonable one, it should perhaps take into account Alderson
and Short's (1981) belief that although individual readers may interpret
a text in slightly different ways, a consensus among readers would help
to define the limits on what a given text actually means. This position
is also supported by Urquhart and Weir (1998: 117) who argue that,
'When constructing test tasks, testers need to operate with a consensus as to
what information readers may be expected to extract from a text'. Nuttall
(1996: 226) suggests that 'a useful technique for deciding what mean-
ing to test in a text is to ask competent readers to say what the text means'.
Experience has indeed shown that involving students in such a process is
highly informative for the teacher and/or test developer as well as enjoy-
able for the students.
3.2 A procedure for exploiting sound files: Textmapping
So what is textmapping? Textmapping is a systematic procedure which
involves the co-construction of the meaning of a sound file (or text). It is
based on a consensus of opinion as opposed to an individual interpreta-
tion of a sound file (or text). It uses the sound file and not the transcript
as the basis for deciding what to focus on, since working from the transcript
encourages less attention being paid to what a listener, as opposed to a
reader, might actually take away. In addition, as there are no time indicators in a tran-
script, the reader has no real idea of the speech rate of the speaker(s)
or the amount of redundancy present, and is completely unaware of the
extent to which words may have been swallowed or not stressed by the
speaker(s). As Lynch (2010: 23) states:
a transcript and the speech it represents are not the same thing, the original is a
richer, contextualized communicative event.
Further support for this approach comes from Field (2013: 150):
It is also important that the physical recording rather than the script alone
should form the basis for the items that are set, enabling due account to be
taken of the relative saliency of idea units within the text and of aspects of
the speaker's style of delivery that may cause difficulty in accessing
information.
It is crucial that test writers map a text whilst listening to it in advance of writ-
ing the questions in order to ensure they do not miss out on testing any of the
explicit or implicit main ideas or important details, where this is the purpose of
the listening exercise.
an item. Nor is it their job to decide whether something in the sound file is
so obvious that it can never be tested, and thus choose not to write it down.
Such decisions come later. The textmapper's job is simply to document what
they take away from a sound file while employing the type of listening behav-
iour they have been asked to use by the person who provided the sound file.
So how does it all work? Sections 3.3, 3.4, and 3.5 describe the procedures
that should be followed when textmapping for gist, specific information
and important details, and main ideas and supporting details respectively.
Identifying the gist of a sound file basically requires the listener to synthe-
sise the main ideas or arguments being put forward in order to come up
with the overall idea the speaker is attempting to get across. For example,
the listener might be asked to identify the gist of a report on a recent
natural disaster, or that of a short speech made by the principal at the
beginning of the academic year, or someone's overall opinion of a newly
introduced agricultural policy. Inviting a small group of textmappers to
do this helps to minimise any individual idiosyncrasies that might have
been taken away by a single test developer.
Before starting the textmapping process, however, it is first of all essential
to check everyone's understanding of the term gist, as this is very often
confused with the terms topic and/or summary. The most practical way
to do this is to focus on the number of words that are likely to be involved.
For example, it could be argued that the topic is often summed up in just
two or three words; a summary, on the other hand, usually requires a num-
ber of sentences; while the overall idea often needs something in between in
terms of length. Asking textmappers to use between 14 and 20 words (10
words minimum) often helps to guide them towards identifying the gist,
rather than the summary or the topic. (The number of words will of course
depend to some extent on the length and density of the sound file used.)
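Expressed as a rough rule of thumb, this word-count heuristic can be sketched in a few lines of Python (purely illustrative; the function name and the exact cut-off points are assumptions based on the figures above, not part of the textmapping procedure itself):

def classify_by_length(textmap_response):
    # Rough heuristic based on the word counts discussed above:
    # a topic is usually two or three words, a gist roughly 10-20 words,
    # and anything much longer starts to look like a summary.
    words = len(textmap_response.split())
    if words <= 3:
        return "probably a topic"
    elif 10 <= words <= 20:
        return "probably a gist"
    elif words > 20:
        return "probably drifting towards a summary"
    else:
        return "borderline - discuss with the group"

print(classify_by_length("Earthquake in Peru"))  # probably a topic
print(classify_by_length("A major earthquake in Peru destroyed buildings "
                         "and killed people; the Red Cross organised help quickly"))  # probably a gist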
Secondly, in order to encourage a focus on the gist of the sound file
rather than the details, the textmappers should be instructed that they
are not allowed to write anything down during the exercise. Thirdly, it is
important that they be made to understand the importance of remaining
quiet, not only while listening to the sound file, but immediately after-
wards when they write down the overall idea. This silence is crucial in the
textmapping procedure due to the amount of information the working
memory can retain at any one time. This content can easily be dislodged
by an inadvertent comment from one of the textmappers. Another reason
for remaining silent at this stage is to minimise any possible influence on
what an individual textmapper might write down.
Finally, just before beginning the textmapping session, it should be
made clear that there is no such thing as a right or wrong textmap;
it is more than possible that an individual textmapper could take away
something quite different from another due, for example, to their own
personal interpretation or reaction to the sound file. This does not make
it wrong, just different.
Once the textmappers are clear as to what they have to listen for, and
how they are going to do this, provide them with the context of the
sound file so that they can activate any relevant schema and not go into
the listening event cold. Then remind them of the key points (put them
on screen if possible) (see Figure 3.1).¹
The sound file should then be played once only regardless of how
many times it will be played in a future task. This is due to the fact that
repeated exposure would give the textmappers a far richer picture of the sound
file than any test taker could reasonably be expected to take away.
¹ The sound file for this example is Track 6, CD2 (Task 30) Into Europe Listening. For textmapping purposes, the sound file was started at the end of the instructions (at 30 seconds). The sound file can be found at: http://www.lancaster.ac.uk/fass/projects/examreform/Pages/IE_Listening_recordings.htm.
The next stage involves comparing what each listener has taken away from
the sound file to see whether there is a consensus. In textmapping, high
consensus is defined as n-1, so if there are six textmappers, five of them
(83 per cent) should have written down approximately the same thing.
Low but still substantial consensus (Sarig 1989) would constitute approxi-
mately 57-71 per cent agreement. Checking for consensus will obviously
involve some negotiation, as the textmappers will have used different
words in phrasing the gist due to the transient nature of the input as well
as influence from their own personal lexicons and background knowledge.
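To make these thresholds concrete, the following short Python sketch (illustrative only; the function name and the way the bands are coded are assumptions, not part of the procedure itself) classifies the level of agreement for a given number of textmappers:

def consensus_level(agreeing, total):
    # High consensus: at least n-1 of the textmappers agree.
    # Low but still substantial consensus: roughly 57-71 per cent agreement (Sarig 1989).
    proportion = agreeing / total
    if agreeing >= total - 1:
        return "high consensus"
    elif 0.57 <= proportion <= 0.71:
        return "low but substantial consensus"
    else:
        return "no consensus"

print(consensus_level(5, 6))  # five of six textmappers (83 per cent) -> high consensus
print(consensus_level(4, 6))  # four of six (67 per cent) -> low but substantial consensus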
The person who originally identified the sound file should collate the
results by asking each textmapper in turn what they have written down
and recording this verbatim. (The collator should remain silent about
his/her own textmap results until the very end of this process so as not
to influence the proceedings.) Textmappers should not change what they
have written in light of someone else's textmap. When the list of text-
maps is complete it might look something like the following:
[Figure 3.2: the gist textmaps collected from the group, for example: 'The Red Cross helps after a heavy earthquake caused major destruction …'; 'There was a major earthquake in South America lasting for about 2 minutes; buildings were destroyed, people were killed and injured; help was organised quickly.']
The textmappers should take a general look at these results and decide
whether or not there seems to be a consensus of opinion. Remember,
high consensus in textmapping constitutes n-1 so if there are six text-
mappers and only five have similar overall ideas this would still equate
to a consensus. Where textmappers feel that there is a consensus, they
should then be asked to look in more detail at the answers given in order
to identify communalities. For example, the highlighting in Figure 3.3
below shows a number of similarities across the textmaps.
The results reveal that where the textmappers have identified key words
(important details) as an essential part of the gist, for example, earthquake
or buildings, their answers are less varied as we would expect. However,
when it comes to describing what has happened (damage/destruction),
how strong the earthquake was (strong/massive/major/severe), or the aid
which was involved (Red Cross/help/emergency operations/rescue), there
[Figure 3.3: the same gist textmaps reproduced with the common elements highlighted.]
is more variation. This is partly due to the fact that as there is no written
word to rely on, listeners will employ different words based on their
personal schema and internal lexicons. Figure 3.4 shows the list of com-
munalities which suggests that there is consensus on the overall idea.
o earthquake
o damage / destroyed
o buildings
Where consensus has been achieved, the final steps in the textmap-
ping procedure involve deciding on an appropriate test method and
the development of the task itself. These issues will be dealt with in
Chapter 4.
1. Identify a suitable gist sound file. Make your own textmap the second time you listen to it.
2. Find at least three other people who have not been exposed to the sound file.
3. Explain that you want them to textmap the sound file for gist. Check their understanding of the term gist.
   o They should try to use between 14 and 20 words (depending on the length / density of the sound file).
5. Provide a general context to the sound file. Be careful not to give too much information away.
6. Play the sound file once only and then allow the textmappers time to write the gist.
7. Ask the textmappers to count the number of words they have written. This is useful in determining whether the textmappers have identified the gist or the topic.
8. The person who originally identified the sound file should then record what each textmapper has written. If this can be projected onto a screen so all can see, this helps; if not, gathering around the computer screen may also work.
9. The group should carry out an initial general review of the gists to see if there is a consensus.
10. Where this is not the case, it would suggest that the sound file does not lend itself either to gist or to one interpretation of the sound file. It may, however, be possible to use it for something else (see 3.5 Textmapping for Main Ideas below).
12. Identify the communalities across the textmaps (including any optional words if these occur). This list should form the basis of the targeted answer.
13. The textmap results should be added to a textmap table (see Figure 3.5 above).
14. A suitable test method should be identified and task development work should begin.
Where a task is to be based on a number of related short sound files (snippets), each snippet should be textmapped separately. Once all
the sound files have been textmapped, discuss the results in the same
way as in the 'Natural disaster' example above. If there is too much
overlap in the gist textmaps regarding two of the snippets, one of
them may have to be dropped. This procedure should not be used for
a continuous piece of spoken discourse where there is no logical reason
for segmenting it.
3.4 Textmapping for specific information and important details (SIID)
3.4.1 Defining the listening behaviour
[Figure 3.7: examples of specific information (for example, prices) and of important details.]
² The sound file for this example is taken from the VerA6 project, Germany and can be found on the Palgrave Macmillan website.
For the same reasons as mentioned above in the gist exercise, the textmap-
pers should be reminded of the importance of remaining quiet, not only
throughout the playing of the sound file but also immediately afterwards
when the textmappers write down the SIID they have taken away from
the sound file.
The sound file should be played once only regardless of how many
times it will be played in a future task. This is because overexposure to
the sound file is likely to result in more SIID being captured than any test
taker might fairly be asked to identify. Once everyone has completed his/
her list of SIID, it is useful to ask the textmappers to do two additional
things. Firstly, they should be asked to look through their lists and make
sure that the entries can be classified as specific information or important
details; ask them to refer to the information in Figure3.7 above or a simi-
lar list that you might have compiled. Anything not in the list needs to
be discussed (see 3.4.2) and if it is not SIID should be deleted. Secondly,
the textmappers should be asked how many entries they have managed
to write down. A smaller than expected number might be interpreted as
suggesting that the sound file does not really lend itself to SIID (or that
the textmapper has not textmapped for the right type of information).
A larger than expected number might mean that the list still contains
entries that are perhaps not what would be classified as SIID.For exam-
ple, there might be verbs or partial ideas in the list of entries that have
been written down.
As with gist, the next stage in the SIID textmap procedure involves com-
paring what each listener has written down to see whether a consensus
has been reached. This is likely to involve much less negotiation than gist,
as SIID tends to be more concrete. Textmappers sometimes have prob-
lems with remembering numbers accurately unless they can write them
down as they listen (see 3.4.5 for an alternative SIID procedure below)
and the test developer must use his/her discretion to decide whether to
accept very similar numerical combinations, given that in a real-life listening
situation the listener would rarely be expected to recall such figures exactly.
SIID Consensus
1. Dad 11/14
2. John 12/14
3. Airport 13/14
4. 30 minutes 13/14
5. Taxi 12/14
Once the results have been collated, the textmappers must decide whether
there are sufficient items to make it feasible to turn them into a task. In
order to do this, the distribution of the SIID within the sound file needs
to be taken into account. The easiest way to do this is shown in Figure 3.9
below:
Only those parts of the textmap being targeted in the example and
actual items should have information in the Target column. Thus
above, 0 (representing the example) is opposite John, and Q1 and
Q2 are opposite Airport and Taxi respectively. Six seconds is a rela-
tively short time between items but if the test method is a multiple
choice picture task where, for example, the test takers simply have to
recognise the correct venue and mode of transport, then it may prove
sufficient.
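By way of illustration, the kind of distribution check described above could be sketched as follows (the entries, timings and item labels are hypothetical, not those of the actual sound file):

# Hypothetical textmap entries: (entry, time in seconds from the start, target label or None)
textmap = [
    ("Dad", 4, None),
    ("John", 10, "0"),       # the example item
    ("Airport", 16, "Q1"),
    ("30 minutes", 19, None),
    ("Taxi", 22, "Q2"),
]

# Keep only the targeted points and report the time between consecutive ones
targeted = [(label, t) for entry, t, label in textmap if label]
for (prev_label, prev_t), (label, t) in zip(targeted, targeted[1:]):
    print(prev_label, "->", label, ":", t - prev_t, "seconds between items")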
The final step in the use of the SIID textmap results is deciding on
an appropriate test method and the development of the task itself (see
Chapter4).
1. Identify a suitable SIID sound file and produce your own textmap.
2. Find at least three other people who have not been exposed to the sound file.
3. Explain that you want them to textmap the sound file for SIID and check their understanding of what specific information and important details mean.
5. Provide a general context about the sound file. Be careful not to give too much information away.
6. Play the sound file once only and then allow the textmappers time to write a list of SIID.
7. Ask the textmappers to count the number of SIID they have written. This is useful in determining whether the sound file works for SIID and/or whether the textmappers have mapped for the appropriate type of information. They should also check that their entries can be classified as specific information or important details.
8. The first textmapper should be asked to read out an entry s/he has written down and the other textmappers asked if they have it. The total number should be written next to the entry, for example, Dad 11/14, so that a consensus can be verified or not. The second textmapper should then be asked for his/her next entry and the same process repeated.
9. The list of SIID and their degree of consensus should be discussed and a decision made as to whether the sound file provides a sufficient number of SIID to warrant making a task.
10. Where this is not the case, it would suggest that the sound file does not lend itself to SIID. It may, however, be possible to textmap it for something else (see 3.6 below).
11. The textmap results should be added to a textmap table (see Figure 3.10 above) and the time added in order to check for sufficient redundancy between potential items.
12. A suitable test method should be identified and task development work should begin.
As with gist, where there are a number of related short sound files, for
example, different messages on an answer machine, each sound file should
be textmapped separately, the SIID written down at the end of each one
and then the findings discussed sound file by sound file.
3.5 Textmapping for main ideas and supporting details (MISD)
3.5.1 Defining the listening behaviour
[Example illustrating a main idea and a supporting detail: 'Ferguson was a very skilful player in his youth. He was a top goal …']
The person responsible for the sound file should collate the textmaps by ask-
ing the first person in the group for the first main idea/supporting detail they
have written down. Once recorded, the others should be asked if they have
the same point and then the number of people, for example, 5/6, should be
added. It should be noted that this procedure involves some negotiation due
to the paraphrasing the various textmappers will have used. Those options
which have the same meaning should be accepted. The next textmapper
should then be asked for his/her next main idea/supporting detail and the
above process repeated. This method should be followed for all the MISD
that the textmappers have written down. Once again, it is possible that the
order in which the MISD are discussed will differ slightly among the text-
mappers; this can be rectified once the ideas are moved to the textmap table.

³ The sound file for this example is Track 4, CD1 (Task 21) Into Europe Listening. For textmapping purposes, the sound file was started at the end of the instructions (at 34 seconds). The sound file can be found at: http://www.lancaster.ac.uk/fass/projects/examreform/Pages/IE_Listening_recordings.htm.
While collating the results of the textmap, you may find a split in the
consensus (for example, 2:2) between those who have written down the
main idea and others who have identified the related supporting detail.
For example, in this particular sound file, some textmappers might have
written: 'She doesn't come from a rich family background' (= main idea)
while others might have identified: 'She saved her money for flying lessons'
(= supporting detail). Such a result would mean that there is no consen-
sus on either the main idea or the supporting detail. However, it seems
reasonable to argue that it was simply a personal choice as to which part
was written down and that where this happens the test developer could
combine the textmapping results and then decide which aspect to focus
on in the item.
Once all the MISD have been discussed, the textmappers again need to
review the total number of points on which consensus has been reached
in order to decide whether these are sufficient to make developing a task
worthwhile, taking into consideration the length of the sound file. If the
answer is in the positive, the next thing that needs to be checked is the
distribution of the textmapped points. Again, putting these into a table
helps. Unlike SIID, a main idea is likely to take more than one second
to be put into words. The complete amount of time
taken should appear in the table so as to provide as accurate a picture
as possible regarding the amount of time occurring between each of the
textmapped points:
[Figure 3.13: textmap table for the MISD example, listing the textmapped points with their timings, including fragments such as '… the moon', '… commander', '… the astronauts', '… a part-time job' and 'G. (At this age) the child doesn't realise the risk involved' (02.53-03.08).]
You will note that the above table includes points on which the text-
mappers did not have a consensus; some test developers find this useful
information to record so that they can avoid tapping into it when they
are developing items located nearby in the sound file. It also acts as a
reminder as to why a certain part of the sound file has not been targeted.
Figure 3.13 reveals that, in some cases, as one idea finishes, another
begins. This brings to light the issue of how much time test takers
need between items in order to complete them. The answer to this
is dependent on a number of factors. Firstly, the test method; for
example, if the test taker is confronted with a multiple choice item,
s/he may need more time due to the amount of reading involved, as
opposed to an item which simply requires a one-word answer, for
example, taxi in the SIID example above. Secondly, the type of lis-
tening behaviour; in general, items focusing on main ideas are likely
to require more redundancy than those focusing on SIID as more
processing time will be needed, especially if the task requires the test
takers to infer propositional meaning. Thirdly, the difficulty level of
the sound file and task, the type of content (concrete versus abstract)
and the topic will also impact on the amount of time needed. With so
many variables involved, it is very difficult to recommend an appro-
priate amount of time needed between items, and this is one of the
many reasons why peer review (see 4.5.1) and field trialling are so
important (see 6.1).
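Although, for the reasons just given, no single figure can be recommended, a team that has agreed a working minimum gap for a particular test method could run a rough check of the following kind over the textmap table (the timings, item labels and the 20-second threshold are all hypothetical):

# Hypothetical textmapped points: (item label, start time in seconds, end time in seconds)
points = [
    ("Q1", 35, 48),
    ("Q2", 50, 66),    # starts only 2 seconds after Q1 ends
    ("Q3", 95, 110),
]

MIN_GAP_SECONDS = 20   # a working minimum agreed by the team, not a recommendation

for (label_a, _, end_a), (label_b, start_b, _) in zip(points, points[1:]):
    gap = start_b - end_a
    if gap < MIN_GAP_SECONDS:
        print("Check", label_a, "->", label_b, ": only", gap, "seconds of redundancy")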
1. Identify a suitable MISD sound file and carry out your own textmap.
2. Find at least three other people who have not been exposed to the sound file.
3. Explain that you want them to textmap the sound file for MISD and check their understanding of what main ideas and supporting details mean.
   o Textmappers can write while listening and should try to use the words of the speaker(s) where possible.
6. Play the sound file once only and then allow the textmappers time to finish writing.
7. Ask them to read through what they have written, to finalise any notes and to confirm that what they have written down is MISD and not SIID or the gist.
8. Ask the textmappers to count the number of MISD they have written down. This is useful in determining whether the sound file has sufficient ideas on which to develop a task.
9. The first textmapper should be asked to read out the first MISD s/he has written and the other textmappers asked if they have it. The total number should be written next to the MISD, for example, 5/6, in order to confirm whether there is a consensus or not. The second textmapper should then be asked for the next point, and the results recorded in the same way. This procedure should be repeated for all the MISD that have been written down.
10. The list of points and the degree of consensus should then be discussed and a decision made as to whether the sound file provides sufficient points to warrant developing a task.
11. Where this is not the case, it would suggest that the sound file does not lend itself to MISD.
12. The textmap results should be transferred to a textmap table and the time added in order to check for sufficient redundancy between potential items.
13. A suitable test method should be identified and the task development work should begin.
3.6 Re-textmapping
Sometimes the initial textmap does not work for one reason or another,
for example because of disparate or insufficient entries. If the sound file
was textmapped for SIID, based on memory only, it is possible to textmap
it again to see if it would work for careful listening, that is, MISD.This
is particularly useful given the amount of time it takes to find a suitable
sound file. The important issue to remember here is the order in which
the textmapping procedures take place; that is, it should move from selec-
tive to careful. Once a sound file has been textmapped for MISD, it
cannot be re-textmapped for gist as the file is too well known and the
textmapped gists would reflect this. Thus any re-textmapping should only ever move in one direction: from selective towards careful listening, never back towards gist.
1. The difficulty level of the sound file in terms of its density, speed of
delivery, lexis, structures, content (abstract versus concrete), back-
ground noise and so on.
2. The topic of the sound file in terms of its appropriateness for the tar-
get test population (the level of interest, its accessibility, gender/age/
L1 bias).
3. The length of the sound file in terms of its appropriateness to the test
specifications and to the construct being targeted.
If the sound file is inappropriate for whatever reason, the test developer
who found the sound file must be told. Not doing so will waste everyone's
time and energy as the sound file will be deemed appropriate for task
development and more people than just the test developer will spend
time on it as the task moves from draft to peer review to trial.
3.8 Summary
Textmapping is not a foolproof system; involving human judgements as
it does, it cannot be. Having said that, it does provide a more systematic
approach to deciding how best to exploit a sound file and, if the pro-
cedure is followed carefully, goes some way to minimising some of the
idiosyncrasies that test developers may unwittingly introduce into the
assessment context. It certainly makes those involved much more aware
of what they are testing in terms of the construct and why. It also argues
for a fairer test, taking into account as it does the necessary redundancy
required when asking test takers to complete a task at the same time as
listening to a sound file. Using the sound file to carry out textmapping as
opposed to a transcript also acknowledges the true nature of the spoken
word. As Helgesen (quoted in Wilson 2008: 24) so succinctly puts it:
DLT Bibliography
Alderson, J. C., & Short, M. (1981). Reading literature. Paper read at the B.A.D. Conference, University of Lancaster, September.
Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining listening: Research and practice in assessing second language listening (pp. 77-151). Cambridge: Cambridge University Press.
Lynch, T. (2010). Teaching second language listening: A guide to evaluating, adapting, and creating tasks for listening in the language classroom. Oxford: Oxford University Press.
Nuttall, C. (1996). Teaching reading skills in a foreign language. London: Heinemann.
Sarig, G. (1989). Testing meaning construction: Can we do it fairly? Language Testing, 6(1), 77-94.
Urquhart, A., & Weir, C. J. (1998). Reading in a second language. Harlow: Longman.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. New York: Palgrave Macmillan.
Wilson, J. J. (2008). How to teach listening. Harlow: Pearson.
4
How do we develop a listening task?
This chapter focuses on the next set of stages that a task needs to go
through once a sound file has been successfully textmapped. These
include:
The TI should include, among other things, a record of the textmap
results (see Chapter 3). It is also useful for the task reviewer(s) later on in
the test development cycle (see 4.5). Based on the sound file Earthquake
in Peru, which was discussed in 3.3, the TI would appear as shown in
Figure 4.1 below.
[Figure 4.1: the completed TI for the 'Earthquake in Peru' gist task, with fields including CEFR Focus (B1.4 Gist), Source (URL: http://www.lancs.ac.uk/fass/projects/examreform/), date when downloaded, length of sound file (2.43), speed of delivery (approximately 180 words per minute) and the standard* being targeted (*name as appropriate, for example, STANAG, ICAO, National Standards inter alia).]
Test developer: to save time, use the test developer's initials, for example, HF.
CEFR Focus: select the appropriate descriptor(s) from the test specifications that describe the listening behaviour(s) your task is attempting to measure. For example, in Figure 4.1 the CEFR descriptor B1.4 is indicated. This is the fourth CEFR descriptor in this particular version of the B1 test specifications (hence B1 point 4) and the one that relates to the testing of gist. If there is more than one relevant descriptor, list them in terms of priority. This part of the TI is very important as it concerns the construct.
General Focus: complete this with the listening behaviour(s) your
task is attempting to measure (see Figure 2.4), for example here you
can see Gist. This part is also very important. (See 2.4 for a discussion
as to why both the CEFR Focus and the General Focus are included in
the TI.)
Levels: should include information about the perceived levels of both the sound file and the items. If you feel that the sound file and/or the items might cover more than one level, include both, for example, B1/B1+. It is expected that these levels will be the same or very close. Remember that where there is a marked difference (for example, the use of a more difficult sound file), even easy items will not help (see 2.5.1.4).
Test method: state which one you hope to use in the task. Again for
quick and easy completion, use sets of initials, for example, SAQ (short
answer questions), MCQ (multiple choice questions), MM (multiple
matching) and so on.
Topic: select an appropriate topic from the list which appears in the
test specifications (see 2.5.1.5).
Title of the sound file/task: this should be the same for both the
sound file and the task to make matching the two easier, especially
during the peer review stages.
Source: the copyright of sound files, video clips (if used) and/or any
pictures has to be obtained (unless you are using copyright free
sources). This box should provide full details of the sound file source/
video clip, the date it was downloaded (in case it is withdrawn and you
need to cite it when asking for copyright permission) and similar
information about any pictures that may be included in the task. These
links also help the reviewer to check the source if questions arise
regarding the suitability of the materials (language issues, picture qual-
ity and so on).
Length of the sound file: this should be completed and be within the
parameters cited in the test specifications.
Speed of delivery: make sure this is in line with the parameters pro-
vided in the test specifications (see 2.5.1.12).
Date: the date this version of the task was completed. This should be
updated each time the task is revised.
Version: this number should be updated each timethe taskis revised.
This way the test developer and the task reviewer can keep note of any
changes which have been made.
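To show how the fields fit together, a TI could be represented in a simple structured form, as in the sketch below (the values echo the Earthquake in Peru example where possible; the dictionary format itself, the topic entry and the dates are illustrative assumptions rather than part of the book's template):

task_information = {
    "test_developer": "HF",                    # initials only, to save time
    "cefr_focus": ["B1.4"],                    # descriptor(s) from the test specifications, in priority order
    "general_focus": "Gist",                   # the listening behaviour being targeted
    "levels": {"sound_file": "B1", "items": "B1"},
    "test_method": "MCQ",                      # e.g. SAQ, MCQ or MM
    "topic": "Natural disasters",              # hypothetical entry from the test specification topic list
    "title": "Earthquake in Peru",             # the same title is used for the sound file and the task
    "source": "http://www.lancs.ac.uk/fass/projects/examreform/",
    "date_downloaded": "2016-05-10",           # hypothetical date
    "length_of_sound_file": "2.43",
    "speed_of_delivery_wpm": 180,
    "date": "2016-06-01",                      # hypothetical date of this version of the task
    "version": 1,
}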
Listen to two girls talking about their holiday in Mexico. Choose the correct
answer (A, B, C or D) for questions 1-7. The first one (0) has been done as an
example.
The instructions that are heard at the beginning of the sound file
should be the same as those that appear in the task in the test booklet.
This helps the test taker to engage in a non-threatening act of listening
before being faced with having to understand what is being said and
needing to respond to questions based on the sound file. The instructions
should also include information about how long the test takers have to
read the questions prior to the beginning of the actual recording and how
long they will have to review and complete their answers once the record-
ing has finished. For example,
You are going to listen to a programme about lead mining in north Yorkshire.
First, you will have 45 seconds to study the task below, and then you will hear
the recording twice. While listening, choose the correct answer (A, B, C or D)
for questions 1-8. Put a cross (X) in the correct box. The first one (0) has been
done for you. At the end of the task you will have 15 seconds to complete your
answers.
The amount of time which should be provided for reading and com-
pleting the items depends on a number of factors, such as the type of
test method, the test takers' level of familiarity with it and the num-
ber of items in the task. Multiple choice questions, for example, usually
take longer to read than sentence completion items. The amount of time
required at the end of the sound file depends to some extent on whether
the test takers hear the sound file twice. If it is only played once, they will
definitely need some time to review and complete their answers. When
in doubt, provide more time rather than less; this can be confirmed after
the trial (see 6.1.2).
Certain research (Field 2013; Buck 1991) suggests that test tak-
ers perform better when they are allowed to preview certain types of
items as they gain insights into what to listen out for in the sound file.
Wagner (2013), on the other hand, feels that further research in this
area is needed to confirm that item preview does help. (The possible
conflict that item preview may have with cognitive validity was dis-
cussed in 1.5.1.1.)
There are a number of things to bear in mind when selecting which test
method should be used in a listening task. First of all, and most importantly,
the test method should lend itself to the construct which is being targeted in
the task (see Haladyna and Rodriguez 2013: 43). Field (2013: 141) advises
caution in those situations where the test format is driving the thinking of test
designers and item writers rather than the nature of the construct to be tested.
In other words, the construct should come first, the test method second.
Secondly, the test developer must always be aware of the amount of
reading the test method requires the test taker to undertake in order to
answer the questions. To this end, the stems and options should be as
short as possible though not so short that they become cryptic. Thirdly,
the wording must be carefully crafted so that the test taker does not waste
precious seconds trying to understand what it means while simultane-
ously listening to the sound file and trying to identify the answer.
Choosing the most appropriate test method to measure the targeted
construct is not always obvious and experience shows that some tasks
need to go through two test methods before the task works. The reason
for this could be related to the nature of the sound file (lack of sufficient
detail for MCQ items), to the construct (difficult to develop items which
sufficiently target it) or to the test developer's own ability to work with
a particular method especially early on in their training. To some extent,
choosing the best test method is a matter of experience which becomes
easier with practice.
Developing items at higher levels, for example at CEFR C1 and above,
can lead test developers into using linguistically and propositionally com-
plex wording in their items in an attempt to match the perceived dif-
ficulty level. This has obvious consequences for the processing demands
faced by the listener. Field (2013: 150) reminds us that with construct and
cognitive validity at stake, it is vitally important to limit the extent to which
difficulty is loaded onto items particularly given that those items are in a
different modality from the target construct.
Each test method has its strengths and weaknesses; these are discussed
in turn below.
One test method that appears to work well in listening tasks is multi-
ple matching (MM). There are a number of different formats, includ-
ing: matching answers with questions, for example, in an interview (see
Chapter 5, Task 5.1); matching sentence beginnings with sentence endings
(see Chapter 5, Task 5.3); matching topics with a series of short sound files
(see Into Europe Assessing Listening: Task 44); or matching what is being
said to a range of pictures (see Into Europe Assessing Listening: Task 43).
MM tasks can be used to target different types of listening behaviour
(Field 2013: 132, 137). For example, if you want to target the test takers'
ability to infer propositional meaning, you could develop a task which
requires them to match the speaker's mood or opinion about a particular
subject to one of the options. If you want to assess main ideas compre-
hension, you can paraphrase the textmap results (see 3.5) and then split
them into two parts (sentence beginnings and endings). Testing impor-
tant details can also be targeted through matching (see Into Europe
Assessing Listening: Task 41).
MM tasks are compact with little redundancy and require much less
reading than MCQ items (Haladyna and Rodriguez 2013; Field 2013).
Another advantage of MM tasks is that they involve no writing and there-
fore reduce the chance of any construct irrelevant variance that writing
may bring to the task. Post-trial feedback in a number of countries has
shown that test takers appear to enjoy this particular method. This is con-
firmed by Haladyna and Rodriguez (2013: 74) who state that the format
is very popular and widely accepted.
Care must, however, be taken to ensure that, where sentence beginnings
and endings are used, the task cannot be completed simply through
the use of grammatical, syntactical or semantic knowledge without
listening to the sound file. This is an argument that is often raised
against using this type of MM task. (See Task 5.3 for an example of this.)
However, this can be minimised by careful wording of the sentence
beginnings and endings.
SAQ items require the test taker to produce an answer, rather than to
select one from a range of options. They are often referred to as con-
structed response items. When using this method, the test developer needs
to define what short means in their particular test situation. If you
have a look at SAQ tasks in general, you will probably find that they
require a maximum of five words. This means the item can be answered
in between one to five words depending on what is being targeted. This
limit is imposed in an attempt to minimise any construct irrelevant vari-
ance, deriving from the test taker's ability to write, from affecting his/her
performance on the listening task (see Weir 2005: 137). When targeting
SIID, the answer can often be written using one or two words, but with
MISD and gist, it is more likely that a minimum of three words will be
needed for the test taker to show that they have understood the idea.
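A quick way of checking a draft key against whichever word limit has been chosen is sketched below (the five-word limit and the sample answers are illustrative only):

MAX_WORDS = 5   # however "short" has been defined in the test specifications

draft_key = {
    "Q1": "airport",
    "Q2": "because she saved her own money for flying lessons",   # too long for an SAQ key
}

for item, answer in draft_key.items():
    length = len(answer.split())
    if length > MAX_WORDS:
        print(item, ": key answer is", length, "words - over the limit of", MAX_WORDS)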
There are two main types of SAQs: those that consist of closed ques-
tions, for example, 'When was John Smith born?', and those that require
completion (often referred to as sentence completion tasks), for example,
'John Smith was born in ____'. It is strongly recommended that the com-
pletion part be placed at the end of the sentence rather than in the mid-
dle (see sample Task 5.6 in Chapter 5). This is because there is a strong
possibility that test takers will engage in guessing strategies (Field 2013:
131), in other words attempt to apply their syntactical, grammatical and
semantic knowledge to complete a gap when it appears in the middle of
an item, rather than one that appears at the end. Table completion tasks
are a further option (see Into Europe Assessing Listening: Task 25).
MCQ tasks can also be used in listening, and like MM tasks, are useful
in targeting different processing levels (Field 2013: 128). In terms of dif-
ficulty, Innami and Koizumi (2009) found that MCQ items are easier
than SAQ items in L2 listening. Careful thought, however, must be given
to MCQ item construction due to the amount of reading that may be
involved and the impact this can have on the test taker who is trying to
process the input and confirm or eliminate distracters at the same time.
In light of this, it is recommended that MCQ options should be as short
as possible, preferably only half a line at most (see Chapter 5, Task 5.8).
A decision also needs to be taken as to whether the item should have
three or four options. Recent research (Harding 2011; Shizuka et al. 2006;
Lee and Winke 2013) suggests that, given the demands upon the listener
and the minimal differences in discrimination, a three-option item
(ABC) is optimal in MCQ tasks. Haladyna and Rodriguez (2013: 66)
add that for average and stronger test takers, the three-option MCQ is
more efficient but for the weaker test takers four or five should be used
on the grounds that they are more likely to employ guessing strategies.
From a practical point of view, three-option MCQ items also take less
time to construct and can save time during the test administration (Lee
and Winke 2013) thus possibly allowing for other items to be added,
depending on the overall amount of time allocated to the listening test,
and thereby providing more evidence of the test takers listening ability
(see Haladyna and Rodriguez 2013: 66).
Whether you choose to use three or four options, they all need to
be as attractive and plausible as possible to limit successful test-taking
strategies. This is particularly true, however, where only three options are
used, as being able to easily dismiss one of these options will provide the
test taker with a 50:50 chance of answering the item correctly through
guessing. Options that are ridiculous in content, making them easy to
eliminate, and those which are in any way tricky, must be avoided
(Haladyna and Rodriguez 2013: 62).
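A quick back-of-the-envelope calculation (illustrative only) shows how the odds of answering correctly by guessing change once one option can be dismissed out of hand:

def guess_probability(options, eliminated):
    # Chance of guessing the key once the implausible options have been dismissed
    return 1 / (options - eliminated)

print(guess_probability(3, 1))   # three options, one dismissed -> 0.5 (the 50:50 chance noted above)
print(guess_probability(4, 1))   # four options, one dismissed -> about 0.33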
Where MCQs are used to measure MISD, the input of the sound file
needs to be detailed enough to produce a sufficient number of viable
options. Sound files of a discursive nature, such as those where two or
three people are putting forward different arguments, where someone is
being interviewed or where one person is explaining different opinions
held by a number of other people, lend themselves to MCQ items.
As with MM tasks, pictures are particularly useful at the lower end of
the ability range; for example, the Cambridge ESOL suite uses MCQ tasks
with pictures at KET and PET. Using a set of four pictures, test takers
could be asked to match the correct picture to the content of the sound
file (see Into Europe Assessing Listening: Task 13 for an example of this
type); or there could be multiple sets of related pictures, based on what the
speaker is talking about or describing, and the test taker must choose the
correct answer to each question in turn. Field (2013: 134-5) points out
that this approach might be particularly useful where test takers are from
L1 contexts which do not use the Western European alphabet.
The decision regarding the optimal number of items in a task (or test) should
have been made at the test specifications stage (see 2.5.2.3). During task development,
it is important that the test developer complies with the minimum and
maximum number of items per task unless there are good reasons for
reviewing this decision before the task goes into the trial. For example,
where the textmap results allow for one or two extra items in a task, it
may be useful to include these at the trial stage; any items with weak sta-
tistics can then be dropped after checking for any newly created gaps in
the sound file content.
Given the number of demands placed upon a test taker during a listening
test (listening, reading and sometimes writing), it is crucial that the task
layout be as clear and as listener-friendly as possible. Where a task needs
two pages, these must be placed opposite each other in the test booklet
to avoid page turning. In addition, there should be ample space for the
test taker to write his/her answer in a SAQ task and the MCQ options
should be spread out sufficiently well for the test taker to be able to see
them clearly. In MM tasks where test takers are required to match sen-
tence beginnings and endings, it is strongly recommended that the two
tables are in the same position on opposing pages so that the test taker
simply needs to read across from one to the other (see Chapter 5, Task 5.3
'A Diplomat Speaks' for an example of this).
should be awarded. Where doubt exists and the test is a high stakes one,
another colleague should be asked for their opinion. If no one is avail-
able, look through the rest of the test takers answers to see if this can help
you to determine whether the test taker should be given the benefit of the
doubt or not.
It is strongly recommended that half-marks are not used; experience
shows that these tend to be used in an inconsistent (and therefore unre-
liable) way across different markers. In addition, items that carry more
than one mark often only serve to artificially inflate the gap between the
stronger and the weaker test takers. Where a particular aspect of listen-
ing is felt to be more important (for whatever reason), then it is better
to include more items targeting that type of listening behaviour than
to award more than one mark to an item (Ebel 1979). However, you
should also be aware of redundancy and construct over-representation if
too many items target the same construct.
4.4 Guidelines for developing listening items
Developing a set of item writing guidelines which test developers can use
as the basis for task development work is crucial for a number of reasons.
Firstly, they help to ensure that the items conform to the test specifica-
tions. Secondly, guidelines should help to minimise any reliability issues
that might creep in due to the inclusion of inappropriate wording in
the instructions. Thirdly, they should encourage all members of the test
development team to work in the same way. Fourthly, they act as a check-
list to refer to during peer review (see 4.5).
Guidelines need to address issues related to the sound file, the instruc-
tions including the use of the example and picture (if used), task devel-
opment, the test method and the grading procedure. Based on past
experience of working with task development teams, recommendations
regarding how each of these issues can best be dealt with are presented
below.
1. Use authentic sound files. These could be ones which have been
downloaded from the internet (check copyright permission) or ones
which you have created yourself. For example, an interview of some-
one talking about the kind of books they like to read (see Task 5.1,
Chapter 5).
2. The length of the sound file must be within the test specification
parameters.
3. The topic should be accessible in terms of cognitive maturity, age and
gender and should be something the target test population can relate
to.
4. The sound file should exhibit normally occurring oral features (see
1.4) in keeping with the input type (for example, speech versus
conversation).
5. The speed of delivery must be commensurate with the targeted level
of difficulty and conform to the test specifications.
6. Accents should be appropriate in terms of range, gender and age.
7. The number of voices should be in keeping with the difficulty level
being targeted. (The more voices there are, the more difficult a sound
file usually becomes.) (See Field 2013: 116.)
8. At least some sound files should have background noise to replicate
what listeners have to deal with in many real-life listening contexts.
Such background noise should be supportive and not disruptive (see
Task 5.8in Chapter5).
9. Sound files must be of good quality that will replicate well in the
target test situation (acoustics).
10. Where phone-ins form part of the sound file, ensure that the audibil-
ity level is sufficiently clear as the volume can often differ at those
points.
11. Check that the sound file does not finish abruptly, for example in the
middle of a sentence, as test takers might think there is something
wrong with the recording. Instead edit the last few words of the
sound file so that they fade out naturally.
You are going to listen to … While listening, match the beginnings of the
sentences (1-7) with the sentence endings (A-J). There are two sentence end-
ings that you should not use. Write …
HF_Earthquake_in_Peru_MCQ_v1
18. The answers to the items must be in the order in which they appear
in the sound file otherwise they are likely to impose a heavy load on
the test takers' memory (see Buck 2001: 138; Field 2013: 133-4).
19. Make sure there is sufficient redundancy in the sound file between
two consecutive items so the test taker has time to process the input
and complete his/her answer (ibid.). According to Field (2013: 89)
much listening is retroactive, with many words not being accurately
identified until as late as three words after their offset.
20. Avoid using referents (personal pronouns, demonstratives) in test
items. For example, Where did he go on Monday? should be writ-
ten as Where did John go on Monday? If John appears throughout
the sound file and is the only male voice/male person referred to in
the sound file, he can be used after the initial question.
21. Make sure the content of the options does not overlap.
22. Word the stem positively; avoid the use of negatives in both the stem
and the options as this has a negative effect on students (Haladyna
and Rodriguez 2013: 26, 103).
23. Avoid humour in items as it detracts from the purpose of the test
(ibid.: 107).
24. All tasks should include a key, which should appear on the final page
of the task separated from the rest of the task so as not to influence
those involved in peer review (see 4.5 below). It should not appear
within the task.
25. Check that the key is correct and that any final changes made to the
task (distracter order, for example) are reflected in the final version of
the key.
This is confusing for the test taker and would require two sets of
instructions.
8. Where the item is targeting a main idea, test takers should be required
to write more than just one word. (One word is not usually sufficient
to test a main idea though it occasionally can do at a higher level of
difficulty and/or where the targeted answer is based on an abstract
concept.)
9. Ensure the items do not require the test takers to use the same
answer more than once as this might lead to confusion (good test tak-
ers are likely to reject this possibility) and may result in a lack of face
validity.
1. Check that there is only one correct answer unless the task allows test
takers to use the same option more than once in the task.
2. In order to minimise the use of syntactical, grammatical and semantic
knowledge in putting sentence beginnings and endings together, start
all sentence endings with the same part of speech. Where this is not
possible, use two parts of speech.
3. Make sure that the combination(s) of sentence beginnings and end-
ings can be processed while listening, in other words, they are not too
long.
4. At the trial stage it is useful to include two distracters just in case one
of them does not work (seeChapter5, Task 5.3). One of these can be
subsequently dropped if necessary. Where a task contains only a few
items (under five), one distracter may be sufficient.
5. Make sure that the wording of the options has been paraphrased so
that the test takers cannot simply match the words with those on the
sound file.
6. The distracters should reflect the same construct as the real options.
In the above case, the word Different could be moved into the MCQ
stem.
8. Check that there is only one correct answer.
9. Where figures, times, dates and so on are used, put them in logical or
numerical order. For example:
A 1978
B 1983
C 1987
D 1993
Task 5.3 A Diplomat Speaks does this for the MM test method, and
Task 5.6 Oxfam Walk for the SAQ one.
In order to be able to give constructive feedback, the reviewer must wear the reviewer hat and no
other.
Wherever possible (and admittedly this is not always the case), the
feedback is likely to be even more useful if the reviewer is someone who
has not taken part in the textmapping procedure. Where the latter is
the case, unless there has been some time between the two events, the
reviewer may well remember certain aspects of the sound file and this
can influence his/her feedback on the task. For example, the items might
seem easier, the answers more obvious, because s/he remembers parts of
the sound file.
In addition to peer reviewers being able to provide constructive feed-
back, test developers have to be able to accept it and to acknowledge that
sometimes their task is not going to work and that it needs to be dropped.
For the sake of everyone involved in test development, it is important
that this aspect of task development is aired and embraced from the very
beginning.
d. Is there more than one answer? If the task is SAQ and there is
more than one answer, check whether the answers relate to the
same concept or to two separate ones. If the latter, add a note, if
the former, ask the test developer whether your alternative sugges-
tion would be acceptable.
e. Can all the questions be answered based on the sound file?
f. Do the distracters work? That is, does your eye engage with them
or not even grace them with a blink? If the latter, you need to
leave a comment.
g. Is there any overlap in terms of content between the items? For
example, do two of the items have the same answer?
h. Does the answer to one item help with the answer to another
item?
i. Do any of the items target something other than the construct
defined in the TI? If so, check the textmap table to see what the
test developer meant to target.
j. Do any of the items require the test takers to understand vocabu-
lary or expressions above the targeted level in order to answer the
item correctly?
k. Can the answer be written in the number of words allowed by the
task (SAQ)?
l. Is the test method the most appropriate one given the contents of
the sound file and the targeted construct?
12. Now do the task under the same test conditions as the test taker as
far as possible. If the instructions say the sound file will be played
twice, then play it twice even if you do not need to hear it twice. Give
yourself the same amount of time as the test takers will have to read
and then complete the questions. If the recording should be listened
to twice, mark the items in such a way that the test developer can see
which ones you answered on the first listening and which on the
second. By doing this you provide useful insights to the test devel-
oper on the differing difficulty levels of the items or the related part
of the sound file.
13. Do not stop the sound file while doing 1-10 above; simply make
quick notes on the task that you can later complete. (After a while
this will become second nature and you will do it much more
quickly.)
14. Once you have finished completing the items and your comments,
check the answers you have given against the key the test developer
has provided. (This should be on a separate page so you are not influ-
enced while completing the task. The answers must not be marked
in the task.)
15. Where any differences are found between the key and what you have
written/chosen, add a note. If your answer is not in the list (SAQ
tasks), or you have chosen another option (MCQ/MM), ask the test
developer whether s/he would accept it or not.
16. If you could not answer an item, tell the test developer, including the
reason if known.
17. Where you find that there is insufficient time to complete an item,
check the Time column in the textmap table, which should be
located at the end of the task. If the time appears to be sufficient, try
to deduce why the item was problematic and mention this in your
feedback.
18. Look through the textmap table results to ensure that what is there
has been targeted in the items and that all points relate to the con-
struct defined in the TI. Add comments as necessary.
19. Finally, taking all the feedback into consideration, decide whether
the test developer should be encouraged to move on to the next ver-
sion of the task or not. If not, summarise your reasoning so as to help
the test developer as much as possible with his/her future task
development.
20. Once your comments are complete, add your initials to the file
name, for example, HF_Earthquake_in_Peru_MCQ_v1_RG,
and return the task to the test developer.
21. If you feel that in light of doing the task any of your comments
might impact on the test specifications or the item writing guide-
lines, make sure this information is passed to the person responsible
for this aspect of task development so that the documents can be
reviewed and/or updated as necessary.
4.5.2 Revision
On receiving feedback, the test developer should read through all the
comments to get a general idea of what issues have been raised. Then if
the task has been recommended to move forward to the next version, the
test developer should work through each comment, making changes as
necessary. To help the reviewer, it is better if the test developer puts any
new wording or comments in a different colour. This should help speed
up the review process.
Where a test developer disagrees with something the reviewer has said,
a reason must be provided. For example, if the test developer feels that
an answer suggested by the reviewer in an SAQ item is not acceptable, a
reason must be given. If something the reviewer has written is not clear,
the test developer should ask for further explanation or clarification.
Comments should not be left unanswered; this only leads to lost time, as
the reviewer will need to post the comment again on the next version of
the task if s/he sees it has not been responded to.
Once the revisions are complete, the version number and date in the
TI should be changed and the reviewers initials removed from the file
name so that it appears as follows: HF_Earthquake_in_Peru_MCQ_v2.
The task should then be re-posted to the same reviewer for further
feedback.
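The naming convention above also lends itself to a simple check or light automation; the sketch below (an illustration of the convention, not a tool described in the book) strips the reviewer's initials and moves the file name on to the next version:

import re

def next_version(filename):
    # e.g. 'HF_Earthquake_in_Peru_MCQ_v1_RG' -> 'HF_Earthquake_in_Peru_MCQ_v2'
    match = re.match(r"(?P<stem>.+_v)(?P<version>\d+)(?:_[A-Z]+)?$", filename)
    if not match:
        raise ValueError("Unexpected task file name: " + filename)
    return match["stem"] + str(int(match["version"]) + 1)

print(next_version("HF_Earthquake_in_Peru_MCQ_v1_RG"))   # HF_Earthquake_in_Peru_MCQ_v2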
4.6 Summary
Developing good tasks takes time, but it is time well spent if it results in
tasks that provide a reliable and valid means of measuring the test takers'
ability. In addition, the procedures outlined above should increase the
test developer's own expertise and ability to produce good listening tasks.
DLT Bibliography
Buck, G. (1991). The testing of listening comprehension: An introspective study. Language Testing, 8(1), 67-91.
5
What makes a good listening task?

Introduction
In choosing the tasks that are discussed in this chapter, I had a number of
objectives in mind. Firstly, I wanted to include tasks that focused on dif-
ferent types of listening behaviour; secondly, I looked for tasks that could
exemplify different test methods (multiple matching, short answer ques-
tions and multiple choice questions); and thirdly, I selected tasks which
targeted a range of different ability levels. In addition to these consider-
ations about the tasks themselves, I wanted to include a range of sound
files that reflected different discourse types, topics, target audiences and
purposes. The final selection will hopefully provide some useful examples
of what works well and what can be improved upon.
In the case of each task the test population, instructions and sound file
are described and then the task presented. This is followed by a discussion
of each task in terms of the type of listening behaviour the test developer
is hoping to measure, the suitability of the sound file in terms of reflect-
ing a real-world context, the test method and the layout of the task in
terms of facilitating the listeners responses.
The keys for all the tasks are located at the end of this chapter and the
relevant sound files can be found on the Palgrave Macmillan website. It
should be noted that sometimes the instructions are present at the begin-
ning of the sound file and sometimes they are not.
To receive the maximum benefit from this chapter, I strongly recommend
you actually do the tasks as a test taker under the same conditions, that is,
if the instructions say the recording will be played twice, then listen twice.
Read the task instructions carefully to see what you should do and study
the example and the items in the time provided. I find it very helpful to use
different colours for the answers I give during the first and second times that
I listen to the sound file as they provide an indicator of those items which
might be more difficult or which might be working in a different way than
had been anticipated by the test developer. Above all, you should remember,
firstly, that there is no such thing as a perfect task and, secondly, that what you
as a reader and/or teacher may feel is problematic quite often goes happily
unnoticed by the test taker and is not an issue in the resulting statistics!
This first multiple matching task was part of a battery of tasks which were
developed for adult university students who required a pass at either B1
or B2 in order to graduate from university. Time was provided before and
after the task for the test takers to familiarise themselves with what was
required and to complete their answers. The instructions and the task
itself appear in Figure 5.1.
Listen to Jane answering questions about her reading habits. First you have
45 seconds to study the questions. Then you will hear the recording twice.
Choose the correct answer (1-7) for each question (A-I). There is one extra
question that you do not need to use. There is an example (0) at the beginning.
At the end of the second recording, you will have 45 seconds to finalise your
answers. Start studying the questions now.
5 What makes agood listening task? 117
[Figure 5.1: 'Question / Answer' matching table linking the questions (A-I) to Jane's recorded answers (1-7)]
5.1.2 Task
The items were aimed at measuring the test taker's ability to synthesise
the ideas presented in each response that Jane gave in order to determine
the overall idea and then link this with the relevant question. For exam-
ple, in attempting to find the answer to question 1, the test taker needs
5.1.2.3 Layout
The two parts of the table are opposite each other so the test taker simply
has to look across to the options, select one and fill in the appropriate box.
This task was developed for use with 11-12 year old schoolchildren. The
test takers were provided with time to study the task before being asked
to listen twice to the sound file. Further time was allowed at the end of
the second listening for the test takers to complete their answers. The
instructions and the task itself appear in Figure 5.2.
Listen to the description of a school class. While listening match the chil-
dren (B-K) with their names (1-7) . There are more letters than you need.
There is an example at the beginning (0). You will hear the recording
twice.
At the end of the first recording you will have a pause of 10 seconds.
At the end of the second recording you will have 10 seconds to complete
your answers. You now have 10 seconds to look at the task.
[Figure 5.2: picture of a school class with the children labelled A-K]
0 Miss Sparks A
Q1 Ben
Q2 Mary
Q3 Judy
Q4 Linda
Q5 Susan
Q6 Michael
Q7 Sam
The speaker in this sound file has obviously been asked to describe the
students in the picture, which does not reflect real-life listening in the
same way as the previous task and therefore lacks authenticity. In terms
of the content, however, it is something that the target test population
would be able to relate to. The sound file is approximately 50 seconds
long and consists of just one female voice talking in a reasonably natural
and measured way. The test developer put the combined sound file and
items at CEFR A2.
5.2.2 Task
According to the test developer, the items were aimed at measuring the
test taker's ability to identify specific information (the names of the chil-
dren) and important details (things which help to differentiate the chil-
dren from one another such as descriptions of their hair, their clothes and
so on).
Let's take a look to see how well this works. The first child to be
described is Susan. The speaker mentions that she has long dark hair
and a striped pullover. The next child to be described is Ben; however,
in order to answer this item correctly the test taker has to rely on the
child's location (Ben is next to Susan). The item is, therefore, arguably
interdependent: that is, if the test taker did not identify Susan cor-
rectly, s/he may be in danger of not identifying Ben correctly either.
Understanding important details helps with the next child, Linda, who
is described as wearing glasses. A further piece of information (though
an idea) helps to confirm her identity (she knows the answer, her hand is
up). The next child, Sam, can be identified through a series of important
details such as black curly hair and black jacket. (Further information
is also provided regarding his location, though again like Ben, the extent
to which this helps depends on whether the test taker has managed to
identify Linda correctly.)
It seems that although some of the items can be answered by under-
standing important details, others involve understanding ideas and there
is a degree of interdependency between some of the items. The intended
focus of the task could easily be tightened by focusing on the important
details of the children rather than on their location or what they are
doing. On the positive side, seven items are likely to provide a reason-
able picture of the test takers ability to identify specific information and
important details (once the task has been revised). Although the sound
file is relatively dense and some test takers might miss one of the names,
being able to hear the sound file again provides them with a second
chance. The sound file could also be made more authentic by building
in other natural oral features such as hesitation, repetition, pauses and
so on.
5.2.2.3 Layout
This task was used as part of a suite of tasks to assess the listening abil-
ity of career diplomats. The test takers were provided with time to study
the task before they heard the sound file, which was then played twice.
Further time was allowed at the end of the second listening for the test
takers to complete their answers. The task instructions are below while
the task itself appears in Figure 5.3.
You are going to listen to part of an interview with a diplomat. First you
will have one minute to study the task below, and then you will hear the
recording twice. While listening, match the beginnings of the sentences (1-7)
with the sentence endings (A-J). There are two sentence endings that you
should not use. Write your answers in the spaces provided. The first one (0)
has been done for you.
After the second listening, you will have one minute to check your
answers.
The sound file is an extract from an interview with the then Australian
Ambassador to Thailand and, as such, had content validity for the test tak-
ers as the topic covered issues related to their profession. It was approxi-
mately four minutes in length and consisted of two voices: the female
interviewer and the male ambassador. Both of the speakers have Australian
accents and talk in a rather measured way; the test developer estimated the
speed of delivery at approximately 170 words per minute. The lack of any
background noise was probably due to the fact that the interview took place
in a studio. The test developer put the sound file at around CEFR B2/B2+.
5.3.2 Task
The test developer asked colleagues to textmap the sound file for main ideas
and supporting details (MISD). The results were paraphrased to minimise
the possibility of simple recognition, and then the textmapped MISD were
split into two parts, beginnings and endings, as shown in Figure 5.3.
Let us look at a couple of items to see the extent to which the test
developer was successful in requiring test takers to understand MISD,
starting with the example, which was also based on a main idea that came
out of the textmap.
Its purpose, as discussed in 4.2, is not only to show the test takers what
they have to do in order to complete the other items, but also to provide
them with an idea of the type of listening behaviour they should employ
and the level of difficulty they should expect to find in the rest of the task.
The test taker needs to find some information in the sound file which
means something similar to the sentence beginning 0 The relationship
with Thailand and then match what comes next in the sound file with
one of the options, in this case F, which is marked as the answer to the
example. The ambassador says:
I think the best way to describe whats happened over that period in Australia-
Thailand relations is a relationship of quiet achievements, that we've actu-
ally seen that relationship grow in a steady way over that entire period
Student numbers have grown from just a few thousand students in the 1990s
to over 20,000 students these days.
Having identified the appropriate part of the sound file, the test taker
must then find a suitable ending from within the options A to J. Part
of sentence ending I refers to growth: have increased hugely; moreover
the time frame mentioned by the ambassador matches the second part
of sentence ending I: over the past two decades. Therefore the correct
answer is I.
Question 4 states: Thai businesses are now putting money ____,
indicating to the test takers that they need to identify some reference in
the sound file which relates to Thai business and putting money. In the
interview, the ambassador says:
for the last couple of years investment has been the story, especially Thai
investment in Australia, which has gone from a very low base to be really sub-
stantial, where you have major Thai investments in our energy sector, in our
agri-business and in our tourism industries as well.
5.3.2.3 Layout
Experience has shown that placing the two tables containing the sentence
beginnings and endings opposite each other minimises the amount of
work the test takers' eyes have to undertake in order to complete the task.
This is important given the various constraints of the listening task. Test
takers were asked to enter their answers directly into the table to reduce
any errors that might occur in transferring them to a separate answer
sheet. (It is acknowledged that this is not always a practical option in
large-scale testing.)
This first short answer question task was part of a bank of tasks given on an
annual basis to 11 to 12 year old schoolchildren to determine what CEFR
level they had reached. Time was provided before and after the task for
the test takers to familiarise themselves with what the task required and to
complete their answers. The instructions and the task appear in Figure 5.4.
Listen to a girl talking about her holidays. While listening answer the ques-
tions below in 1 to 5 words or numbers. There is an example at the beginning
(0). You will hear the recording twice. You will have 10 seconds at the end
of the first recording and 10 seconds at the end of the task to complete your
answers. You now have 20 seconds to look at the task.
0 When did the girl go on holiday? winter
The sound file lasts just under one minute and is based on an 11 year
old girls description of her winter holidays. The delivery sounds rather
studied, suggesting it was based on either a written text or a set of scripted
bullet points. The language itself, however, seems reasonably natural and
appropriate for an 11 year old. The test developer felt the sound file was
suitable for assessing CEFR A2 listening ability and that the test takers
would find the topic accessible.
5.4.2 Task
According to the test developer, the items were aimed at measuring the
test taker's ability to identify specific information and important details
(SIID) based on the results of the textmapping exercise. For example, in
question 1, the test taker has to focus on who else went with the speaker
and her parents (answer: her brother, an important detail); in question 2,
the test taker must listen out for a kind of transport (answer: car, an impor-
tant detail); in question 3, the test taker must identify the length of time
the journey took (answer: eight hours, specific information) and so on.
The SAQ format lends itself well to items that target SIID, as the num-
ber of possible answers is limited (unlike MISD items; see Task 6
below). In general, this makes it easier to mark and usually easier for
the test taker to know what type of answer is required. Another advantage
of using the SAQ format here is that the answers require little manipula-
tion of language (limited construct irrelevant variance).
The example indicates to the test taker how much language s/he needs
to produce and the type of information being targeted. This should help
them to have a clear picture of what they need to do in the rest of the task.
However, the answer to the example does appear in the first sentence of
the sound file, giving the test taker little time to become accustomed to
the speakers voice and topic. This is not ideal. It is recommended that the
first utterance in a sound file be left intact and that the example be based
on the second or third one, depending on the results of the textmapping
exercise. With short sound files, however, this sometimes proves difficult
and arguably it is better to have an example based on the first utterance
than to have no example at all.
The wording of the items is not difficult and appears to match the test
developer's aim of targeting A2. Six items (each answer to question 4 was
awarded 1 mark) provide a reasonable idea of the test taker's ability to
identify SIID (mainly important details here). The pace of the speaker
and the distribution of the items throughout the sound file allow suf-
ficient time for the test taker to complete each answer.
5.4.2.3 Layout
The layout of the task encourages the use of short answers, as there is
insufficient room for a sentence to be inserted. (Experience shows that
even when a maximum of 4 words is mentioned in the instructions, some
test takers still feel they should write a complete sentence.) The test takers
are required to write their answers directly opposite the questions; this
should help when simultaneously processing the sound file.
This SAQ task comes from a range of tasks aimed at assessing the English
language ability of 14 to 15 year old students. Test takers simply had to
complete one question based on the sound file following the instructions
given in Figure 5.5 below:
5.5.2 Task
The item requires the test takers to determine the reason why the man,
Jim, is making the call. In order to do this, the test takers need to syn-
thesise a number of ideas: firstly, that the caller wants to speak to Mike,
who is out at the time of the call; secondly, that he is speaking to Mike's
sister who is willing to take a message; thirdly, that Jim and Mike were
scheduled to meet at 8 p.m.; and fourthly, that Jim is not feeling well so
he will not be able to make the appointment. (We also learn that Mike's
sister will pass the message on, although this is not needed to complete
the task.) The test taker needs to combine the information these ideas
represent and produce the overall idea in order to answer the question.
The answer should reflect something along the lines of 'he can't come
tonight', 'he's not feeling well', or 'he can't meet Mike'.
The short answer question format works well in this type of task, in
which the gist or overall idea is being targeted, as it requires the test
taker to synthesise the ideas him/herself rather than simply being able
to select one from three or four options. The number of words required
is limited (they are told they can use up to seven words, but it can be
done within four or five) so it should not be too taxing; nor are the words
particularly difficult to produce which should minimise any construct
irrelevant variance which writing might bring to the task.
There is no example as there is only one item; this is usually the case
with single gist items (as opposed to a multiple matching gist task such as
that discussed in Task 1). Where there is any doubt as to the test takers'
level of familiarity with such items, a sample task should be made available.
5.5.2.3 Layout
The layout of the task is very simple and should cause no particular
problems.
This SAQ task comes from a bank of tasks aimed at assessing the ability
of final year school students. The instructions and task can be found in
Figure 5.6 below.
Oxfam Walk
0 Rosie works for the charity Oxfam as the ___. marketing coordinator
Q8 To find out about the job offer get in touch with ___.
5.6.2 Task
The task requires the test takers to identify some of the specific infor-
mation and important details in the sound file. The sample question
provides the test takers with the kind of important detail they should be
listening out for (in this case, the role Rosie fulfils at Oxfam) in order to
complete the statements in questions 1-9. Other items, such as 4, 7 and
8, also target important details, while the rest focus on specific informa-
tion. Question 8 could be answered with either the name Simon Watkins
(specific information) or his role, current chairman.
Although the test developer successfully identified SIID in the sound
file, the fact that the test takers are allowed to listen twice suggests that
they will employ careful listening as opposed to selective listening, and
that the level of difficulty (despite the speed of delivery) may be lower
than B2.
At first sight, the short answer question format seems well suited to this
task in that the test taker simply needs to complete the statements with
numbers, names, figures and so on. In reality, the trial showed that the
test takers came up with myriad ways of completing the statements,
making the final key of acceptable answers (not all produced in this chap-
ter's key for reasons of space) incredibly long. This was surprising as it
was expected that the answers the test takers would produce for the SIID
items would avoid the multiple answer situation often faced by MISD
questions.
5.6.2.3 Layout
As with Task 4 above, the layout of the task encourages the use of short
answers as there is insufficient room for a sentence to be written in the
space provided. The need for a short answer is also stressed in the instruc-
tions (a maximum of four words) and helps to minimise any construct
irrelevant variance.
The instructions provide a clear context for the sound file, which is based
on a young man explaining to someone how to find the hospital. The
directions given last just under 20 seconds. The test developer felt the
sound file was suitable for assessing CEFR A2 listening ability and the
topic was felt to be something 14-15 year olds would be able to relate to.
Listen to a man describing the way to the hospital. While listening, tick
the correct map (a, b, c or d). You will hear the recording twice.
You will have 10 seconds at the end of the recording to complete your
answer.
You now have 20 seconds to look at the maps.
[Four maps (a-d), each showing Kings Road and the hospital in a different position]
5.7.2 Task
The item was aimed at measuring the test taker's ability to grasp the
overall meaning of the directions based on identifying and understanding
the relevant SIID. For example, the test taker needed to understand such
details as 'straight on', 'turn right', 'roundabout', 'left', 'second building',
'on right' and specific information such as 'Kings Road'.
The multiple choice question format, in the shape of a map, lends itself
well to instructions such as these, as the maps display the necessary informa-
tion in a non-verbal way. The test taker simply has to match what s/he
is hearing to the visual display. The task is very simple to mark. There is
obviously no example as there is only one item; where there is any doubt
about test takers' familiarity with this type of task, a sample exercise
should be made available to them prior to the live test administration.
5.7.2.3 Layout
The layout of the task is compact and it is possible to look at all four
options simultaneously, although a little more space between the four
maps might have helped. The box, which the test taker needs to tick,
is quite small and may take a few seconds to locate. Putting the boxes
outside the maps might have made them easier to see.
Tourism in Paris
[Figure: multiple choice task with an example (0) and Questions 1-7, each with options A-D; answer row: 0 Q1 Q2 Q3 Q4 Q5 Q6 Q7]
The sound file is an authentic interview with Elliott, who works for the
Paris tourist office. It takes place outside, which is indicated by appropri-
ate supportive background noise. Both the questions and the responses
in the interview are delivered quite naturally and in an engaging way.
The length of the sound file is just under three minutes and the speed of
delivery was estimated to be approximately 150 wpm. The test developer
put the task as a whole at CEFR B1.
5.8.2 Task
The task requires the test taker to understand the main ideas and sup-
porting details presented in the sound file. For example, at the beginning
of the sound file Elliott is asked what there is to do in Paris. He answers
that this depends on how many days the tourist is going to spend in the
city. This idea has been transformed and paraphrased into question 1
(When choosing activities in Paris you should think about _____. The cor-
rect answer is A, the duration of your visit).
The second question attempts to target the first venue that Elliott rec-
ommends and also his reason for doing so; in other words, Montmartre,
so as to get a nice view of the city. The test developer manages to avoid
using the name of the place, which would cue the answer, but the stem
does presuppose that the test taker is aware that this is the first place
Elliott mentions. This also happens in question 5 (second area). This is
one of the challenges that test developers meet when trying to test the
main idea without signalling too precisely where the answer is located,
which could lead to test takers answering an item correctly through rec-
ognition rather than comprehension. Sometimes slips occur, as in the
example and question 6, where the words 'tourist office' appear in both
the sound file and the items.
Having said that, it is sometimes practically impossible to paraphrase
certain words without the results appearing engineered or being more
difficult than the original wording. Where a word occurs many times
in a sound file, it seems reasonable to use the original word if it proves
too difficult to paraphrase as arguably the test taker still has to be able to
identify the correct occurrence of the word(s) and use this to answer the
item concerned.
There are a total of seven items plus the example in the task which,
with a three-minute sound file, would suggest sufficient redundancy for
the test takers to complete and confirm their answers by the end of the
second listening. (The actual distribution of the items should of course be
checked at the textmapping stage see 3.5.)
The sound file is quite detailed in terms of ideas and information about
what people should do when visiting Paris and therefore lends itself to a
multiple-choice task. The options are reasonably short, thereby minimis-
ing the burden placed on the test takers as they listen to the sound file
and try to determine the correct answer. The distracters are not easily
dismissible and it is unlikely that the test taker will be able to eliminate
any before listening to the sound file.
5.8.2.3 Layout
The layout is neat and concise, and the space for writing the answers is
clearly indicated by the example in the table at the bottom of the task.
Boxes at the side of each item might have helped, rather than requiring
the test takers to transfer their answers to the table at the bottom of the task.
5.9 Summary
In this chapter you have worked through eight listening tasks reflecting
different behaviours, test methods, topics and types of sound file and read
the discussion concerning their advantages and disadvantages. Based on
6
How do we know if the listening task works?
Introduction
If you have followed the steps outlined in Developing Listening Tests so
far, your listening tasks should have gone through a number of carefully
applied stages from defining the construct and the performance condi-
tions in the test specifications (Chapter 2), to textmapping (Chapter 3),
and task development, peer review and revision (Chapter 4). Even so, it
cannot be guaranteed that the final product will be without error. To be
certain that an item/task is likely to contribute positively to a valid and
reliable test score, it is necessary to subject it to a trial on a representative
test population (Green 2013; Buck 2009). The resulting data should then
be analysed to determine whether they have good psychometric proper-
ties. In addition, where high-stakes tests are involved, the task(s) should
then be subjected to an external review (see Chapter 7).
Some test development teams believe that it is impossible to trial tasks
because of security concerns. While this is indeed an issue that must be
considered very carefully, particularly in high-stakes tests, a decision not to
trial can have major negative effects on a test takers performance and on
the confidence level which stakeholders should have in the validity and reli-
ability of the resulting test scores. Experience shows that trialling ahead of
when the tasks will actually be needed (see 6.2 below) helps to minimise any
perceived security threats, as does trialling multiple tasks simultaneously,
so that it is unclear as to which tasks will finally be presented in any live
administration. Of course, the latter presupposes that there are a number of
test developers working together and that resources are available for a large-
scale trial. In the school context, by contrast, it is recommended that tasks
be trialled on parallel classes or in other schools in order to gather informa-
tion about how the tasks perform. Without trials, it is impossible to know
whether or not an item or task will add to the validity and reliability of the
test score. This is something that all decision makers should be aware of.
To summarise, trialling in general allows us to ascertain if the tasks
perform as expected and whether they are likely to contribute to a valid
and reliable test score. Many things can impact on the success of an item
or task and each of these can be examined through field trials.
First of all, we need to check that the task instructions (sometimes referred
to as rubrics) are doing their job. If these have not been carefully written
using language that is at the appropriate level (equal to or lower than that
which is being targeted in the task) and avoiding metalanguage, the test
takers might not understand what is expected of them. So although a
test taker might understand the contents of the sound file, s/he might be
unable to complete the task.
Instructions, like the task itself, need to be trialled and standardised
so that they do not influence a test takers performance. Some examina-
tion boards use the test takers mother tongue in the instructions. This is
particularly appropriate when developing tests for children, on the basis
that the instructions should not be part of the test. However, care must
obviously be taken in multilingual societies that using the mother tongue
does not disadvantage any test takers.
One way of finding out whether the instructions, including the exam-
ple, have fulfilled their role is by administering a feedback questionnaire
6 How do weknow if thelistening task works? 147
(see 6.1.9) to the test takers as soon as they have finished the trial and
including a question on this issue. Remember that test taker anxiety is likely
to be reduced if the instructions on the sound file match those which
appear at the beginning of the task, as the listener will be able to follow
what is being said with the aid of the written words.
The amount of time test takers need to study the task prior to listening to
the recording, and the amount of time they should have at the end of each
task to complete their answers, is usually included in the task instructions.
When a new test is developed, it is necessary to trial the amount of time
provided to make sure it is neither too short nor too long. Where the for-
mer is the case, the reliability of the test scores can be affected if test takers
simply have insufficient time to read through the items or to complete
their answers; in the latter scenario, it is likely to lead to increased test
anxiety or may encourage the test takers to talk to each other.
Useful evidence can be gathered by the test administrators during the
trial as to whether the test takers appear to be ready when the recording
starts and whether they had sufficient time to complete the questions.
Further information can also be gathered by means of a test taker feed-
back questionnaire. Test developers should not be reluctant to change the
amount of time provided during the trial based on the evidence gathered;
this is one of the reasons for field trialling.
Trial data also reveal insights into how different test methods work. For
example, they provide evidence of which test type the test takers appear to
perform better on and which they find more challenging. They also reveal
which methods are discriminating more strongly (see 6.3.2.2). Where
a test method is unfamiliar to the test takers, this may be reflected in
lower scores and/or in an increased number of no responses. Hence the
importance of including a range of different test methods in the test so as
to minimise any test method effect which might influence the test takers
performance.
The key to short answer (SAQ) tasks, in particular, benefits from being
trialled, as it is impossible for the test developer to predict all valid answers
in this type of task. This is especially the case when main ideas and sup-
porting details are being targeted, as there will be a number of ways that
test takers can respond. The field trial also allows us to witness the extent
to which the test takers' responses reflect what the test developer was
hoping to target in the item; experience has shown that sometimes test
takers produce a totally different answer from that expected, which may
cast doubt on the construct validity of the items. For example, if the
item were designed to test a main idea but some test takers managed to
produce a valid answer using specific information or important details,
it would suggest that the wording of the item had not been successful in
targeting the right type of listening behaviour, possibly due to a lack of
control in the wording. Fortunately, this is one of the advantages of the
trial situation; the test developer has the chance to review and revise the
item, and then to trial it once more.
Where a live test administration involves a large number of markers
working in separate locations, it is useful to include not only an extended
key based on the results of the field trial but also a list of unacceptable
answers. This helps to save valuable time as well as reducing possible
threats to reliability. Deciding on new answers to the key is often a prob-
lem when central marking is not possible. Where the test is a high-stakes
one, thought might be given to the use of a hotline where advice can
be given by a small panel of experts (see Green and Spoettl 2009) who
have been involved in the task development cycle and who have access
to appropriate databases, such as thesauruses, dictionaries, and language
corpora. Where possible such a panel should include a native speaker.
Based on the data collected from the field trial, it is also possible to check
for any type of bias which the items might have in terms of gender, test
taker location, first language and so on. For example, the data resulting
from a task based on a topic which might advantage female students over
male ones can be checked to ascertain whether this is indeed the case.
Items that are found to suffer from any kind of bias should be dropped
from the task as they suggest an unfair playing field and bring into ques-
tion the validity and reliability of the test score (see also 7.5).
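As a first, informal check of this kind, before any formal DIF procedure, facility values can simply be compared across the groups of interest. The sketch below is a minimal illustration in Python using pandas; the data, column names and the gender variable are invented for the example, and this is not the specific bias analysis the author has in mind.

import pandas as pd

# Invented trial data: dichotomous item scores plus a background variable.
trial = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M", "F", "M"],
    "Q1":     [1,   1,   0,   1,   1,   0,   1,   0],
    "Q2":     [1,   0,   1,   1,   0,   1,   0,   1],
})

# Facility value (% correct) per item, broken down by group; large gaps
# between groups flag items for closer inspection (for example, with a
# formal DIF analysis) before any decision to revise or drop them.
by_group = trial.groupby("gender")[["Q1", "Q2"]].mean() * 100
print(by_group.round(1))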
All test takers (as well as other stakeholders) benefit from access to sample
tasks. Such tasks should provide as accurate a picture as possible of the
type of tasks the test takers will meet in the live test in terms of what is
being tested (construct), how it is being tested (method, layout, sound
file characteristics and so on) and how their performance will be assessed
(grading criteria). The tasks that become sample tasks must comply in all
these respects and have good psychometric properties (see6.3.3 below).
In order to ensure this is the case, they need to be trialled.
Sample tasks should not be selected from those tasks which have been
discarded for some reason; on the contrary, they should be good tasks
that will stand up to close scrutiny. This is sometimes seen as a sacrifice,
but it is one which is well worth it in terms of the likely increase in stake-
holders' confidence in the test. In addition to the sound file and the task
itself, a detailed key with justifications should be provided for each test
method as well as information about the targeted listening behaviour (see
also 7.4). It is important to publish as wide a range of sample tasks as pos-
sible so as to avoid any negative washback on the teaching situation, in
other words, to minimise any teaching to the test. It should also help to
prevent the test from becoming too predictable, for example, the test being
made up of an X + Y + Z task every year.
Another reason why field trials are useful is that they make it much easier
for the test developer to select tasks with good statistics, as well as posi-
tive test taker feedback, which can then be put forward to standard-setting
sessions (see 7.2). This qualitative and quantitative evidence reduces the
possibility of the tasks being rejected by the judges.
In order to ensure that the trial takes place under optimal conditions, it
is important to develop administration guidelines. This becomes even
more important when the trial takes place in a number of different ven-
ues. If tasks are administered in different ways (for example, if the time
provided to complete the tasks is inconsistent between locations, if the
instructions are changed, or if the recording is paused in one test venue
and not another), these differences will obviously impact on the confi-
dence one can have in the trial data. Therefore, even before the trial takes
place it is important to develop such guidelines and hold a test adminis-
tration workshop with the people who are going to deliver the trial to
make sure that the guidelines are clearly understood.
In developing test administration guidelines, a number of issues need
to be decided upon:
(See Dörnyei 2003; Haladyna and Rodriguez 2013 for further exam-
ples of feedback questionnaires.)
Even more useful insights into how test takers perceive the test can be
obtained if it is possible to link their opinions with their test performance.
This is not always possible due to anonymity. (See Green 2013, Chapter 5
for more details on how to analyse feedback questionnaire data.)
In a high-stakes test there are many stakeholders who would welcome fur-
ther insights into how the trialled listening tasks are perceived by the test
takers. These stakeholders include students, teachers, parents, school inspec-
tors, school heads, ministry officials, moderators, curriculum developers,
university teachers, teacher trainers, textbook writers and external judges
(standard setters), among others. Trialling makes it possible to share test
takers' perceptions with these interested parties through stakeholder meet-
ings which can, in turn, provide other useful insights for the test developers.
The analysis of the qualitative and quantitative data resulting from the tri-
alled tasks can help the test developers to reassess the test specifications in
an informed way and make changes where necessary. For example, the tri-
als may show that the amount of time allocated for reading the questions
prior to listening to the sound file was insufficient or that a particular test
method was less familiar than expected. In light of this feedback, these
time parameters can be reassessed and changes made to the test specifica-
tions and the decision regarding the use of the test method re-visited.
6.1.12 Summary
It should be clear from all the arguments given above that field trials are
immensely useful to the test developer. Without them s/he is, to a certain
extent, working blind, as s/he has no evidence that the tasks will work appro-
priately. Given the possible consequences of using test scores from untrialled
tasks, there is really no argument for not putting test tasks through field trials.
It is crucial that the test takers used in the trial be representative of the
test population to whom the tasks will ultimately be administered. For
obvious reasons, the test population which is used cannot be drawn from
the pool of actual test takers themselves, but the population should be
as close as possible in terms of factors such as ability level, age, regional
representation, L1(s), gender and so on. How can this be done? Let us
take, for example, a final school leaving examination situation. The best
way to obtain valid and reliable test data is to administer the field trial
in such a way that the test takers see it as a useful mock examination. In
such a scenario, the school leavers would be at approximately the same
stage in their schooling as the target test population. Having field trialled
the tasks on these school leavers, the successful tasks can then be kept and
used after two or three years when the test takers have already left school.
In order for this to happen, test development teams need to trial their
tasks at least one year in advance of the date they are actually needed and
preferably more on a range of school types, regions and locations.
As mentioned above, the trial should take place at roughly the same time of
year as the live test is to be administered so as to simulate similar conditions
in terms of knowledge gained. This is not always possible, of course, as the
period when the live tests are administered will be a very busy time for all
involved (test takers, teachers and schools). However, if there is too large a
gap between the date when the field trial is administered and that when the
live test is normally sat, this can have the effect of depressing the item results.
In other words, the tasks may seem more difficult than they actually are. In
such circumstances, the test developers would need to take this factor into
account when deciding on the suitability of the tasks' difficulty level, which
is obviously likely to be less reliable as it will involve second-guessing as to
how the tasks would have worked if the trial dates had been more optimal.
How large does the trial population need to be? The answer to this question
depends on how high-stakes the test is and how the test scores are going
to be used. If the test results are likely to have high consequential validity
(Messick 1989), for example, the loss of an air traffic controller's licence,
then clearly the larger and more representative the test population, the bet-
ter, as the test developer is likely to have more confidence in the results.
For many test developers, however, and especially for those who work with
second or foreign languages, large numbers are not always easy to find. The
minimum number of cases that might usefully be analysed is 30 but with so
small a number it is very difficult to generalise in a reliable way to a larger
test population. Having said that, it is better to trial a listening task on 30
test takers than none at all, and for many schoolteachers this is likely to be
the most they are able to find. At least with 30 test takers it will be possible
to see whether they have understood the instructions and the teacher should
be able to gain some feedback about the task itself. Where large test pop-
ulations and/or high-stakes tests are involved it is strongly recommended
that data from a minimum of 200 test takers be collected, and if the data
are to be analysed using modern test theory through such programmes as
Winsteps or Facets (Linacre 2016), then 300 test takers would be better as
the results are likely to be more stable and thus more generalisable.
of listening behaviour. This approach is also better for the test takers as a
way of minimising fatigue and possible boredom. Secondly, a selection of
test methods should be included so as to gather information on the dif-
ferent methods, to encourage interest as well as to minimise any possible
test method effect. Thirdly, the total number of tasks has to be carefully
thought through: too many and performance on the last one(s) may be
affected by test fatigue; too few and the trial becomes less economical.
The age and cognitive maturity of the test takers need to be factored into
this decision as well.
Fourthly, once the tasks have been identified, the order they appear in
the test booklet must be agreed upon. The convention is to start with the
(perceived) easier tasks and work towards the (perceived) more difficult
ones. This is also true with regards to the test methods. Those thought to
be more familiar and more accessible should come first, followed by those
which may be more challenging. For example, SAQ tasks are generally
seen as more challenging because the test takers are required to produce
language rather than just selecting one of the options on offer. Ideally,
putting tasks with the same test methods next to each other helps the test
taker save time, but this may not always be possible if the difficulty level
varies to a great extent. It is also important to take the topics into consid-
eration; having two or three tasks all focusing on one particular subject
area could have a negative washback effect on the test takers' interest
level.
Fifthly, the layout of the test booklet itself needs careful consider-
ation. As already mentioned in 4.2, it is good testing practice to use
standardised instructions; where a task requires two pages these should
face each other so that the test taker does not need to turn pages back and
forth while listening. The size and type of font also needs to be agreed
upon so that these can be standardised. Although colour would be attrac-
tive, few teams can afford this and so black and white tends to be the
norm. If pictures are used, then care must be taken that they are repro-
duced clearly.
Sixthly, as part of the test booklet preparation, it may be necessary to
produce multiple CDs or other types of media. The quality of these CDs
must be checked before being used.
One of the first issues which needs to be resolved when holding a field
trial is the actual location (for example, school, university, ministry) and
how suitable it is likely to be in terms of layout, acoustics, light, noise,
heat and so on. These aspects need to be checked by a responsible person
well in advance of the trial itself and changes made as necessary.
Secondly, administrators need to be clear about their responsibilities dur-
ing the trial. Ideally, they should be trained and provided with a set of
procedures to follow regarding invigilation well before the trial takes
place so that any issues can be resolved in advance.
Thirdly, if the test materials have to be sent to the testing venue, this
needs to be organised in a secure way: the materials need to be checked
by someone on arrival and then locked away until the day of the trial in
order to ensure the highest level of security. The equipment used for play-
ing the sound files must be checked and a back-up machine (and batteries
if necessary) made readily available just in case.
Fourthly, in high-stakes trials, the use of a seating plan showing test
taker numbers is to be recommended. This enables the test developer
to check the location of the test taker(s) in question if anything strange
emerges (for example, a number of tasks left completely blank) during
data analysis. Desks should be set at appropriate distances from each
other to discourage cheating; where two test takers have to sit at the same
desk (and this is the case in a number of countries), different versions of
the test paper must be used.
Fifthly, great care must be taken to ensure that no copies of the test
booklet or feedback questionnaire leave the testing room, and that no
notes have been made on any loose pieces of paper. Inevitably, there is
some risk that test takers will remember the topic of a particular sound
file. The risk should be minimal, however, provided the trial takes place
well in advance of the live test so that the test takers who took part in the
trial have already left the school, and also if a large number of tasks can be
trialled (particularly with high-stakes examinations) so that nobody can
predict which tasks will be selected for a future live test.
Finally, all mobile phones should be left outside the testing room. This
is obviously crucial during listening tests.
6.2.6 Marking
Great care must be taken in marking the trialled tasks, particularly those
which might involve subjective judgement such as short answer ques-
tions. For large-scale test administrations, it is recommended that an
optical scanner should be used for the selected response type items and
markers should grade only the constructed response items. However, this
is not practical in the case of small-scale testing. Where a number of
markers are involved in grading the trial results, the following procedure
has been shown to be useful:
Correct answer = 1
Incorrect answer = 0
No answer = 9
4. Selected response items can also be marked this way (0, 1 and 9) but
the actual letter chosen by the test taker (A, B, C or D in MCQ
items, for example), should be entered into the data spreadsheet so
that a distracter analysis can take place (see 6.3.2.1 below).
5. It is recommended that the group as a whole works together on one
task at the beginning; an SAQ task is probably the most useful in
terms of learning how to deal with unexpected answers/anomalies.
6. The markers may have to listen to the sound files to determine
whether a particular answer (not in the key) is correct. Therefore,
copies of the sound file must be made available together with an
appropriate playing device.
7. Where an alternative answer to those appearing in the key occurs, the
marker must call this to the attention of the group leader and a con-
sensus should be reached as to whether it is acceptable or not. Where
it is accepted, all groups should add the new answer to their key.
8. If there is any chance that such an answer has come up before but has
not been mentioned, back papers must be checked and corrected
accordingly in all groups.
9. It is recommended that the group as a whole work as much as pos-
sible on the same task so that any queries can be dealt with while still
fresh. However, markers will inevitably work at different rates so this
will lead to different tasks being marked by people in the same group.
10. When all the listening tasks in the test booklet have been marked, it
is useful if the raters can calculate the total score for each test taker
and place this on the front of the test booklet, for example, Listening
Total = 17. This will help when checking data entry (see 6.3.2 below)
and the markers' calculations can later be corroborated by the statis-
tical programme used.
11. From time to time, it is useful for the person(s) running the marking
workshop to check a random sample of marked test booklets for
consistency. Any anomalies found should be discussed with the
group as a whole.
12. Where there is clear evidence of an insincere (test taker) response pat-
tern, for example, a long string of nonsense answers unrelated to the
task, the test booklet should be set aside in a separate box for the ses-
sion's overall administrator to judge whether or not it should be marked.
13. Once all the listening tasks have been marked, and a random sample
of test booklets has been checked, data entry can begin.
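To make the coding conventions above concrete, the following is a minimal sketch of how trial data might be entered and totalled, written in Python with pandas. The column names, data and file layout are invented for illustration and do not represent the author's own spreadsheet template.

import pandas as pd

# Invented trial data: one row per test taker, raw responses per item.
# Constructed response (SAQ) items are entered as 1 (correct), 0 (incorrect)
# or 9 (no answer); MCQ items keep the letter chosen so that a distracter
# analysis remains possible (an empty string stands for no answer).
raw = pd.DataFrame({
    "test_taker": ["T001", "T002", "T003", "T004"],
    "saq_q1":     [1, 0, 9, 1],
    "mcq_q1":     ["B", "C", "", "B"],
    "mcq_q2":     ["A", "A", "D", "A"],
})

KEY = {"mcq_q1": "B", "mcq_q2": "A"}   # answer key for the MCQ items

# Recode the MCQ letters to 1/0/9 for scoring while keeping the raw
# letters untouched in their original columns.
scored = raw.copy()
for item, key in KEY.items():
    scored[item + "_score"] = raw[item].map(
        lambda x, key=key: 9 if x == "" else int(x == key)
    )

# Listening total per test taker (the no-answer code 9 earns no marks).
score_cols = ["saq_q1"] + [item + "_score" for item in KEY]
scored["listening_total"] = scored[score_cols].replace(9, 0).sum(axis=1)
print(scored[["test_taker", "listening_total"]])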
A number of people reading this book will probably quail at the idea of
getting involved in any kind of statistical analysis however simple it may
be. As mentioned in Green (2013), the most important thing to remember
is that the results of the analyses you carry out can be directly applied to
the tasks you have painstakingly developed. This makes understanding the
numbers so much easier. By spending copious amounts of time on devel-
oping and trialling tasks, but then leaving the data analyses to others who
have not been involved in the test development cycle, you will lose immea-
surably in terms of what you can learn about your tasks, your test develop-
ment skills and subsequent decision making. Conversely, you will gain so
much more by taking on the challenge that data analyses can offer you.
Item analysis is one of the first statistical procedures that you as a
test developer should carry out on your trialled tasks once data entry
is complete and the data file has been checked for errors. (See Green
2013, Chapters1 and 2 for more details regarding these procedures.)
This is because it provides information on how well the items and the
tasks have performed in the trial. It does this, firstly, by telling us which
items the test population found easy and which they found difficult.
This information should be compared to your expectations; where dis-
crepancies are found (for example, where a task which you expected
to be easy turned out to be one of the more difficult ones, or vice versa),
the findings need to be investigated and a reason for any differences
found.
Secondly, item analysis enables us to see how particular test methods
are working. For example, we can see how many items are left blank across
the various test methods. Thirdly, the data can also show us the extent
to which the distracters in the multiple choice and multiple matching
tasks are working. Fourthly, item analysis can tell us which kind of test
takers (stronger/weaker) are answering the items correctly and which are
not. In other words, it will tell us whether the items are discriminating
appropriately between the test takers, with the stronger ones answering
the items correctly, and the weaker ones not. Fifthly, item analysis can tell
us to what extent the items are working together, that is, whether all the
items seem to be tapping into the same construct (for example, listening
for specific information and important details) or whether some appear
to be tapping into something else (for example, the test taker's knowledge
of geography, mathematics and so on) and thereby introducing construct
irrelevant variance into the test.
All of the above helps the test developer immensely in determining
whether their items are performing as they had hoped and to what extent
they are providing an accurate picture of the test takers ability in the
targeted domain.
attracted more than 7 per cent of the test population. Interestingly, how-
ever, 14.7 per cent of the test population have selected no answer at all.
This relatively high (more than 10 per cent) proportion of no answers needs
investigating. There is a similar pattern in question 4, though the item is
slightly easier (facility value = 48.4 per cent).
In question 5, the item has a facility value of 50 per cent, but one of
the distracters (A) is not working. Only 2.7 per cent of the test takers
failed to answer this question. Question 6 follows a similar pattern with
a slightly easier facility value (58.7 per cent) and only 3.3 per cent no
answers.
The test takers found question 7 much easier (facility value = 74.5 per
cent), but again two of the distracters (C and D) were quite weak (3.3
and 2.7 per cent, respectively). In question 8, the facility value was 36.4
per cent, but more test takers chose B (37 per cent), suggesting that the
distracter was working too well and needs investigating. Distracter A was
also weak (2.2 per cent) in this item.
Summary
The facility values in the task range from 82.1 per cent to 35.9 per cent. If
this task is supposed to be targeting one ability level, say CEFR B1, these
findings would suggest that some items are not at the appropriate level. A
number of the items have weak distracters (attracting less than 7 per cent
of the test takers) and there are two items that have more no answers
than one might expect. One distracter was stronger than the key (item 8),
though at this stage we do not know who chose B and whether these were
the weaker or the stronger test takers. All of the above needs to be inves-
tigated but first let us turn to stage two of the item analysis to see what
else can be learnt before making any final decisions regarding these items.
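For readers who want to reproduce this kind of stage-one analysis themselves, the sketch below shows one way of computing facility values, no-answer rates and distracter percentages from letter-coded MCQ responses. It is written in Python with pandas; the data and item names are invented, so the figures it prints are purely illustrative and are not those discussed above.

import pandas as pd

# Invented letter-coded MCQ responses ("" = no answer), one column per item.
responses = pd.DataFrame({
    "Q5": ["B", "B", "C", "D", "", "B", "C", "B", "D", "B"],
    "Q8": ["B", "C", "B", "B", "A", "C", "B", "", "B", "C"],
})
KEY = {"Q5": "B", "Q8": "C"}   # answer key

for item, key in KEY.items():
    col = responses[item]
    facility = (col == key).mean() * 100    # facility value (% correct)
    no_answer = (col == "").mean() * 100    # % of the population giving no answer
    print(f"{item}: facility = {facility:.1f}%, no answer = {no_answer:.1f}%")
    # Distracter analysis: percentage of the population choosing each option.
    for option in sorted(set(col) - {""}):
        pct = (col == option).mean() * 100
        label = " (key)" if option == key else ""
        print(f"    {option}{label}: {pct:.1f}%")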
Discrimination tells us about the extent to which the items in a task are able
to separate the stronger test takers from the weaker ones. What we are hoping
to see is that the better test takers answer more items correctly than the weaker
ones; this is what is referred to as positive discrimination. Discrimination is
calculated by looking at how well a test taker performs on the test as a whole
compared with how s/he performs on a particular item. For example, if a test
taker does well on the test as a whole, one would expect such a test taker to
answer an easy or average item correctly and probably get only some of the
most difficult ones wrong. When this does not happen, when good test takers
answer easy items incorrectly (perhaps due to a flaw in the item or through
simple carelessness), we might find a weak discrimination index on those
particular items. On the other hand, if a test taker does poorly on the test as
a whole, it is more likely that such a test taker will answer a difficult or an
average item incorrectly and probably get only the easier ones correct. Again
where this is not the case, we might find weak discrimination on the particu-
lar items concerned. (Obviously, in either of the above scenarios, where this
happens with only one or two test takers in a large test population, there is
likely to be little impact on the discrimination index of the items involved.)
Discrimination is measured on a scale from -1 to +1. A discrimina-
tion figure of +0.3 is generally accepted as indicating that an item is dis-
criminating positively between the stronger and the weaker test takers.
Depending on how the scores are to be used (high stakes versus low stakes
tests) a discrimination index of 0.25 may also be seen as acceptable (see
Henning 1987). Where the discrimination figure is below 0.3 (or 0.25),
the item should be reviewed carefully as it might be flawed. For example,
the item may have more than one answer (MCQ), no answer, be guessable
by the weaker test takers or have ambiguous instructions. Alternatively,
the item may be tapping into something other than linguistic ability. In
this case the item should be checked for construct irrelevant variance.
It should be remembered that in an achievement test, the discrimina-
tion figures may be low simply because all the test takers have under-
stood what has been taught and have performed well on the test. In other
words the items cannot separate the test takers into different groups, as
the amount of variability between them is too small. Popham (2000)
offers this useful table regarding levels of discrimination:
.40 and above  Very good items
.30 to .39     Reasonably good but possibly subject to improvement
.20 to .29     Marginal items, usually needing and being subject to improvement
.19 and below  Poor items, to be rejected or improved by revision
Let us have a look at the same eight MCQ listening items as in 6.3.2.1
and see what this stage of item analysis can tell us. In IBM-SPSS, dis-
crimination is referred to as corrected item-total correlation (or CITC):
Corrected Item-Total Correlation
Q1 .314
Q2 .340
Q3 .223
Q4 .312
Q5 .280
Q6 .249
Q7 .251
Q8 .203
What can we learn from Figure 6.7? If we use the lower parameter of
0.25 (Henning 1987), we can see that there are two items that fail to
reach this level: items 3 and 8 (item 6, when rounded up, would result
in 0.25). You will remember from Stage 1 that item 3 was the item that
nearly 15 per cent of the trial population failed to answer. This suggests
that perhaps the item and/or that part of the sound file was problematic
in some way for the test population. This finding again suggests that the
item needs to be investigated. In item 8, more test takers chose distracter
B than the key C, and the weak CITC in Figure 6.6 suggests that at
least some of these were the better test takers. Again this finding needs
exploring.
Summary
All but two of the items have satisfactory discrimination values (above
0.25). Items 3 and 8 need examining to reveal the reasons behind their
weak statistics.
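The corrected item-total correlation reported by IBM-SPSS can also be reproduced with a short script: each item is correlated with the total score of the remaining items, so that the item is not correlated with itself. The sketch below uses Python with pandas and invented 0/1 item scores; the 0.25 threshold follows Henning (1987) as discussed above, and the output is illustrative only.

import pandas as pd

# Invented dichotomous (0/1) item scores: rows = test takers, columns = items.
scores = pd.DataFrame({
    "Q1": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "Q2": [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
    "Q3": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
    "Q4": [1, 1, 1, 1, 0, 1, 0, 0, 1, 0],
})

total = scores.sum(axis=1)

# Corrected item-total correlation (CITC): correlate each item with the
# total score minus that item. With 0/1 items this is a point-biserial
# correlation, the figure IBM-SPSS labels 'Corrected Item-Total Correlation'.
for item in scores.columns:
    rest = total - scores[item]
    citc = scores[item].corr(rest)
    verdict = "acceptable" if citc >= 0.25 else "review"
    print(f"{item}: CITC = {citc:.3f} ({verdict})")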
will not be so closely related in terms of what is being targeted (the con-
struct). In other words, a test taker may do well when his/her linguistic
knowledge is being targeted but when s/he also has to use mathematical
knowledge, s/he may respond in a different way to the item. This will be
reflected in the Cronbach Alpha value for that item if a significant pro-
portion of the population has experienced this problem (see Green 2013,
Chapter 3 for more on this issue).
Figure 6.8 shows us the Cronbach Alpha values for the task as a whole
(top table) and for the eight individual MCQ items (bottom table). In
order to understand the figures in the second table we need to look at the
two Cronbach Alpha values together.
Figure 6.8  Reliability statistics for the eight-item task

Reliability Statistics
Cronbach's Alpha    N of Items
.561                8

Item    Cronbach's Alpha if Item Deleted
Q1      .518
Q2      .503
Q3      .543
Q4      .513
Q5      .524
Q6      .535
Q7      .534
Q8      .550
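As with the discrimination figures, these values can be checked outside SPSS. The sketch below, which reuses the hypothetical trial_scores.csv data from the earlier sketch, computes Cronbach's Alpha from the item variances and the variance of the total score, and then recomputes it with each item removed in turn to reproduce the 'Alpha if Item Deleted' column.

import pandas as pd

responses = pd.read_csv("trial_scores.csv")  # hypothetical 0/1 item data, as before

def cronbach_alpha(items):
    """Cronbach's Alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_variances = items.var(ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

print(round(cronbach_alpha(responses), 3))  # Alpha for the task as a whole

for item in responses.columns:
    # dropping an item and recomputing gives 'Alpha if Item Deleted'
    print(item, round(cronbach_alpha(responses.drop(columns=item)), 3))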
Summary
We have now analysed how the items perform in terms of their facility
values, discrimination indices and internal consistency. What conclusions
have we come to? At the facility value stage, item 3 appeared to be more
difficult, which might be interpreted as suggesting that it does not belong
to the same level of difficulty as the other items. Its discrimination power
was also a little weak (0.223) and it contributed little to the overall alpha.
This suggests that the item should be reviewed. Item 8 was also seen to
be problematic at the facility value stage where one of the distracters was
selected by more test takers than the key. In terms of discrimination it
was the weakest (0.203) of all the items and contributed least to the task's
internal consistency. It should also be reviewed.
One final statistic which provides useful insights into how your task is
performing is the average score that the test takers achieved; in other
words, the mean. IBM-SPSS provides this information as part of the reli-
ability analysis and the figure is shown in Figure 6.9 below:
Figure 6.9  Mean score for the task

Mean    N of Items
4.46    8
This table tells us that the average score among the 184 test takers who
took the task was 4.46 out of a possible 8, or, in percentage terms, 55.7
per cent, suggesting that the task was neither very easy nor very difficult
for this test population. This statistic should be matched against your
expectations of how difficult or easy you expected the test takers to find
the task.
In light of the outcomes of the item analysis, there are usually three pos-
sible routes the task can take: it can be banked for future test purposes;
it can be revised; or it can be dropped. Quantitative and qualitative data
from test taker feedback questionnaires (see 6.1.9) should also be taken
into account when making this decision. Where it is felt that an indi-
vidual item should be dropped due to weak statistics, care must be taken
to ensure that this does not impact on the other items by, for example,
creating a lengthy unexploited gap in the sound file which could, in turn, cause confusion or anxiety that affects the test takers' performance. Any revisions which are made to the task will need to be re-trialled, as solving one issue could result in creating another unforeseen problem.
It goes without saying that item analysis should take place not only at
the field trial stage but also after the live test administration to confirm
the decisions taken about the items and tasks, and to provide further use-
ful feedback to all stakeholders including the test developers.
6.4 Conclusions
The wealth of insights that trialling and data analyses offer to the test
developer is immeasurable. In your own test development situation, you
might not be able to do everything that has been discussed in this chap-
ter, but the more you can do, the more confidence you will have in the
tasks that you and your colleagues create and the test scores that they
produce.
DLT Bibliography
Bachman, L. F. (2004). Statistical analyses for language assessment. Language
Assessment Series. Eds. J.C. Alderson & L.F. Bachman. Cambridge: CUP.
Buck, G. (2009). Challenges and constraints in language test development. In
J.Charles Alderson (Ed.), The politics of language education: Individuals and
institutions (pp.166-184). Bristol: Multilingual Matters.
Carr, N.A. (2011). Designing and analysing language tests: A hands-on introduc-
tion to language testing theory and practice. Oxford Handbooks for Language
Teachers. Oxford: Oxford University Press.
Dörnyei, Z. (2003). Questionnaires in second language research. Mahwah, NJ:
Lawrence Erlbaum Associates.
Green, R. (2013). Statistical analyses for language testers. New York: Palgrave
Macmillan.
Green, R., & Spoettl, C. (2009). Going national, standardised and live in Austria:
Challenges and tensions. EALTA Conference, Turku Finland. Retrieved from
http://www.ealta.eu.org/conference/2009/docs/saturday/Green_Spoettl.pdf
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test
items. Oxon: Routledge.
Henning, G. (1987). A guide to language testing: Development, evaluation,
research. Cambridge, MA: Newbury House.
Linacre, J.M. (2016). WINSTEPS Rasch measurement computer program version
3.92.1. Chicago, IL: Winsteps.com.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd
ed., pp.13-103). NewYork: Macmillan.
Popham, W. J. (2000). Modern educational measurement (3rd ed.). Boston:
Allyn & Bacon.
7
How do we report scores and set pass marks?
The first decision you need to make when considering how scores should
be reported is whether your listening test results will be reported as an
individual skill, or as part of a total test score including other skills such
as reading, language in use, writing and speaking. Your answer needs to
take into account such factors as the purpose of the test and how the test
results are to be used. For example, if the purpose of the test is diagnos-
tic, placement or achievement, there are good reasons for the skills to
be reported separately. In a diagnostic test, the more information you
can obtain about a test taker's strengths and weaknesses the better; collapsing the scores will result in a lot of useful information being hidden.
The results of a placement test are generally used as the basis for deter-
mining which class is appropriate for a test taker. Clearly, having more details will help, particularly if the classes are subdivided for the teaching
of different skills. The results of an achievement test are usually fed back
into the teaching and learning cycle. Receiving information on individual
skills would help the teacher to decide which particular skills need further
attention.
If the test has been designed to assess a test taker's proficiency, however,
a global score might be more useful. This is especially true if it is to be
sent to end-users such as tertiary level institutions or prospective employ-
ers. Having said that, if a particular course is linguistically demanding,
the receiving department might well be more interested in the profile of
the test taker's abilities so they can more easily judge whether the student
will be able to cope with various aspects of the course.
Having access to both types of results (separate and overall) seems
to be the most practical option and is the approach which some inter-
national examinations take. For example, in IELTS (the International
English Language Testing System) the test taker is awarded a band from
1 to 9 for each part of the test: listening, reading, writing and speaking.
The bands are then averaged to produce the overall band score. All five
scores (four individual and one overall) appear on the certificate the test
takers receive. Some examination boards also report scores at the sub-skill
level. For example, the Slovenian Primary School National Assessment
in English reports performance on listening for main ideas, listening for
details and so on.
Some professions also prefer a breakdown of results and go so far as to
advertise job openings citing the specific linguistic requirements neces-
sary in each skill. For example, to qualify for posts within SHAPE (the
Supreme Headquarters Allied Powers Europe), candidates need to show
that they have the required SLP (Standardized Language Profile) for that
particular post. If the necessary SLP were 3332, for instance, this would
mean that the candidate would need a Level 3 in Listening, a Level 3 in Speaking, a Level 3 in Reading and a Level 2 in Writing. (STANAG Level 2 = Fair: limited working; STANAG Level 3 = Good: minimum professional; see Green and Wall 2005: 380.)
Whether you choose to report both sets of scores, or just the global
result, you will also need to decide whether a compensatory approach
should be allowed. This is where a test taker's weak performance in one
skill can be helped by a stronger performance in another skill. Let us take,
for example, a test taker whose performance across the four skills, based
on the CEFR language descriptors, was as follows: C1 in reading, B1 in
into the final results. Some educational systems provide an online cal-
culator into which schoolteachers can feed the raw numbers for each of
the various skills being tested. The calculator then takes those figures and
produces the final result, having factored in any necessary weighting.
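The arithmetic behind such a calculator is straightforward. The sketch below shows one possible implementation; the skill names, raw scores, maximum scores and weights are all invented for illustration and would need to reflect your own test and weighting policy.

def weighted_total(raw_scores, max_scores, weights):
    """Convert each skill's raw score to a percentage and combine them using the given weights.

    All three arguments are dictionaries keyed by skill name; the weights should sum to 1.0.
    """
    return sum(
        weights[skill] * 100 * raw_scores[skill] / max_scores[skill]
        for skill in raw_scores
    )

# Hypothetical weighting: listening and reading 30 per cent each, writing and speaking 20 per cent each
final_result = weighted_total(
    raw_scores={"listening": 19, "reading": 32, "writing": 14, "speaking": 16},
    max_scores={"listening": 30, "reading": 40, "writing": 20, "speaking": 20},
    weights={"listening": 0.3, "reading": 0.3, "writing": 0.2, "speaking": 0.2},
)
print(round(final_result, 1))  # the combined, weighted percentage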
Numbers alone will have no meaning unless they are accompanied by some
informed expert judgement about what the numbers actually mean given a
typical population and bearing on different aspects of the testing process.
Other stakeholders advocate the use of letters when reporting test takers' scores, but are these really any better? For example, what does an A
mean? Is the difference between A and B the same as the difference
between C and D? And the perennial question: is a performance which
is awarded a grade A on X test the same as a grade A awarded on Y test?
In other words, we seem to be in a similar predicament to that of scores
being reported as numbers above. Without some accompanying state-
ment as to what A means in the context of a given examination, we are
really none the wiser. What about scores which are reported as percent-
ages? Do they provide a clearer picture? Unfortunately, if a test taker gets
75 per cent on a test, you still need to know what the 75 per cent relates
to in terms of content in order to allocate some meaning to that figure.
Which leaves us with the crucial question: who determines the stan-
dard? Having been involved in the development of the test items, the test developer will find it quite difficult to do this in an objective way. This
means that ideally the decision makers need to come from outside the
task development team and yet they also need to have a clear understand-
ing of the context in which the standard is to be applied. No single per-
son can do this reliably; this is where procedures such as standard setting
can help enormously (see 7.2).
The pass mark in a test is perhaps a more traditional way of talking about
the standard. It is no easier to set, however. A decision still needs to be
made regarding what constitutes sufficient evidence to state with confi-
dence that a test taker has reached the required level, and therefore can
be awarded a pass. The actual pass mark in many school examinations
seems to be somewhat arbitrary; personal experience has shown that this
can range from as low as 32 per cent up to 65 per cent. As Alderson et al. (1995: 155) remark, the pass mark is usually simply a matter of
historical tradition.
Depending on the type of examination you are involved with, you may
have to identify not just one pass mark or cut score, but several within
one test. For example, if you have developed a multi-level test, target-
ing CEFR A1-B2, you will need to decide on the cut scores between A1
and A2, A2 and B1, and B1 and B2 as well as what is considered to be a
performance which is below A1 and which thus cannot be awarded that
CEFR level.
The above scenario would entail making decisions about four cut
scores. This is not an easy task. Some examinations leave such decisions
to the end-users, and simply report the raw score. For example, the most
prestigious universities in a given country may set a very high thresh-
old on a university entrance test for students wishing to study there.
In the Slovenian Primary National Assessment Tests, by contrast, there
is no pass mark; the students receive a report telling them their score
and how well they have done in comparison with the whole population.
Some international English language tests also leave the decision to the
end-user. For example, IELTS reports the results of a test taker's performance, but it is left to the receiving department at a university to decide whether the bands are sufficient for the particular course for which s/he is applying.
For many people working in the assessment field, leaving the deci-
sion to the end-user is not an option. Stakeholders expect informed deci-
sions to be made regarding whether test takers should pass or fail, and/
or whether they have reached the required standard(s). One possible
solution to this dilemma is to carry out a standard setting procedure as
described in 7.2 below. This procedure is of particular relevance to those
who are involved in high-stakes testing but hopefully will be of interest
to all involved in setting standards in their tests.
Standard setting refers to the process of establishing one or more cut scores on a test (Cizek and Bunch 2006: 5). It is a procedure that enables those who are involved to make decisions about which test takers' performances
There are a number of reasons why test development teams should put
their tasks through standard setting. Firstly, the decisions made by the
external judges (see 7.2.4) concerning the appropriateness of the tasks for
measuring the targeted criteria are invaluable in helping the facilitators,
who are in charge of the standard setting session, to determine the stan-
dard required by the test takers. In other words, the procedure makes it
possible for the facilitators to identify the minimum cut score which a test
taker needs to reach in order to be at the required standard or level in a
particular examination (see 7.2.9). (Unfortunately, these minimum cut
scores are not always put into practice by the relevant educational systems.)
A second reason for putting the tasks through this procedure is that the
judges can provide informed feedback on the quality of the tasks. This
can include insights into the appropriateness of the sound files in terms
of the accents used, the speed of delivery and the topics. Information
about the suitability of the task methods with respect to the test tak-
ers level of familiarity, and the relationship between the tasks and the
targeted construct, can also be obtained. In addition, feedback on the level of difficulty of both the sound file and the task, and on how well they reflect the targeted standard, is a further useful benefit such sessions can produce. All of these insights can be channelled back into the task devel-
opment cycle (see 1.7.1) by the sessions facilitators after the standard
setting procedure is complete.
If you are thinking of carrying out a standard setting session, you should
be aware that there is a substantial amount of preliminary work to be
done before it can take place. First of all, identifying experts who can ful-
fil the requirements of being a standard setting judge is time-consuming,
and this work must be carried out well before the session takes place.
(See 7.2.4 for a discussion regarding the pre-requisites of being a judge.)
Putting this phase into effect a year in advance is really not too soon as
the people you will probably want to invite as judges are likely to be busy.
As mentioned in 7.1.5, it is not recommended that test developers be
called upon as judges due to the difficulties they would face in remaining
objective during the standard setting procedure.
Once the judges have been identified, they need to be contacted and
their availability for the whole of the standard setting session must be
confirmed. A judge who wants to leave halfway through the sessions, or
dip in and dip out, causes mayhem for the final decision-making pro-
cess. Moreover, such judges leave with only a partial picture of not only
their own role in the process, but of the purpose of standard setting as a
whole.
Second, it helps to appoint an administrator who will be in charge of
such issues as the venue where the standard setting sessions will be held,
hotel accommodation, travel, per diem and so on.
Third, members of the testing team need to decide which tasks should
be presented at the standard setting session. These tasks should have
good qualitative and quantitative statistics, have been banked after field
trialling (see Figure 1.2) and reflect the targeted standard. Including tasks
which fail on any of these criteria would be an extremely inefficient use
of resources (the tasks are likely to be rejected by the judges) and lead to
reliability issues in terms of cut score decisions (see 7.2.9).
Once appropriate tasks have been identified, a judgement needs to be
made regarding how the task will appear in the test booklets. This will depend,
of course, on which standard setting method is to be used (see 7.2.6). For
example, if the Bookmark Method is to be followed, the tasks need to be
placed in order of difficulty; if a modified Angoff method is selected, it is
usually more practical to organise the tasks by test method to save time.
In addition to creating the test booklets, the testing team will need to
prepare the following documents:
- Copies of the sound files in the order in which the tasks appear in the judges' test booklets. These should include the task instructions. The amount of time provided should replicate the conditions under which the test takers completed the tasks.
- The key for each of the tasks in the test booklets.
- The language descriptors and global scale tables against which the tasks are to be standard set.
- The rating sheets which the judges will use to record their judgements, including those which contain the field trial statistics (see 7.2.7).
- Copies of the familiarisation exercise (see 7.2.5).
- Copies of the evaluation sheets for judges to provide feedback to the facilitators on the session, including their confidence in the ratings they have given.
- Copies of a confidentiality agreement (high-stakes situations).
It is crucial that those judges who are selected to attend the standard set-
ting session have the necessary qualities to carry out that role. They should
be regarded as stakeholders and be as representative as possible in the
given context. For example, in a school leaving examination, the judges
are likely to include some or all of the following: school and university
teachers, teacher trainers, school inspectors, headmasters and ministry
officials. Where the test is a national one, selecting judges from various
parts of the country is also recommended so as to avoid any question of
possible bias. Finally, if resources permit, it is useful to invite an external
participant, that is, someone from outside the immediate context (pos-
sibly from another country) who can bring an external perspective to the
session.
Finding such a range of judges is not easy as they need to have not
only a certain level of ability in the targeted language (at least one level
higher than that being targeted and preferably more), but also a sound
knowledge of the relevant system within which the tasks they are to judge
are situated. For example, the judges mentioned above would need to
be familiar with the educational context the tasks will be used in. The
judges also need to be familiar with the language descriptors against
which the test items are to be measured, for example, the CEFR, ICAO,
or STANAG among others.
In addition to the above prerequisites, judges must also be able to fill
the role of a judge. To do this, they must have the capacity to set aside
their own ability in the language being targeted. In other words, they
must ignore what they personally find easy or difficult, as well as what
their own students might, and focus purely on the scale against which the
tasks are to be measured.
Finally, as mentioned in 7.2.3.1, they must be able to devote sufficient
time to the procedure. Standard setting sessions can last up to five days
depending on the number of skills and tasks being tabled, and judges
who cannot commit to the whole period of standard setting should not
be invited (see Cizek and Bunch 2006, Chapter 13 for more insights on
the participant selection process).
tators to confirm that the judges are indeed familiar with the language
descriptors that are to be used in the session as it is their judgements
which will be factored into the cut-score decisions after the standard set-
ting procedure is complete (see 7.2.9). This confirmation is normally
achieved by asking the judges to complete a familiarisation exercise on
the first morning of the procedure. The exercise can take various forms,
but one of the most popular ones involves the judges being given a list
of randomised descriptors taken from the scales they are to set the tasks
against. Equipped with rater numbers to protect their anonymity, judges
are then asked to put one scale level against each of the descriptors.
Figure 7.1 below shows an extract from such an exercise based on the
CEFR.
[Figure 7.1 (extract): a list of CEFR descriptors, one per row, with columns headed 'Your Answer' and 'Key' for the level assigned by the judge and the intended level]
Once the judges have completed the column with their responses, the
papers should be collected in. The judges' responses are then entered into
a spreadsheet and projected onto a screen so that all participants can
see how the descriptors have been rated. Discussion of the various rat-
ings, as well as clarification regarding any perceived ambiguities in the
descriptors, then follows with the key being revealed at the end. Where
any of the judges are shown to have an unacceptable lack of familiarity
with the descriptors, the facilitators must decide whether they should
remain in the pool of raters. (Where a number of skills are being standard
set within one session, this familiarisation procedure should be repeated
with descriptors from each skill.)
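One simple way for the facilitators to see how closely each judge's familiarisation answers match the key is sketched below; the judge numbers, descriptor labels and levels are invented purely for illustration, and a low agreement figure would only be a prompt for discussion, not an automatic exclusion.

import pandas as pd

# Hypothetical familiarisation responses: one row per judge, one column per descriptor
ratings = pd.DataFrame(
    {"d1": ["B2", "B2", "B1"], "d2": ["B1", "B1", "B1"], "d3": ["C1", "B2", "C1"]},
    index=["judge_01", "judge_02", "judge_03"],
)
key = pd.Series({"d1": "B2", "d2": "B1", "d3": "C1"})  # the intended levels

# Proportion of descriptors each judge placed at the same level as the key
agreement = (ratings == key).mean(axis=1)
print(agreement.sort_values())  # unusually low values may signal a lack of familiarity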
Where a pool of standard setting judges can be established, and can
be called upon on an annual basis, this is obviously of great benefit to
the facilitators as it cuts down on the amount of time needed for training and familiarisation in the standard setting session. It also makes it
possible to compare the difficulty level of tasks year on year, and even
across languages where there are a sufficient number of multilingual
judges available (see Green and Spoettl 2011). Ideally, all tasks which
are used in high-stakes tests should go through some form of external
review which ultimately means holding a standard setting session every
year. For practical reasons, unfortunately, this does not happen in many
countries.
Trial statistics provide a useful measure against which the judges can
compare the ratings they have assigned to each item once their judge-
ments have been completed (see step 16 below in 7.2.8). Although it is
the language descriptors which should be the final arbiter in deciding the
difficulty level of an item, judges are sometimes unwittingly influenced
by some characteristic of the task and/or the sound file. The field trial
statistics provide empirical evidence of how the tasks performed which,
in turn, should help highlight any personal reaction to an item or task
and prompt the judge to review their rating(s).
When revealing the statistics, the judges are usually supplied with
information about how many test takers answered the item correctly
(facility values), how the test methods performed and, where feedback
questionnaire data are available, how the test takers perceived the tasks.
Details about the test takers are also supplied including the numbers
involved, their representativeness of the target test population as a whole,
their appropriateness in terms of targeted ability level and the time of
year the field trial was administered in case this has had any impact on
the difficulty level of the items (see 6.2.2).
10. The judges are reminded that the purpose of standard setting is not
to discuss the quality of the items they are going to judge, but simply
to place each of the items at a particular CEFR level. (At the discre-
tion of the facilitators, time may be set aside for task discussion once
the ratings are complete and have been submitted so as not to disrupt
the procedure.)
11. The judges are provided with the first test booklet and asked to apply
a level to each test item in each task based on the sound files they will
hear and using the language descriptors and global scales. This is
known as Round 1.
12. The keys to the items are distributed. The judges check their answers
and, where necessary, review the CEFR levels they have assigned.
13. The judges' ratings from Round 1 are entered into a spreadsheet.
14. The levels awarded by the judges are looked at globally (and anony-
mously) on screen.
15. The average ratings per item across the judges are discussed, as well as any outliers (those who have assigned extreme levels in comparison with the rest of the judges); one way of computing these averages is sketched after this list. During the discussion, individual judges can provide their rationale for assigning a particular level if they so desire, but this is not compulsory.
16. The statistics from the field trial are provided and discussed in rela-
tion to the judges' ratings.
17. The judges are given an opportunity to make adjustments to their
Round 1 judgements if they so wish in light of the discussion and the
field statistics. There is no obligation to do so. These become the
Round 2 ratings.
18. The Round 2 ratings are entered into a spreadsheet for use in the cut
score deliberations after the standard setting procedure is complete.
19. The judges repeat the above process with further test booklets as
necessary.
20. The judges complete a final evaluation form providing feedback on
their level of confidence in, and agreement with, the final recom-
mended level of the items.
21. The standard setting facilitators review the judges' decisions regard-
ing the difficulty level of the items and their feedback on the
session.
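The spreadsheet work described in steps 13 to 15 can be done in any package; the sketch below shows one possible way of averaging the Round 1 ratings per item and flagging ratings that sit well away from the item average. The scale mapping, judges, items and levels are all invented for illustration.

import pandas as pd

# Hypothetical Round 1 ratings: one row per judge, one column per item, CEFR levels as text
round1 = pd.DataFrame(
    {"item_1": ["B2", "B2", "A2", "B2"], "item_2": ["B1", "B2", "B2", "C1"]},
    index=["judge_01", "judge_02", "judge_03", "judge_04"],
)

scale = {"A2": 2, "B1": 3, "B2": 4, "C1": 5}  # numeric codes so the levels can be averaged
numeric = round1.replace(scale)

print(numeric.mean())  # average rating per item across the judges

# Flag ratings more than one level away from the item average as possible outliers
outliers = (numeric - numeric.mean()).abs() > 1
print(outliers)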
Once standard setting is complete, the data entry from Round 2 should
be checked and analysed to ascertain the overall level of each task. Once
this has been done, those tasks which have been judged to be above or
below the targeted level should be set aside. The facilitators then need to
make an initial selection from the remaining tasks as to which ones might
be the most appropriate for use in the live test.
In making this selection the facilitators need to factor in the field sta-
tistics in light of their suitability: the time of year when the trial took
place and hence the test takers' motivation, as well as how well they rep-
resent the target test population. The facilitators also need to take into
consideration the degree of confidence they have in the judges' ratings. For example, they should take into account the judges' knowledge of the language descriptors used, their previous exposure to standard setting procedures, the judges' own confidence in the levels they have awarded,
and the relationship between their judgements and the available empiri-
cal data.
The above procedure should result in identifying the most eligible
tasks. Sometimes, however, even these tasks might contain one or two
items on which the judges did not completely agree. For example, some
judges may have given an item a B2 rating, while others gave it a B1
rating. As mentioned in 2.5.1.4, it is not unusual for a task to include
an item which is either slightly easier or slightly more difficult than the
others. However, when such items are to be included in a live test, further
deliberation is necessary to decide how these might affect the cut score.
Let's look at an example. In a B2 listening test made up of four standard set tasks, the judges' ratings have indicated that there are five B1
items, and 25 B2 items. If we work on the hypothetical basis that a test
taker who is at B2 should be able to get 60 per cent of the B2 items cor-
rect, as well as 80 per cent of the B1 items, this would mean that the test
taker would need to answer 19 items correctly (15 at B2 plus 4 at B1) in
order to be classified as a B2 listener. A score of 19 out of 30, or 63.3 per
cent, would therefore be the cut score which would divide the B2 listen-
ers from the B1 listeners on these four particular tasks.
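Expressed as a quick calculation (using the hypothetical proportions from the example above, not a fixed rule):

# Hypothetical cut score calculation for the example above
b2_items, b1_items = 25, 5
required_correct = 0.60 * b2_items + 0.80 * b1_items  # 15 B2 items plus 4 B1 items = 19
total_items = b2_items + b1_items
print(required_correct, total_items, round(100 * required_correct / total_items, 1))  # 19.0 30 63.3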
Website materials
2. Test specifications
d. Assessment criteria
The more students know about the content and aims of a test, the more likely
they are to be able to do themselves justice in the examination hall.
Even though all the listening tasks which appear in the live test book-
lets should have gone through field trials, statistical analyses, and ideally
some form of standard setting prior to being selected, it is still impor-
tant to analyse their live test performance. This is because the field trials
will necessarily have been carried out on test takers whose motivation may have differed from that of the live test population, and therefore it is possible that the facility values might have changed.
It is recommended that the same analyses be carried out on the live test
results as those described in 6.3.2, that is frequencies, discrimination and
reliability analyses. Since the test population is likely to be much larger
than at the field trial stage, it should prove both useful and insightful to
with a score of 58, for example, could have a real score of between 56 and
60; a test taker with a score of 60 could have a real score of between 58
and 62, and so on. In order to be fair, all such borderline cases need to be
reviewed and their results confirmed before the final test scores are released.
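Flagging such borderline cases can easily be automated. The sketch below assumes, purely for illustration, a cut score of 60 and a band of plus or minus two score points to match the example above; in practice the width of the band would normally be based on the measurement error of the test.

# Hypothetical borderline check: flag any result whose band straddles the cut score
cut_score, band = 60, 2  # illustrative values only
scores = {"cand_01": 58, "cand_02": 64, "cand_03": 61, "cand_04": 55}

borderline = {cand: score for cand, score in scores.items() if abs(score - cut_score) <= band}
print(borderline)  # these results should be reviewed before the final scores are released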
7.5.2 Recommendations
In addition to providing insights into how the tasks have performed, the
post-test report should provide a list of recommendations. These might
include observations about the tasks themselves in terms of the test meth-
ods used, the topics, the amount of time provided to read and complete the
task, the level of difficulty inter alia. Although such issues will have been
analysed and reported on after the field trials, it is still useful to revisit these
aspects of the test if only to confirm that they are all working as expected.
The report might usefully include details about any test administration
issues which have come to light. For example, concerns regarding the acous-
tics at the test venue(s), the delivery of the test material, timing issues, and,
where possible, feedback from the test administrators and test takers. The
marking of the live test might also result in further recommendations regarding grading issues, including online support (for example, a hotline or email).
Final thoughts
The main objective behind developing good listening tasks is to produce
valid and reliable test scores. As Buck reminds us (2009: 176):
There is no such thing as a perfect test, but in following all the stages outlined in this book, I would argue that we have a much better chance of getting it right than if we had not done so.
DLT Bibliography
Alderson, J.C., Clapham, C., & Wall, D. (1995). Language test construction and
evaluation. Cambridge: CUP.
Bhumichitr, D., Gardner, D., & Green, R. (2013). Developing a test for diplo-
mats: Challenges, impact and accountability. LTRC Seoul, Korea: Broadening
Horizons: Language Assessment, Diagnosis, and Accountability.
Buck, G. (2009). Challenges and constraints in language test development. In
J.Charles Alderson (Ed.), The politics of language education: Individuals and
institutions (pp.166-184). Bristol: Multilingual Matters.
Cizek, G. J., & Bunch, M. B. (2006). Standard setting: A guide to establishing and
evaluating performance standards on tests. Thousand Oaks, CA: Sage
Publications, Inc.
Council of Europe. (2009). Relating language examinations to the common
European framework of reference for languages: Learning, teaching, assessment. A
Manual.
Figueras, N., & Noijons, J. (Eds.) (2009). Linking to the CEFR levels: Research
perspectives. Arnhem: CITO.
Fulcher, G. (2016). Standard and frameworks. In D. Tsagari & J. Banerjee
(Eds.), Handbook of second language assessment (pp. 29-44). Boston: De
Gruyter Mouton.
Geranpayeh, A. (2013). Scoring validity. In A.Geranpayeh & L.Taylor (Eds.),
Examining listening. Research and practice in assessing second language listening
(pp.242-272). Cambridge: CUP.
Green, R. (2013). Statistical analyses for language testers. New York: Palgrave
Macmillan.
Green, R., & Spoettl, C. (2011). Building up a pool of standard setting judges: Problems, solutions and insights. EALTA Conference, Siena, Italy.
Green, R., & Wall, D. (2005). Language testing in the military: Problems, poli-
tics and progress. Language Testing, 22, 379-398.
Martyniuk, W. (Ed.) (2010). Relating language examinations to the Common European framework of reference for languages: Case studies and reflections on the use of the Council of Europe's Draft Manual. Cambridge: CUP.
DLT Bibliography
Alderson, J.C. (2009). The politics of language education: Individuals and institu-
tions. Bristol: Multilingual Matters.
Brunfaut, T., & Révész, A. (2013). The role of listener- and task-characteristics
in second language listening. TESOL Quarterly, 49(1), 141-168.
Buck, G. (2009). Challenges and constraints in language test development. In J.
Charles Alderson (Ed.), The politics of language education: Individuals and
institutions (pp. 166-184). Bristol: Multilingual Matters.
Council of Europe. (2001). Common European framework of reference for lan-
guages: Learning, teaching, assessment. Cambridge, UK: Cambridge University
Press.
Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs,
NJ: Prentice-Hall.
Green, R., & Wall, D. (2005). Language testing in the military: Problems, poli-
tics and progress. Language Testing, 22, 379-398.
Harding, L. (2015, July). Testing listening. Language testing at Lancaster summer
school. Lancaster, UK: Lancaster University.
Hinkel, E. (Ed.) (2011). Handbook of research in second language teaching and
learning. NewYork: Routledge.
Linn, R. L. (Ed.) (1989). Educational measurement (3rd ed.). New York:
Macmillan.
Pallant, J. (2007). SPSS survival manual (6th ed.). Maidenhead: Open University
Press.
Tsagari, D., & Banerjee, J. (2016). Handbook of second language assessment.
Boston: De Gruyter Mouton.