22

Testing and evaluation


There are many reasons why we might want to test students, and many types of test. Those that are at the forefront of most students’ and teachers’ minds are the public exams which candidates take in order to get a qualification, and the university entrance exams for which students diligently prepare in order to gain entry to prestigious colleges. Important though they are, these exams are only two types of assessment.
Assessment can, and should, be an integral part of what teachers do. When used appropriately, it helps the students to understand what they can and can’t do, and by doing this, helps them move forward and see clearly what they need to do next. At its most basic level, this assessment for learning (see 22.1) is the kind of thing that teachers do all the time when they give feedback on what their students say or write (see 8.1). This feedback is designed to help the students to improve their performance, rather than just giving a snapshot of a student’s abilities at a particular time.

22.1 Summative and formative assessment


Snapshot exams, which simply give an idea of what a student can do at any given time, are a regular feature of the lives of schoolchildren and those in higher education. They are examples of summative assessment, which measures the product of a student’s learning. They may be used to find out how much a candidate knows or can do at the age of 11 or 16, for example.
Formative assessment, on the other hand, measures the students’ abilities as part of a process. Crucially, the students as well as the teacher are involved in this assessment. Formative assessment is part of the learning process itself and looks to the future, rather than focusing exclusively on what has been achieved up to a given point in time. For this reason, it is sometimes called assessment for learning (AFL). In the same way that teachers give a different kind of feedback on student writing when it is part of a process than they do to a finished piece of work, so formative assessment focuses on helping the students progress to the next level, rather than simply judging them on what they can do now.
The Assessment Reform Group (see chapter notes on page 424), a UK-based organisation which promotes innovation in testing, suggests ten principles for AFL:
Assessment for learning should:
• be part of effective planning of teaching and learning, where both teachers and students can measure progress towards learning goals.
• focus on how students learn. The students themselves should consider this and understand more about it (see 5.5.1).
• be a key professional skill for teachers. We should be able to analyse and interpret what we observe.
• be sensitive and constructive because any assessment has an emotional impact. Doing well or badly can have profound effects on test takers.

• take account of the importance of student motivation. The way we give results and the way assessments are given can affect how students feel about learning.
• promote understanding of goals and criteria.
• include student consultation about the criteria for assessment. It is essential that students understand what such criteria mean.
• help the students know how to improve.
• develop the students’ capacity for self-assessment so that they become reflective and self-managing.
• recognise the full range of achievement of all learners.
As we saw in 8.1, teachers assess student performance all the time. Our main aim when doing this in class is to help our students to do better. So it is with assessment for learning. But AFL has a deeper underlying aim, and this is that the students should be able to measure their own progress. That is why the CEFR ‘can do’ statements and the more precise Global Scale of English – and other descriptors (see 5.4) – are so important. If students can clearly identify their own strengths and weaknesses, then their learning can be put into their own hands.

22.2 Qualities of a good test


If we are to spend time testing our students (and if they are to have confidence in the tests they are being encouraged to take), then the tests – whether written by us or by some testing authority – need to have three essential characteristics:
Transparency This means that anyone concerned with the test should have access to clear statements about what the test is supposed to measure.
Validity A test is valid if it tests what it is supposed to test. It will only be valid ‘if the test offers as accurate as possible a picture of the skill or ability it is supposed to measure’ (ILTA guidelines – see chapter notes on page 424). Thus, if a test doesn’t give us an accurate picture of what we are trying to evaluate (the knowledge of and ability to use English), then it isn’t much good. We call this kind of validity construct validity.
If we try to test writing ability in English with an essay question that requires specialist knowledge of history or biology – unless it is known that all the students share this knowledge before they do the test – our test (as a test of written English) will be invalid. We call this kind of validity content validity.
A test is valid if it produces similar results to some other measure which is designed to test the same abilities, that is, if we can show that Test A gives us the same kind of results as Test B. We call this kind of validity criterion validity.
Tests need face validity, too. This means that the test should look and seem, on the face of it, as if it is valid. A test which consisted of only three multiple-choice items would not convince the students of its face validity, however reliable or practical teachers thought it to be.
Reliability Reliability refers to the consistency of the test results. Given the same conditions, a test should always give the same results.
In practice, reliability is enhanced by making the test instructions absolutely clear, restricting the scope for variety in the answers and making sure that the test conditions remain constant.
Reliability also depends on how tests are marked and who marks them. This is a significant concern, whether the tests are marked digitally or by human scorers (see 22.5.2).


22.2.1 Washback
Test designers and teachers know that tests have a really powerful effect on what happens
in classrooms. Obviously, teachers will want their students to pass the tests they take, so
teaching and learning often reflect what the tests contain. This results in what is usually
referred to as the washback or backwash effect.
Good tests have a very positive washback effect. For example, if the students were
preparing to take a test which included the item in Figure 1, their teacher would almost
certainly include the skill of summarising in classroom practice. If we believe that
summarising is a useful technique for students to acquire, then the washback from this
test has been good.

Figure 1 Sample from the writing paper from the Pearson Test of Academic English

However, public exams in some countries are still focused almost exclusively on grammar-
based multiple-choice items (see 22.4.1). The washback from these exams can be
problematic since the temptation for teachers to overuse such items in their teaching – under
pressure, perhaps, from the students and their parents – may well override their own beliefs
about what good learning and teaching should be like.
Clearly, test designers need to have the washback effect in their minds when they design
tests, but, equally importantly, teachers need to think carefully about how to counteract the
negative effects of washback when preparing their students to take exams (see 22.6).

22.3 Types of test
There are five main categories of test which teachers and learners of English are likely to
come into contact with:
Placement tests When students sign up for a language course in a private language school,
for example, they usually do a placement test to determine which class they should go into.
Such tests usually try to measure grammar and vocabulary knowledge, as well as evaluating
the students’ reading and listening ability and, where practical, how these correlate with
speaking ability.
Some schools ask students to assess themselves as part of the placement process, adding
this self-analysis into the final placement decision.


Progress and achievement tests These tests are designed to measure the students’ language and skill progress in relation to the syllabus they have been following. How well have they learnt what they have been studying, and, as a result, what more still needs to be done?
Progress tests are often written by teachers. They can and should have a formative purpose (see 22.1) so that, based on the students’ performance in the test, teachers can decide what needs to be done in the future.
Achievement tests are given at the end of a course of study to see how well the students have learnt what they have been studying. Teachers and other test designers who construct these tests need to bear in mind the potential benefits and dangers of the washback effect (see 22.2.1). The tests need to reflect not only the language, but also the type of learning that has been taking place.
Proficiency tests Proficiency tests give a general ‘snapshot’ picture of a student’s knowledge and ability. They are frequently used for high-stakes public exams where a lot depends on how well the candidates do. They are used as goals that people have to reach if, for example, they want to be admitted to a foreign university, get a particular job or obtain some kind of certificate.
As we discussed in 22.2.1, proficiency tests have a profound washback effect.
Portfolio assessment Achievement tests and proficiency tests are both concerned with measuring a student’s ability at a certain time. Students only get ‘one shot’ at showing how much they know. The pressure this puts candidates under can make some of them anxious, so that they do not do their best in exam conditions. For this reason, many educators claim that ‘sudden death’ testing is unfair and does not give a true picture of how well some students could do in other circumstances. As a result, many educational institutions allow their students to assemble a portfolio of their work over a period of time (a term or a year, for example). The student can then be assessed based on three or four of the best pieces of work produced during this period.
Portfolio assessment of this kind has clear benefits. It provides evidence of student effort. It helps students become more autonomous, and it can ‘foster student reflection (and) help them to self-monitor their own learning’ (Nunes 2004: 334). Especially with written work, the students will have had a chance to edit what they have done before submitting their work, and this approach to assessment has an extremely positive washback effect.
However, portfolio assessment is not without its pitfalls. In the first place, it is time-consuming for students to build up their portfolios, and it means longer hours of evaluation for the teacher. Secondly, teachers will need clear training in how to select (or help the students to select) items from the portfolio and how to grade them. But, above all, when students work on their own, away from the classroom, it is not always clear whether the work reflects their own efforts or whether, in fact, they have been helped by others. This has made some people reluctant to trust such forms of assessment. Students themselves can be reluctant, too. Ricky Lam and Icy Lee found that although their students responded positively to the formative aspects of portfolio assessment and they enjoyed selecting work from their portfolios for summative assessment, they still preferred graded summative tests (Lam and Lee 2009). However, as with process writing (see 20.2.1), if we build cycles of revision, self-reflection and, perhaps, peer-assessment into portfolio tasks, they can form a good basis not only for grading, but also for assessment for learning (see 22.1). Such self- and peer-assessment, based on success indicators such as the CEFR ‘can do’ statements (see 5.4.2) and the Global Scale of English (see 5.4.3), involves students in the whole process of assessment and, as a result, encourages them to be more autonomous in their learning (see 5.5).

22.4 Test item types


There is a wide variety of different test item types available to language testers. These range from indirect test items, which target the knowledge of, for example, specific items of grammar or vocabulary, to more direct test items, which ask the students to perform direct language tasks, such as writing a letter.
Test experts have frequently made a distinction between discrete point and integrative test items. Whereas the former only test one thing at a time, the latter test a number of language points and skills in one test item. The summarising example in Figure 1 on page 410 is clearly an integrative test item since it measures not only the students’ ability to understand what they have read, but also their ability to put their understanding into words.
A test item which asks the students to fill in a blank with either a, the or nothing is clearly a discrete point item since it focuses exclusively on the use of articles.

22.4.1 Some typical test item types


Many public language tests are now administered digitally and, as a result, there have been some changes to test design. Prominent among these are the time limits which are set for various items. For example, the students may be given 40 seconds to read something and 25 seconds to respond to it in a spoken test (where they speak into a microphone and their response is recorded). In writing tests, an automatic word count (see, for example, the test item in Figure 1 on page 410) tells the students how well they are keeping to the length requirements. More important than these, perhaps, is the automated grading, which means that the tests are scored digitally and, it is claimed, more reliably (see 22.5.2). Digital tests also give an absolute reliability of test-taking conditions, since all the candidates will get exactly the same treatment.
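To make the idea of an automatic length check concrete, here is a minimal sketch in Python. It is purely illustrative: the word limits are invented for the example and do not come from any particular exam.

```python
# Minimal sketch of an automatic word-count check for a writing test.
# The word limits below are invented for illustration only.

def word_count_feedback(response: str, min_words: int = 200, max_words: int = 300) -> str:
    """Tell the candidate how their response length compares with the limits."""
    count = len(response.split())
    if count < min_words:
        return f"{count} words: below the minimum of {min_words}."
    if count > max_words:
        return f"{count} words: above the maximum of {max_words}."
    return f"{count} words: within the required range."

print(word_count_feedback("This is a very short answer."))  # well under the minimum
```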
Of course, many other tests are still pencil-and-paper affairs, especially where technology for digital assessment is unavailable. In the following examples of (mostly) indirect test item types, both digital and pencil-and-paper items are included.
Multiple-choice questions A traditional vocabulary multiple-choice question (MCQ)
looks like this:

The journalist was ______________ by enemy fire as he tried to send a story by satellite phone.
a wronged   b wounded   c injured   d damaged

MCQs are one of the most popular test instruments for measuring students’ knowledge of grammar and vocabulary, especially because they are easy to mark.
MCQs present a number of challenges, however. In the first place, they are extremely difficult to write well, especially in terms of the design of the incorrect choices (known as ‘distractors’). These distractors may actually put ideas into the students’ heads that they did not have before they read them. Secondly, while it is possible to train students so that their MCQ abilities are enhanced, this may not actually improve their English. There is always the danger that a difference between two student scores may be between the person who has been trained in the technique and the person who has not, rather than being a difference of language knowledge and ability.


MCQs are used for discrete-point testing (as in the example above), but they are also frequently used in more integrative tests, such as testing reading or listening comprehension, where the students have to choose the correct answer from one of four possibilities. Sometimes (especially in comprehension tasks), students may be asked to select a number of alternatives. For example:

Read the text and answer the question by selecting all correct responses. More than one response is correct.

Gap fill Many test items ask the students to complete sentences with words or phrases. For example:

Would you like ……….. to the cinema tonight?

Candidates need to be told whether they should write only one word or whether more than one word is possible/expected. In some cases, the words or phrases required might be listed in a box. In digital tests, candidates often have to drag and drop the appropriate items from the box into the correct blanks.
Transformation and paraphrase This is a common test item that asks the candidates to rewrite sentences in a slightly different form, retaining the exact meaning of the original. For example, the following item tests the candidates’ knowledge of verb and clause patterns that are triggered by the use of I wish:

I’m sorry that I didn’t get her an anniversary present.


I wish –––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– .

In order to complete the item successfully, the candidate has to understand the first sentence, and then know how to construct an equivalent which is grammatically possible. As such, these items do tell us something about the candidates’ knowledge of the language system.
Other transformation test types ask the students to rewrite sentences using (a form of) words given. For example:

We offer a ––––––––––––– of different types of coffee in our restaurant. SELECT

Reordering Getting students to put a set of jumbled words in the right order to make appropriate sentences tells us quite a lot about their underlying knowledge of syntax and lexico-grammatical elements. The following example is typical:

Put the words in order to make correct sentences.


called / I / I’m / in / sorry / wasn’t / when / you
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––


The biggest challenge for test designers is to find sentences where only one sequence is correct.
Reordering is sometimes used (on a bigger scale) for reading tests where the candidates have to put sentences or paragraphs in order.
Summarising As we saw in 22.2.1, summarising is a way of testing a student’s ability to understand and put that understanding into words. Another type of summarising test is administered through MCQs. Students read or listen to something and choose the correct summary from a number of alternatives. For example:

Figure 2 From the sample listening paper of the Pearson Test of Academic English

Other typical test items include choosing the correct verb form in sentences and passages (I (have arrived/arrived) yesterday), and finding errors in sentences (She noticed about his new jacket). All of these techniques offer items which are quick and efficient to score, and which aim to tell us something about a student’s underlying knowledge.

22.4.2 Skill-focused tests


When test designers and teachers construct items – especially direct test items – to test students’ skill knowledge, it is important that they create a level playing field. For example, candidates in a writing or speaking test might well complain if they were given the following essay question, since it unfairly favours those who have sound scientific knowledge and it presupposes a knowledge of twentieth-century scientific history:

Why was the discovery of DNA so important for the science of the twentieth century?


The following topic, on the other hand, comes close to ensuring that all the candidates have the same chance of success:

Some people think that children should wear school uniforms while others
believe that children should be able to choose what to wear to their lessons.
Discuss the advantages of both approaches and then give your opinion.

Testing the receptive skills of listening and reading also needs to avoid making excessive demands on the student’s general or specialist knowledge. Students should not be tested on their ability to understand technical information, for example – unless, perhaps, it is a CLIL test (see 1.2.3) – but, rather, on their understanding of English.
Apart from the test item types we detailed in 22.4.1, there are a number of other ways in which a candidate’s abilities in the four skills can be tested. In the following lists, the skills are called mostly speaking, mostly writing, etc. because skill testing is always, to some extent, an integrative mix.

Mostly speaking
• An interview during which the examiner questions a candidate about themselves.
• Information-gap activities where a candidate has to find out information either from
an interlocutor or a fellow candidate.
• Decision-making activities, such as showing paired candidates ten photos of
people and asking them to put them in order of the best- and worst-dressed for a
particular occasion.
• Compare-and-contrast activities in which candidates can both see a set of pictures
or where (as in many communication games) they have to find similarities and
differences without being able to look at each other’s material.
• Role-play activities where candidates perform tasks such as introducing themselves
or calling a theatre to book tickets.
• Description of a previously unseen image.
• Reading aloud: candidates have, say, 40 seconds to read a passage which they then
have to read aloud.
• Repeating a sentence: candidates have to repeat a sentence they hear as
accurately as they can.

Mostly writing
• Compositions and stories.
• ‘Transactional’ letters and emails, where candidates reply to a job advertisement,
write a complaint to a hotel or answer a friendly email, based on information given
in the exam paper.
• Information leaflets about their school or a place in their town.
• A set of instructions for some common task.
• Newspaper articles about a recent event.
• Reviews of plays, films, etc.


Mostly reading
• Multiple-choice questions to test comprehension of a text.
• Matching written descriptions with pictures of the items or procedure they describe.
• Transferring written information to charts, graphs, maps, etc. (though special care
has to be taken not to disadvantage non-mathematically-minded candidates).
• Choosing the best summary of a paragraph or a whole text.
• Matching jumbled headings with paragraphs.
• Inserting sentences or paragraphs provided by the examiner in the correct
place in the text.

Mostly listening
• Completing charts with facts and figures from a listening text.
• Identifying which of a number of objects (pictured on the test paper) is
being described.
• Identifying who (out of two or three speakers) says what.
• Identifying whether speakers are enthusiastic, encouraging, in
disagreement or amused.
• Following directions on a map and identifying the correct house or place.
• Listening to a sentence and then writing it as accurately as possible. (Dictation)
• Filling in the missing words of an audio text.

22.4.3 Young learner test item types


The test items in 22.4.1 and 22.4.2 are all designed for older children and adults. Testing young learners demands a different approach, whether the tests are computer- or paper-based.
Pictures The most obvious defining characteristic of young learner testing is the use of pictures (see Figure 3). Candidates can be asked questions about them; they can draw lines between objects in them; they can colour things, etc.

Figure 3 From Cambridge Young Learner Starter sample listening test

Ticks, crosses and smiley faces Young learner tests frequently ask the students to put ticks or crosses against pictures to identify them or the information in them. Test designers can ask them to choose smiley or frowny faces, etc.
Dragging, dropping, clicking Young learners can be asked, in digital tests offered on computer or mobile platforms, to drag and drop items into pictures or to click on appropriate pictures. They can select and click to colour items, or select from dropdown yes and no answers.


22.5 Writing and marking tests


At various times during our teaching careers, we may have to write tests for the students
we are teaching, and mark the tests they have completed for us. These may range from a
progress test at the end of a week to an achievement test at the end of a term or a year.

22.5.1 Writing tests


Before we do anything else, there are three main issues we need to address:
Objectives We need to be clear in our minds about why we will be asking the students to
take a test. If we wish to find out how well they have learnt what they have been studying,
we may well write a progress test (see 22.3). If we want information to help us to decide
what to do next, our test will be designed to find the students’ strengths and weaknesses, or
perhaps to see how well they will be able to cope with the work that we have planned to do.
In such cases, we will not base our test on what the students have studied, but on what they
will study in the future.
Our students need to have a clear understanding of the test objectives, too, and the
criteria for success. In other words, they need to know how the test is scored and what they
have to do to get good grades.
Context We need to remind ourselves of the context in which the test takes place. We have
to decide how much time can and should be given to the test-taking, when and where it
will take place, and how much time is available for marking. For example, there is no point
in designing a sophisticated and multi-faceted test if there is not enough time for it to be
graded properly.
Future action We need to have an idea of what we are going to do with the test results
once the test has been completed.

Once we are clear about the objectives of our test, the situation it will take place in
and what we will do with the results, there are a number of other things we need to
take into account:
Test content We have to list what we want to include in our test. This may mean taking a
conscious decision to include or exclude skills such as reading comprehension or speaking
(if speaking tests are impractical). It means knowing what syllabus items can be legitimately
included (in an achievement test), and what kinds of topics and situations are appropriate
for our students.
Just because we have a list of all the vocabulary items or grammar points the students have
studied over the term, this does not mean we have to test every single item. If we include a
representative sample from across the whole list, the students’ success or failure with those items
will be a good indicator of how well they have learnt all of the language they have studied.
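One simple way of drawing such a representative sample is sketched below in Python. This is only an illustration: the syllabus items and the sample size are invented, and it is not a prescription for how test content must be chosen.

```python
import random

# Illustrative only: a tiny 'syllabus' of vocabulary items studied over the term.
syllabus_vocabulary = [
    "although", "borrow", "complain", "deliver", "exhausted",
    "fetch", "grateful", "however", "invite", "journey",
]

# Draw a representative sample rather than testing every single item.
random.seed(42)  # fixed seed so the example is reproducible
test_items = random.sample(syllabus_vocabulary, k=4)
print(test_items)
```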
Balance If we are to include direct and indirect test items, we have to make a decision
about how many of each we should put in our test. A 200-item multiple-choice test with a
short real-life writing task tacked on the end suggests that we think that MCQs are a better
way of finding out about the students than more integrative writing tasks would be.
Balancing elements also involves estimating how long we want each section of the test
to take, and then writing test items within those time constraints. The amount of space and
time we give to the various elements should also reflect their importance in our teaching.


Scoring However well we have balanced the elements in our test, our perception of our students’ success or failure will depend upon how many marks are given to each section of the test. If we were to give two marks for each of our ten MCQs, but only one mark for each of our ten transformation items, it would mean that it was more important for the students to do well in the former than in the latter.
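The arithmetic behind such a weighting decision can be made explicit, as in the following sketch. The marks per item match the example above; the candidate’s raw scores are invented for illustration.

```python
# How marks-per-item weighting shapes the final score.
# Two marks per MCQ, one mark per transformation item, as in the example above;
# the raw scores themselves are invented.

mcq_correct = 7              # answered correctly, out of 10 MCQs
transformation_correct = 7   # answered correctly, out of 10 transformation items

mcq_score = mcq_correct * 2                        # 14 marks out of a possible 20
transformation_score = transformation_correct * 1  # 7 marks out of a possible 10

total = mcq_score + transformation_score           # 21 out of 30
print(f"Total: {total}/30")
print(f"MCQs contribute {mcq_score / total:.0%} of the marks gained.")
```

Even with identical performance on both sections, the MCQs account for two thirds of the marks gained, which is exactly the signal about relative importance described above.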
Trialling tests It is a good idea to try out individual items and/or whole tests on colleagues and other students before administering them to real candidates. This is especially important if the students’ grades are going to be recorded, or if the scores are going to count towards their final grades, for example. It is obviously less important when we give students short snap tests, especially those which have a primarily formative purpose (see 22.1).
In an ideal situation, we can ask fellow teachers to try out (or look at) items that we write. Frequently, these colleagues will spot problems which we are not aware of and/or will come up with possible answers and alternatives that we had not anticipated.
Later, having made changes based on our colleagues’ reactions, we can try out the test on some students. We will not do this with the students who are going to take the test, of course, but if we can find a class that is roughly similar – or a class one level above the proposed test – then we will soon find out which items cause unnecessary problems. We can also discover how long the test takes.
Such trialling is designed to avoid disaster and to yield a whole range of possible answers/responses to the various test items. This means that if other people finally mark the test, we can give them a list of possible alternatives and thus ensure reliable scoring.

22.5.2 Marking tests


Tests (especially public exams) are, increasingly, administered and graded digitally. Based on extensive trialling and measuring, using experienced scorers coupled with digital analysis, it is claimed that such grading is as reliable as – if not superior to – human marking. And, of course, it is in many ways more efficient, too.
One of the problems with human graders is that different people mark/score tests differently. There is often a great deal of marker subjectivity involved: where one person might give a particular candidate 8 out of 10 for a composition, another might give the same piece of writing only 6. Sometimes, this is due to different perceptions about what a good piece of writing should be. At other times, it may be because, as Sharon Hartle suggests, ‘assessors have their bad days, too, where they are tired, ill or worried about other matters’ (2009: 71).
But where human markers are still needed – when, for example, teachers are called upon to grade tests in their schools and colleges – there are a number of ways of making the scoring more reliable.
Training Scorers can be trained to grade candidates’ work effectively. In the first place, we can show them examples of candidates’ work at different levels (whether this involves written submissions or, for example, videos of oral tests) and suggest what score should be given in each case. They can analyse the scoring scales and rubrics (see below). We can get teachers into groups and give them all the same candidates’ work to grade. By comparing their scores with each other and then, subsequently, with the suggested grades which we may offer them, the graders can come together to establish a common understanding of how to score the tests appropriately.


More than one scorer Reliability can be greatly enhanced by having more than one scorer. The more people who look at a script, the greater the chance that its true worth will be located somewhere between the various scores that are given. Two examiners watching an oral test are likely to agree on a more reliable score than one.
Many public examination boards use moderators, whose job it is to check samples of each individual scorer’s work to see that it conforms with the general standards laid down for the exam.
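A minimal sketch of how two scorers’ marks might be combined, and how a moderator could be alerted to a large disagreement, is given below. The disagreement threshold is an invented assumption, not a rule used by any examination board.

```python
# Combining two scorers' marks and flagging scripts for moderation.
# The disagreement threshold is invented for illustration.

def combine_scores(score_a: float, score_b: float, max_gap: float = 2.0):
    """Average two scorers' marks and flag large disagreements for a moderator."""
    average = (score_a + score_b) / 2
    needs_moderation = abs(score_a - score_b) > max_gap
    return average, needs_moderation

# The 8 and 6 are the composition marks mentioned in 22.5.2.
average, flag = combine_scores(8, 6)
print(f"Agreed score: {average}, refer to moderator: {flag}")
```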
Using scales One way of specifying the scores that can be given to productive skill work is to use pre-defined descriptors of performance such as the CEFR (see 5.4.2) or the Global Scale of English (see 5.4.3). We can then design tests which ask the students to do the things which the descriptors suggest, and we can then grade them on whether they succeed. As we saw in 22.1, these scales also allow students to rate their own abilities and progress. But if, for whatever reason, we decide not to use published descriptors such as those mentioned above, we can design our own grading scales. These say what the students need to be capable of in order to gain the required marks, as in the following global assessment scale for oral ability:

Score  Description
0  The candidate is almost unintelligible, uses words wrongly and shows no sign of any grammatical understanding.
1  The candidate is able to transmit only very basic ideas, using individual words rather than phrases or fuller patterns of discourse. Speech is very hesitant and the pronunciation makes intelligibility difficult.
2  The candidate transmits basic ideas in a fairly stilted way. Pronunciation is sometimes problematic and there are examples of grammatical and lexical misuse and gaps which impede communication on occasions.
3  The candidate transmits ideas moderately clearly. Speech is somewhat hesitant and there are frequent lapses in grammar and vocabulary use. Nevertheless, the candidate makes him/herself understood.
4  The candidate speaks fairly fluently, showing an ability to communicate ideas without too much trouble. There are some problems of grammatical accuracy and some words are inappropriately used.
5  The candidate speaks fluently with few obvious mistakes and a wide variety of lexis and expression. Pronunciation is almost always intelligible, and there is little difficulty in communicating ideas.

Global assessment scales only give a general picture of a student’s ability, however. In order to try to ensure a more reliable measurement, we need to add more detailed descriptors to make our assessment more specific.
Analytic profiles With analytic profiles, marks are awarded for detailed elements which contribute to global scale descriptions.
For oral assessment, we can judge a student’s speaking in a number of different ways, such as pronunciation, fluency, use of lexis and grammar and intelligibility. We may want to rate their ability to get themselves out of trouble (repair skills) and how successfully they completed the task which we set them.


The resulting analytic profile might end up looking like this:

Criteria  Score (see analytic scales)
Pronunciation
Fluency
Use of vocabulary
Use of grammar
Intelligibility
Repair skills
Task completion

For each separate criterion, we can now provide a separate ‘analytic scale’, as in the following example for fluency:

Score  Description
0  The candidate cannot get words or phrases out at all.
1  The candidate speaks hesitatingly in short, interrupted bursts.
2  The candidate speaks slowly with frequent pauses.
3  The candidate speaks at a comfortable speed with quite a lot of pauses and hesitations.
4  The candidate speaks at a comfortable speed with only an occasional pause or hesitation.
5  The candidate speaks quickly with few hesitations.

A combination of global and analytic scoring gives us the best chance of reliable marking. However, a profusion of criteria may make the marking of a test extremely lengthy and cumbersome; test designers and administrators will have to decide how to accommodate the competing claims of reliability and practicality.
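One possible way of turning an analytic profile into a single overall mark is sketched below. The criteria follow the profile above, but the individual scores and the equal weighting are assumptions made for this example, not a required procedure.

```python
# Each criterion is scored 0-5 on its analytic scale; the scores below are
# invented, and equal weighting is an assumption of this sketch.

analytic_profile = {
    "pronunciation": 4,
    "fluency": 3,
    "use of vocabulary": 4,
    "use of grammar": 3,
    "intelligibility": 4,
    "repair skills": 3,
    "task completion": 5,
}

overall = sum(analytic_profile.values()) / len(analytic_profile)
print(f"Overall analytic score: {overall:.1f} out of 5")
```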
Scoring and interacting during oral tests Although speaking tests are increasingly being administered digitally, with claims being made for their superior efficacy and the reliability of their grading, the majority of oral tests still take place face to face. Scorer reliability in such face-to-face tests is helped not only by global assessment scores and analytic profiles, but also, perhaps, by separating the role of scorer (or examiner) from the role of interlocutor (the examiner who guides and provokes conversation). This may cause practical problems, but it will allow the scorer to observe and assess, free from the responsibility of keeping the interaction with the candidate or candidates going.
In many tests of speaking, students are now put in pairs or groups for certain tasks. It is felt that this will ensure genuine interaction and will help to relax the students in a way that interlocutor–candidate interaction might not. Some commentators, however, have worried that pairing students in this way leads them to perform below their level of proficiency, and that when students with the same mother tongue are paired together, their intelligibility to the examiner may suffer (Foot 1999: 52). Some students themselves have exactly these worries (Mok 2011), but there is considerable evidence to suggest that who a candidate is paired with does not, in fact, affect his or her ability to take the test effectively or to be scored appropriately (Figueras 2005, Bennett 2012).


22.6 Teaching for tests


Many teachers are familiar with the situation where their own beliefs in communicative language teaching, for example, are at odds with a national exam which uses an almost exclusively discrete-item, indirect testing procedure to measure grammar and vocabulary knowledge. This is similar to what Robin Walker and Carmen Pérez Ríu called (when discussing writing tests) ‘the incoherence between a process-oriented approach to teaching and a product-based approach to assessment’ (Walker and Pérez Ríu 2008: 18). There is always the danger that the washback effect of such tests (see 22.2.1) will give the students – and, perhaps, their parents – expectations about what teaching and learning should be like, and this may be difficult for teachers to deal with.
Many modern tests do not cause these kinds of problems, however, since they are grounded far more in mainstream classroom activities and methodologies than some earlier examples of the genre were. In other words, there are, as we saw in 22.1, many test items which would not look out of place in a modern lesson, anyway. And besides, even if preparing students for a particular test format is a necessity, ‘it is as important to build variety and fun into an exam course as it is to drive students towards the goal of passing their exam’ (Burgess and Head 2005: 1).
And we can go further: many teachers find teaching exam classes to be extremely satisfying in that where the students perceive a clear sense of purpose – and are highly motivated to do as well as possible – they are, in some senses, ‘easier’ to teach than students whose focus is less clear. When a whole class is working towards a particular exam, it can give the students ‘a target to aim for and is a great motivator’ (Naunton 2014: 31). Furthermore, in training our students to develop good exam skills (including working on their own, reviewing what they have done, learning to use reference tools – e.g. dictionaries, grammar books, the internet – keeping an independent learning record or diary, etc.), we are encouraging exactly those attributes that contribute towards autonomous learning (see 5.5).
Good exam-preparation teachers need to familiarise themselves with the tests their students are taking, and they need to be able to answer their students’ concerns and worries. They need to come up with classroom tasks that will best help their students to be successful when they take the test. This may involve making compromises in the ways they like to teach or, alternatively, it may involve explaining to the students the relationship between what they are doing in class and the positive impact it will have on their ability to pass the test.
But however much tests and ideal classroom practice do or do not match each other, there are a number of things that exam class teachers will want to do:
Train for test types We can give our students training to help them approach test items more effectively. As an example, for speaking tasks, we will equip them with appropriate negotiating language to help them get over awkward moments. When training our students to handle reading test items, we will discuss with them the best way to approach a first reading of the text, and how that can be modified on a second reading to allow them to answer the questions asked.
If the students are going to be asked to read aloud in a speaking test, they should be given chances to do this before they take the test. If short dictations are part of a listening test, candidates need to know about this and try dictations out.
In all this work, our task is to make the students thoroughly familiar with the test items they will have to face so that they give of their best, and so that, in the end, the test discovers their level of English, rather than having it obscured by their unfamiliarity with the test items.


Train for test rubrics Some candidates have problems with exam rubrics (the instructions about what to do for a question). This can happen whatever subject is being tested. We need to remind our students about the importance of reading the rubrics carefully and give them chances to practise this.
Discuss general exam skills Most students benefit from being reminded about general test and exam skills, without which much of the work they do will be wasted. For example, they need to read through the questions carefully so that they know exactly what is expected. They need to pace themselves so that they do not spend a disproportionate amount of time on only one part of an exam. In writing, for example, they need to be able to apply process skills (see 20.2.1) to the task. As they build up to an exam, they need to be able to organise their work so that they can revise effectively.
Do practice tests Some students get very anxious about taking tests. We can talk to them about this, and, by returning to the issue at intervals in the lead-up to the test, we can defuse the tension. One of the best ways of making students feel more relaxed about the experience is to give them opportunities to practise taking the test or exam so that they get a feel for the experience, especially with regard to issues such as pacing. At various points in a course, therefore, the students can sit practice papers or whole practice tests, but this should not be done too often since not only will it give teachers horrific marking schedules, but it will also be less productive than other test and exam preparation procedures.
Have fun As we said above, just because students need to practise certain test types does not mean this has to be done in a boring or tense way. There are a number of ways of having fun with tests and exams.
David Coniam, Mandy Lee Wai Man and Kerry Howard, for example, designed a board game to help students practise for a new oral test in Hong Kong (Coniam, Lee Wai Man and Howard 2011). When the students land on certain squares, they have to perform speaking test tasks; but they might also land on squares which have ‘fun’ tasks quite unrelated to the exam. Students can also have fun with practice tests by changing the gender of all the people in direct and indirect test items to see if the items still work and, if not, why not. They can be encouraged to write their own test items, based on language they have been working on and the examples they have seen so far. These new test items can now be given to other students to see how well they have been written and how difficult they are. This helps the students to get into the minds of their test and exam writers.
The examples above (and activities like them) show that teaching for tests need not be an endless and soul-destroying round of practice tests but can, instead – if we apply our usual pedagogical principles – be engaging and enjoyable.
