Research Notes 27
Contents
Editorial Notes 1
Cambridge ESOL and tests of English for Specific Purposes 2
Publishing vocabulary lists for BEC Preliminary, PET and KET examinations 4
Using simulation to inform item bank construction for the BULATS computer adaptive test 7
The comparability of computer-based and paper-based tests: goals, approaches, and a review of research 11
Modelling facets of the assessment of Writing within an ESM environment 14
Broadening the cultural context of examination materials 19
Research and development update 22
Conference reports 23
IELTS Masters Award 23
Editorial Notes
Welcome to issue 27 of Research Notes, our quarterly publication reporting on matters relating to
research, test development and validation within Cambridge ESOL. This first issue of 2007 brings
a ‘new look’ for Research Notes. We hope you will find the refreshed design attractive to look at
and easy to read, as well as a more convenient reference resource now that the contents list
appears on the back as well as the inside front cover. We would welcome your comments on the
new approach.
In this issue we focus on the theme of testing English for business and other work-related
contexts. In his opening article, David Thighe discusses Cambridge ESOL’s response to the
changing assessment requirements that are resulting from globalisation and migration. He
describes the growing demand for English language tests that are tailored to the needs of
populations in various work-oriented contexts, outlining some of the principles that underpin the
domain-related tests we offer, such as BEC, BULATS, ILEC and ICFE. Key issues include the notion
of specificity, the nature of authenticity and the role of content knowledge. Questions of
specificity and domain-related content also surface in the article by Jason Street and Kate Ingham,
who describe the process of compiling, validating, and publishing word lists for our BEC
Preliminary, PET and KET examinations.
Just as computer and other technologies are revolutionising the workplace and international
business communications, so they are impacting increasingly on the testing and assessment of
English within these contexts. The role of technology in developing a computer-based test such as
CB BULATS is explored by Louise Maycock, who describes the development of a tool that allows
us to examine how an adaptive computer test functions. On a related theme, Neil Jones and
Louise Maycock address the issue of comparability between the computer-based and paper-based
modes of tests such as BEC and BULATS. Stuart Shaw describes efforts to conceptualise
Cambridge ESOL’s Writing assessment as a workflow in terms of different facets within an
Electronic Script Management (ESM) environment; he shows how this developing technology
draws on databases and can benefit the assessment of writing performance.
Steve Murray reports on Cambridge ESOL’s recent initiatives to ensure the cultural accessibility
of our examination materials in a constantly changing international context. Issue 27 includes a
brief update on recent developments in the Asset Languages project as well as a conference
report on Cambridge ESOL’s involvement in the 2006 UK Language Testing Forum, which took
English for Specific Purposes as its theme. Finally we report news of the winner of the 2006 IELTS
Masters Award.
Editorial team for Issue 27: Lynda Taylor, Louise Maycock, Fiona Barker and Rowena Akinyemi.
©UCLES 2007 – The contents of this publication may not be reproduced without the written permission of the copyright holder.
2 | Cambridge ESOL: Research Notes: Issue 27 / February 2007
include contexts that are appropriate for all people working in that domain. For example, in the development of the International Legal English Certificate it was necessary to ensure that lawyers were not at a disadvantage if they were only trained in the Anglo-American common law system and not civil law. As this content element of test tasks is either …

[Figure 1: Degree of Specificity, from Non-Specific to Very Specific, plotted against an increase in test of content knowledge and a decrease in generalisability]

Clapham (1996) also noted, and it is referred to in the quote above, that the role of candidates' content knowledge was related to the specificity of the test; in other words, the more specific an ESP test is, the more content knowledge and language knowledge are interconnected. This concept of a cline of levels of specificity is taken up by O'Sullivan and is illustrated in Figure 1. Here a completely specific language test is defined as one that would focus only on language unique to a specific use domain. As the specificity of the test increases, so does the degree to which content knowledge is tested. Also, as specificity increases, the generalisability of the test results decreases. We may posit that examinations such as the First Certificate in English are non-specific, while examinations with a business orientation such as BEC are more so. An examination of English in a legal environment such as ILEC is likely, if it is well constructed, to be even more specific.

O'Sullivan sees the key for assessing situational and possibly interactional authenticity in defining test specificity as multi-dimensional and, importantly, measurable. He provides a hands-on method for examining test specificity that Cambridge ESOL is currently exploring in relation to our own ESP tests. This will allow us to ascertain whether certain components or tasks within a test have the required level of specificity from a number of different facets such as, for example, lexical precision, structural range and text length.

Conclusion

This outline of current thinking on ESP testing has shown that developing such tests raises a number of critical issues, such as the role of content experts in test production, levels of test specificity, and the inseparability of content and language knowledge; although
methodologies for addressing these issues are steadily emerging, they continue to present language testers with theoretical and practical challenges. Other issues, such as the role of content specialists in devising marking criteria (Douglas's indigenous assessment criteria, Douglas 2000) and placing ESP tests on a largely non-context based framework such as the CEFR, remain to be dealt with. As the demand grows for tests of ESP, there will be plenty for language testers to engage with.

References and further reading

Bachman, L F and Palmer, A (1996) Language Testing in Practice, Oxford: Oxford University Press.

Clapham, C M (1996) The Development of IELTS: A study of the effect of background knowledge on reading comprehension (Studies in Language Testing, volume 4), Cambridge: UCLES/Cambridge University Press.

Corkill, D and Robinson, M (2006) Using the global community in the development of ILEC, Research Notes 25, 10–11.

Douglas, D (2000) Assessing Languages for Specific Purposes, Cambridge: Cambridge University Press.

Douglas, D (2001) Three problems in testing language for specific purposes: authenticity, specificity, and inseparability. In Elder, C, Brown, A, Grove, E, Hill, K, Iwashita, N, Lumley, T, McNamara, T, and O'Loughlin, K (Eds), Experimenting with Uncertainty: Essays in honour of Alan Davies (Studies in Language Testing, volume 11), 45–52, Cambridge: UCLES/Cambridge University Press.

Graddol, D (2006) English Next: Why Global English may mean the end of 'English as a Foreign Language', The British Council. Available at: www.britishcouncil.org/learning-research-englishnext.htm

Ingham, K and Thighe, D (2006) Issues with developing a test in LSP: the International Certificate in Financial English, Research Notes 25, 5–9.

O'Sullivan, B (2006) Issues in Testing Business English: The revision of the Cambridge Business English Certificate (Studies in Language Testing, volume 17), Cambridge: Cambridge ESOL/Cambridge University Press.

Thighe, D (2006) Placing the International Legal English Certificate on the CEFR, Research Notes 24, 5–7.

Weir, C J (2005) Language Testing and Validation: An Evidence-Based Approach, Hampshire: Palgrave Macmillan.
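O'Sullivan's method for measuring facets of specificity is not set out in detail here, but the general idea of measurable facets can be illustrated with simple text statistics. The sketch below is purely illustrative and is not the method Cambridge ESOL is exploring: the particular measures (type-token ratio as a crude proxy for lexical range, mean sentence length, text length) and the sample task text are our own assumptions for the sake of example.

```python
# Illustrative only: simple, measurable text statistics of the kind that
# could feed a multi-dimensional profile of a test task. These measures
# are assumed examples, not O'Sullivan's actual facets.
def task_profile(text):
    words = [w.strip(".,!?") for w in text.lower().split()]
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {
        "text_length": len(words),                         # facet: text length
        "type_token_ratio": len(set(words)) / len(words),  # crude lexical-range proxy
        "mean_sentence_length": len(words) / len(sentences),
    }

# A hypothetical legal-domain task prompt
legal_task = "The parties shall indemnify the supplier. Liability is limited."
print(task_profile(legal_task))
```

Profiles like this, computed per task, would allow components of a test to be compared on each facet separately rather than judged "specific" or "non-specific" as a whole.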
and Waystage lists, as was the BEC Preliminary list, with the inclusion of business vocabulary considered appropriate to the level. The business domain lexis was added following work carried out by external specialists in the field of Business English. The three lists have been gradually extended over the years through an annual review process, with the addition of items felt to be of need to candidates as a result of changes in usage in the language, and are directly informed by reference to the CLC (see description of the process below). Usage in vocabulary can change quickly, as is shown by the growth in the last decade of the use of technology-related language and associated lexis.

The annual review procedure

New words proposed for inclusion in the lists are subject to an established procedure used in the annual review of the content of all the word lists. (For a full description of the annual word list review procedure, see Ball 2002.) Firstly, the frequency of occurrence of these words is established in a range of native speaker corpora. For BEC Preliminary, the following corpora are used:

• British National Corpus (BNC) (written and spoken native speaker English, including business-oriented words)
• Business Texts Corpus (based on US and UK business articles)
• Trial Web Corpus (words of contemporary Business English taken from the internet).

Next, the list of suggested words is compared against CLC-derived lists that illustrate the written production of a large number of candidates taking BEC at different levels. Each word is tested against various frequency criteria and the resulting data guide the final discussion as to whether words are added to the list or not. The significance of the data is that if the CLC shows that candidates at BEC Vantage level are using a word productively in writing, this strengthens the case for its inclusion on the BEC Preliminary vocabulary list.

Nonetheless, data from learner corpora, if used as the principal basis for a word list, does have some limitations. Firstly, the CLC consists of the written output of exam candidates, and such productive vocabulary does not provide a reliable guide to receptive vocabulary knowledge, which is likely to be different in size and nature. Secondly, the frequent occurrence of some words in the corpus may well be a result of 'task effect' – if candidates are asked to describe their family and home in a PET Writing task or make meeting arrangements in a BEC Preliminary Writing task then, naturally, we would expect to find a high occurrence of vocabulary related to these topics in the data gathered from such tasks. However, the CLC is only one of the corpora used to inform the vocabulary included in the lists. Furthermore, as the CLC is growing at a very fast rate, with data from recent candidate scripts added on a regular basis, any 'task effect' decreases with time.

Until early 2006 the lists were for the reference use of question paper materials writers only. With the decision to make the lists publicly available, there was a concern to ensure regularity of format across all three. A Cambridge ESOL working group was therefore set up in 2005 with the aim of reviewing the BEC Preliminary, PET and KET item writer vocabulary lists prior to their publication on the ESOL Teaching Resources website.

Initial aims and considerations

A number of issues were discussed within the working group. Members of the group were in agreement on the need for simplicity: the word lists would need to be accessible to teachers from a wide variety of backgrounds. At the same time, the lists should be optimally helpful and display consistency of convention from list to list. To ensure that the lists were simple to use, lexical items were chosen as headwords. Only base forms would be listed and not their inflections. Common prefixes and suffixes would be listed in an appendix, and words formed with them not included in the body of the main list itself. No examples of word use would be given, except where these served to distinguish words with multiple meanings.

There was lengthy discussion on how compounds and multi-verb or phrasal verbs could be most helpfully listed. It was finally decided that where the meaning of the compound was literal and formed from two words which already appeared on the list (e.g. sports centre and bank account), they would not be listed separately. Compounds where the meaning was not transparent (e.g. shortlist) or where it might be difficult to grasp because of cultural trappings (e.g. left luggage) were listed separately. Any multi-part verbs which were felt to be 'sum-of-parts', i.e. where their meaning could be clearly understood from their constituent parts (e.g. pay back), and where these constituent words were already included in the lists, were not included in the main list. Non-literal phrasal verbs (e.g. put off meaning postpone) were listed in the main body of the list. It was also agreed that the names of products which were, in effect, trademarks (e.g. Coca-Cola) should be removed.

A number of closed vocabulary sets, such as numbers, days of the week and names of countries, would be included in appendices to all lists, rather than appearing within the main lists. There was also a decision to be taken about topic lists (such as The Weather) which existed as appendices to the PET and KET vocabulary lists. It was agreed that topic lists should be retained as appendices to these lists as they may be found to be helpful to teachers. The BEC Preliminary list, however, did not include any topic lists as it was based mainly on the vocabulary of business, which was considered to be a specific domain in itself (Horner and Strutt 2004).

Each list would be prefaced by an introduction highlighting a number of points. Users would need to know that the lists were intended as reference tools, as guides to the vocabulary a candidate might need to know and recognise, and were not intended to be a definitive collection of all the lexical items an elementary or pre-intermediate student of English should know or encounter. A further caveat was that the vocabulary included covered that needed for both receptive and productive tasks on the particular examination. There was also a risk that some users might view the word lists as forming a vocabulary
syllabus1: some teachers might, for example, teach their students only the words the lists contained. Users were to be reminded that language learners need to be equipped with approaches to deal with unfamiliar vocabulary, and were referred to the teaching resources section on the ESOL website, which contains advice on strategies learners should use in order to understand and learn new vocabulary when reading texts. Learners also need to develop their own personal vocabularies: to learn specific words and phrases which might not be in the word lists in order to talk about their individual lives and particular personal circumstances.

1. The wordlists for the Young Learners' English (YLE) Tests do form a syllabus, but teachers are recommended nonetheless to teach words beyond the lists. See forthcoming Research Notes 28 for an article on the YLE wordlist review.

Methodology

The initial work on reviewing the word lists was carried out by external consultants from the BEC Preliminary, PET and KET item writer teams: people with the greatest familiarity with the lists and their contents. The content of the existing item writer vocabulary lists was checked carefully for inconsistencies: individual items were cross-referenced against decisions on the format of the main list that had been reached and, if necessary, removed. It was discovered, for example, that a number of words included within one of the lists were covered by the appendix in which affixes appeared.

Finally, a comparison study involving the draft word lists was done using WordSmith Tools to check the three lists for overlap. The project aimed to investigate the degree of overlap between each list. Overlap was anticipated between all three lists, but given the different development of all three lists over time, the degree of overlap was unknown.

Findings

The comparison study in fact revealed a significant overlap between the KET, PET and BEC Preliminary lists. As shown in Table 1, 1061 words were common to the KET and PET lists and 1949 were common to the PET and BEC Preliminary lists. This was not unexpected, as the lists did have common origins.

Table 1: Overlap between the KET, PET and BEC Preliminary word lists

                       KET & PET    PET & BEC    KET & BEC    KET, PET & BEC
Total words per list   1111  2708   2708  2995   1111  2995   1111  2708  2995
% of each list         95.6  39.2   69.5  65.1   75.1  27.8   72.7  29.8  27.0

It did, however, leave a substantial minority of the words on the BEC Preliminary list without any overlap on the PET list. (55 words occurred only on the KET list, 754 words occurred only on the PET list, and 1004 occurred only on the BEC Preliminary list.) This may appear surprising, as both the latter examinations represent B1 level on the Common European Framework of Reference (CEFR) for Languages. This can partly be explained by the factor of domain knowledge. Business English (BE) covers a range of situations, and the domain knowledge which is assumed in Business English varies. Lexis is an important difference between the two and is partially related to this question. Business English includes lexis not included in General English (GE): firstly, words and phrases which refer to general concepts but which would not normally be used in everyday non-business situations, e.g. purchase, objective, consumer. These words can be considered as General English vocabulary at a higher level; for example, purchase is a BEC Preliminary word for Business English, but intermediate (FCE) for General English. Secondly, there are words for general business concepts which arguably fall outside the domain of General English at lower/intermediate levels, e.g. invoice, export, freight. Furthermore, the lexis found only on the BEC Preliminary list appeared to fall within a fairly limited number of semantic categories. These included institutions and places of business, money, business people and events, modes of communication, products and companies, and lexis connected with technology. In addition, the overlap exercise revealed that the BEC list included very little lexis used to describe feelings, society, family, home, personal issues and personal activities, but was concerned to a far greater extent with the 'impersonal'. The 1004 words which occur in the BEC Preliminary list only could be an indicator of business-focused B1 level vocabulary, but further research is needed to substantiate this.

Conversely, the PET list included lexis not covered by BEC Preliminary, in that there is a much more extensive range of vocabulary for clothes, food, animals, sport, health and entertainment. The following are examples of words appearing on the PET vocabulary list but not BEC Preliminary: carrot, camel, volleyball and opera.

There were also cases of the same word appearing on both lists but which needed to be exemplified differently to indicate that only a certain use of it would be expected in the respective examination. Market on the BEC list features as noun and verb parts of speech, with examples such as There's a good market for fax machines here and They marketed their goods overseas. On the PET list, market appears as noun only, with the example of buying something from a market.

In March 2006, the three word lists were posted on the relevant exam page of the Cambridge ESOL Teaching Resources website. The BEC Preliminary list consisted of 2995 headwords; PET had 2708 and KET had 1111 (plus topic lists). The published lists include the vast majority of lexical items expected in these exams and, albeit to a more
limited extent, the likely output of candidates. The selection of words that they contain is the result of the collective experience of ESOL external consultants and writers, feedback from examiners and course providers over a number of years, and the analysis of data from native speaker corpora and the learner corpus, the CLC. As such, the word lists may be considered useful reference tools in the preparation of BEC Preliminary, PET and KET candidates, and in the development of course materials for these examinations.

Future developments

The content of the lists will continue to be reviewed annually, and items which are possibilities for inclusion or removal will, as the CLC is developed further, be subject to an ever more sophisticated corpus-based analysis. More generally, the work on the word lists contributes to on-going Cambridge ESOL research into the differences between 'receptive' and 'productive' vocabulary, across different levels and for different types of English. Using a number of different corpora, including the CLC, and with reference also to ESOL's computerised bank of items for receptive data, LIBS (see Marshall 2006), it is hoped that it may be possible in the future to identify more clearly the range of words that an average candidate at any level should be able to recognise or produce (see Ball 2002). This work will, in turn, feed into a larger collaborative project (English Profile) to produce a comprehensive reference level description for English.

A possible area to explore, one which emerged as a result of the comparison of the final word lists, is whether words occurring in one of the lists and not in another could help to better define 'Business' English at the B1 level. This could inform the on-going validation of the construct underpinning Cambridge ESOL Business English tests and examinations.

References and further reading

Ball, F (2002) Developing wordlists for BEC, Research Notes 8, 10–13.

Barker, F (2006) Corpora and language assessment: trends and prospects, Research Notes 26, 2–4.

Cambridge ESOL Teaching Resources website: www.cambridgeesol.org/teach

Hindmarsh, R (1980) Cambridge English Lexicon, Cambridge: Cambridge University Press.

Horner, D and Strutt, P (2004) Analysing domain-specific lexical categories: evidence from the BEC written corpus, Research Notes 15, 6–8.

Marshall, H (2006) The Cambridge ESOL Item Banking System, Research Notes 23, 3–5.

McCarthy, M and Carter, R (2003) What constitutes a basic spoken vocabulary? Research Notes 13, 5–7.

Van Ek, J A and Trim, J L M (1991a) Threshold 1990, Strasbourg: Council of Europe Publishing.

Van Ek, J A and Trim, J L M (1991b) Waystage 1990, Strasbourg: Council of Europe Publishing.

West, M (1953) A General Service List of English Words, London: Longman.

WordSmith Tools website: www.lexically.net/wordsmith
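The overlap comparison described under Findings was carried out with WordSmith Tools, but the underlying operation is simple set arithmetic. The sketch below is a minimal illustration of that comparison; the three tiny word lists are invented stand-ins for the published KET, PET and BEC Preliminary headword lists, not extracts from them.

```python
# Tiny invented stand-ins for the published headword lists.
ket = {"bank", "buy", "market", "sport", "week"}
pet = {"bank", "buy", "market", "sport", "carrot", "opera"}
bec = {"bank", "buy", "market", "invoice", "export"}

def overlap_report(lists):
    """For each pair of lists, report the number of shared headwords and
    the share of each list the overlap represents (as in Table 1), plus
    the count of headwords unique to each list."""
    report = {}
    names = list(lists)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = lists[a] & lists[b]
            report[(a, b)] = {
                "common": len(common),
                f"% of {a}": round(100 * len(common) / len(lists[a]), 1),
                f"% of {b}": round(100 * len(common) / len(lists[b]), 1),
            }
    for a in names:  # words occurring on one list only
        others = set().union(*(lists[b] for b in names if b != a))
        report[(a, "only")] = {"count": len(lists[a] - others)}
    return report

print(overlap_report({"KET": ket, "PET": pet, "BEC": bec}))
```

With the real lists loaded as sets of headwords, the same few lines reproduce both the pairwise overlaps and the "occurred only on one list" counts reported above.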
have been estimated (calibrated) and mapped onto a common scale. The techniques which make this possible derive from Item Response Theory (IRT), which Wainer (2000:13) refers to as 'the theoretical glue that holds a CAT together'. IRT comprises a family of mathematical models which are used to calculate the probability of a candidate of given ability answering a particular item correctly. The simplest of the IRT models is the Rasch model, which calculates this probability as a function of the difference between the ability of the candidate and the difficulty of the item. The use of Rasch analysis makes item banking possible by enabling all items to be calibrated on the same difficulty scale. Any combination of items can then be selected from the bank and, because the difficulty of each of those items is known with some precision, it is possible to estimate the ability of the candidate based on their responses to that selection of items. This is an essential requirement for CATs because, since each candidate receives a different selection of items, comparisons cannot be made on the basis of a 'number correct' type score.

As well as enabling ability to be estimated on completion of the test, IRT is also used throughout a CAT to estimate ability after each task has been administered, in order to inform selection of the next task.

The CB BULATS test algorithm

The CB BULATS test algorithm comprises three main components: task selection, stopping criteria, and ability calculation. As already explained, the ability of the candidate is estimated throughout the test, after the response to each task has been submitted. As more information is obtained, the ability estimate is gradually refined, so that the Standard Error of Measurement (SEM) of the estimate is reduced throughout the duration of the test. Theoretically, a CAT would end once a desired level of precision has been reached, characterised by the SEM falling below a certain level. However, for practical reasons this criterion cannot be the only method of stopping the test, as it would result in an indeterminately long test for some candidates who perform more erratically or have an ability level at either of the extremes of the distribution. The CB BULATS algorithm employs several criteria, so that the current section comes to an end when any one of the stopping rules has been satisfied. These criteria include a target SEM, a maximum number of items, and also a time limit which test administrators may choose to impose if they so wish.

Task selection

The selection of the next task to be administered is the most complex part of the CB BULATS algorithm. The most common methods involve selecting the task contributing the maximum amount of information at the current ability estimate of the candidate, or some similar statistical approach. For the Rasch model, the maximum amount of information is contributed by items with difficulty estimates closest to the ability estimate. However, whilst this strategy would be most efficient in terms of producing a test with maximum precision, there are other considerations relating to the validity and practicality of the test which must be dealt with by the algorithm. For example, balancing the range of task types administered is an important aspect of construct validity, and balancing the exposure of material is equally important for the security of the item bank. In order to account for these considerations, the CB BULATS task selection procedure involves selecting a subset of potential tasks, which are of the right type and fall within a certain difficulty range, and multiplying together a series of weights which work to control material exposure, maximise targeting as far as possible, and so on. A task is then selected randomly from the top few tasks, sorted in decreasing order of overall weight.

Construction of CAT item pools

The adaptive algorithm implemented in CB BULATS is clearly very complex but, as Flaugher (2000:38) points out, 'the best and most sophisticated adaptive program cannot function if it is held in check by a limited pool of items, or items of poor quality'. This article is not intended to address the quality of items: CB BULATS material goes through the same rigorous quality control procedures as other Cambridge ESOL examinations, as detailed elsewhere (e.g. Marshall 2006). However, construction of item pools for CB BULATS is an important issue. The term item pool is used here to distinguish the material selected for a particular test version from the master item bank, which contains all of the material available for selection. New versions of CB BULATS are produced periodically, containing different test material and the potential to manipulate parameters of the adaptive algorithm if it is considered necessary.

Constructing an item pool for an adaptive test poses interesting problems not faced in the construction of linear tests where, given a sample size, the exposure of each item in the test is pre-determined and equal. In an adaptive test this is not the same, because the exposure of a given task depends on a range of factors relating to the likelihood and frequency of its selection.

The primary considerations when constructing a new pool of material are that there is a sufficient range of task types across each of the levels for the test to perform well, and that the exposure of material is controlled by ensuring more material is available for the task types and levels which are likely to be utilised more often. Sometimes these considerations may be conflicting when constructing a pool of predetermined size. For example, Flaugher (2000:42) suggests that 'a satisfactory pool of items for adaptive testing is one characterized by items with … a rectangular distribution of difficulty'. Whilst this would ensure that sufficient material was available such that the test would perform equally well across all levels, it does not take into account constraints on exposure of material. Even if the test taker population had a rectangular distribution of ability, exposure would still be greater for the tasks in the middle of the difficulty range, because this is where the tasks at the beginning of the test are selected from, until more information is obtained about the candidate's ability. If the population were normally distributed then exposure levels of mid-range tasks would be further increased
[Figure: flowchart of the test algorithm – 4. Revise proficiency estimate → 5. Is stopping rule satisfied? (NO: continue; YES →) 6. End of section? → 7. End of test? (NO: 8. Administer next section; YES: 9. Stop)]
because most of the candidates would fall in the middle of the range. Hence, to control exposure levels it would be preferable to construct the item pool such that most of the tasks fall in the middle of the difficulty range, with fewer tasks of extremely high or low difficulty. It is essential that these competing issues be considered when constructing a new item pool.

The CAT Simulator

When a new item pool has been constructed, checks need to be made to ensure that the test will perform sufficiently well for candidates of different levels and that exposure rates for any individual task will not be too high. Since the utilisation of tasks cannot easily be predicted, the only way to address these issues is by the use of simulation. The Cambridge ESOL CAT Simulation Tool has been designed to mimic precisely the algorithm implemented in CB BULATS. It works by generating a sample of candidate abilities, based on population characteristics specified by the user (i.e. the probability distribution and parameters of that distribution). The test is then simulated for each of the 'candidates' generated. Because their 'true' ability is known (since it was generated), the Rasch model can be used to calculate the probability of a correct response to a given item. In order to model the probabilistic nature of a candidate's response, a random number between 0 and 1 is generated. If the generated number is less than the probability of a correct response then the candidate is deemed to have answered the item correctly; otherwise an incorrect response is recorded. For example, say the probability of the candidate responding correctly was calculated to be 0.6. Next, generate a random number; the simulator records a correct response if the generated number is less than 0.6. This simulation approach is similar to that described by Luecht et al (1998).

Aside from the generation of candidate responses, everything else in the simulator – task selection, stopping rules, ability estimation and so on – is exactly the same as in the actual test. Simulating a large number of candidate performances therefore allows us to predict how the test will perform and how the item pool will be utilised.

Use of simulation information for item pool construction

After running a simulation, a range of information can be extracted from the simulator in order for recommendations to be made regarding amendments to the item pool – for example, specific task types or levels where there is insufficient material or, indeed, a surplus of material (where it may be appropriate to retain some tasks for a future item pool).

Assessing the precision of the test

Targeting plots like that in Figure 2 are produced for candidates at each level, to show how the average difficulty corresponds to the average estimated ability during the test sequence. Error bars are plotted to represent the standard error of the ability estimates. Note that the test is in two sections: the Reading and Language Knowledge section ends after around 32 items and is then followed by the Listening section (hence the noticeable dip in the middle of the plot as the current ability estimate returns to the centre of the distribution). Figure 2 shows the ability estimate steadily increasing for this group of candidates, and the item
between 0 and 1. The probability of this number being less difficulties can be seen to be increasing correspondingly.
than 0.6 is 0.6, so this has the same probability of Also, the error bars show how the SEM of the ability
occurring as does a correct response: thus, record a correct estimate is reduced during the course of each section.
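The response-generation step described above – a uniform random draw compared against the Rasch probability of success – can be sketched as follows. This is an illustration only: the item-selection and ability-update rules here are deliberately crude stand-ins for the real CB BULATS algorithm, and the pool of item difficulties is invented.

```python
import math
import random
from collections import Counter

def p_correct(ability, difficulty):
    """Rasch model: P(correct) = 1 / (1 + exp(difficulty - ability))."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def simulate_test(true_ability, pool, test_length, rng):
    """Administer a short adaptive test: pick the unused item whose
    difficulty is closest to the current ability estimate, then draw
    the response as described above (uniform draw < Rasch probability
    means a correct response). The shrinking-step update of the
    estimate is a crude stand-in for real Rasch ability estimation."""
    estimate, used = 0.0, []
    for step in range(test_length):
        item = min((i for i in pool if i not in used),
                   key=lambda i: abs(pool[i] - estimate))
        used.append(item)
        correct = rng.random() < p_correct(true_ability, pool[item])
        estimate += (0.5 if correct else -0.5) / (step + 1)
    return used

def exposure_rates(pool, n_candidates=500, test_length=5, seed=1):
    """Simulate many candidates drawn from a normal population and
    report the proportion of tests in which each item appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_candidates):
        ability = rng.gauss(0.0, 1.0)
        counts.update(simulate_test(ability, pool, test_length, rng))
    return {item: counts[item] / n_candidates for item in pool}

# Hypothetical 10-item pool with difficulties spread over the logit scale
pool = {f"item{k}": d
        for k, d in enumerate([-3, -2, -1.2, -0.5, 0, 0.4, 1, 1.6, 2.2, 3])}
rates = exposure_rates(pool)
```

Run over many simulated candidates, tallies of this kind yield the exposure-rate information discussed in the article: with a normally distributed population, items targeted at the middle of the difficulty range are administered most often.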
[Figure 2: Targeting graph for B2 level candidates – average difficulty and average ability (in logits, with standard error bars) plotted against item sequence, items 1–67]
The distribution of the SEM is compared for candidates at each level, and classification statistics, which indicate the percentage of candidates correctly classified into their ‘true grade’, are also produced. A plot of SEM over time (i.e. from the first candidate to the last) is also produced in order to ensure that constraints on material exposure, which effectively mean that there is more freedom for the system to select the better targeted items for earlier candidates, do not impact on the reliability of the test for later candidates. Together this information gives an overall picture of how well the test is performing for candidates at each of the levels and highlights if there is insufficient material at certain levels.

Algorithm performance

Features of algorithm performance are studied to ensure that the test will function as expected for a given item pool. This information includes summarising how the stopping criteria are used (e.g. what proportion of tests end because the desired level of precision is reached), the average number of items per section, how often a line of the test specification had to be skipped because there were insufficient tasks of the specified type available, and how far the selection of task types reflected the preference for a given task type in each part of the test specification.

Use of material

An important reason for simulation is to check the expected exposure levels for different tasks. Information on the minimum, maximum and average rate of exposure of tasks of each task type and level is produced. Types and levels where these figures are high represent areas where more material may be required in order to prevent over-exposure and reduce the risk of material security being compromised. Graphs of the distribution of task types for each level of candidate are also produced to ensure that all levels receive an adequate range of task types. Again, deficiencies in the item pool may be detected if the information produced is not as expected.

Conclusion

Simulation plays a vital role in the construction of item pools for adaptive tests, allowing careful assessments of likely test performance and material use to be made so that the contents of an item pool may be revised a number of times before being produced as a new CB BULATS test version. The simulator also has a number of wider uses, such as allowing the effects of amendments to algorithm parameters to be assessed, or even testing the effects of proposed changes to the algorithm itself, and also in informing the design of research projects by enabling recommended sample sizes to be estimated. The CAT Simulation Tool has therefore become an essential part of the CB BULATS test production toolkit.

References and further reading

Beeston, S (2000) The UCLES EFL Item Banking System, Research Notes 2, 8–9.

Flaugher, R (2000) Item Pools, in Wainer, H (Ed.) Computerized Adaptive Testing: A Primer, New Jersey: Lawrence Erlbaum Associates, 37–58.

Luecht, R, de Champlain, A and Nungester, R (1998) Maintaining Content Validity in Computerized Adaptive Testing, Advances in Health Sciences Education 3, 29–41.

Marshall, H (2006) The Cambridge ESOL Item Banking System, Research Notes 23, 3–5.

Thissen, D and Mislevy, R (2000) Testing Algorithms, in Wainer, H (Ed.) Computerized Adaptive Testing: A Primer, New Jersey: Lawrence Erlbaum Associates, 101–132.

Wainer, H (2000) Computerized Adaptive Testing: A Primer, New Jersey: Lawrence Erlbaum Associates.
Comparability and fairness

The discussion so far suggests that maximizing comparability and minimizing format-specific effects is the best guarantee of fairness; however, this is to overlook the fact that comparability has a human dimension too. There is a largely generational gap between those who prefer to write on computers and those who prefer to write on paper. This profound difference undermines a traditional principle of assessment: that fairness is ensured by carefully standardising procedures. Imposing a standardised approach to the testing of extended writing may be held to effectively advantage or disadvantage a proportion of candidates. Standardisation is thus no longer a guarantee of fairness. Offering alternative CB and PB formats of a test may thus be seen in some circumstances as enhancing fairness and validity.

There is therefore a strong argument to be made that the traditional formulation of the comparability question – that it should be ‘a matter of indifference to the examinee whether the test is administered on computer or paper’ (Bennett 2003:i) – is outdated, at least for tests of certain skills. Generally, we should agree that fairness will be best served by ensuring that candidates can choose the test format which they believe allows them to demonstrate their ability to the full – a ‘bias for best’ approach. Adopting such an approach does not free us from the responsibility of studying format-specific effects, trying to remove or mitigate them where possible and desirable, and reporting our research to end users. But it is part of the realistic view we need to take of the practical and conceptual limitations on comparability.

In the next section we discuss the nature of comparability as a function of the relatedness of tests. Then in the remainder of the paper we review the history of empirical comparability studies and the findings of one recent Cambridge ESOL study concerning CB and PB formats of IELTS.

Levels of comparability

Mislevy (1992) makes the point that differences in the construct of what is tested and how it is tested limit the extent to which tests can be compared in a purely statistical way. Mislevy describes four levels of linking, of which we mention three here: Equating allows tests to be used interchangeably, but is possible only if two tests have been constructed from the same test specification to the same blueprint. Calibration can link two tests constructed from the same specification but to a different blueprint, which thus have different measurement characteristics. Relating this to the CB-PB comparison, equating should be possible where both formats administer the same range of tasks from the same item bank and in the same manner, e.g. as a linear test with a particular time limit. CB and PB BEC should be comparable in this way. Calibration would cover the situation where, for example, the same item bank is drawn from but the CB test is adaptive and the PB test linear. This is the case with CB and PB BULATS. These are both strong forms of statistical linking which allow a priori setting of comparable standards. Nonetheless, empirical studies are still necessary to see whether in practice there are construct-irrelevant, format-specific effects.

Projection is an option where constructs are differently specified – tests do not measure ‘the same thing’. It aims at predicting learners’ scores on one test from another. It might be exemplified by the way that the US testing agency ETS have linked the internet-based, computer-based and paper-based formats of TOEFL by providing tables of score comparisons. These are offered as a guide, with the caveat that ‘differences in the tests can make it difficult to establish exact comparisons’ (ETS 2005). In such a situation both construct-relevant and format-specific effects combine to complicate comparison, and ETS rightly encourage receiving institutions to set standards which work satisfactorily for their specific situation.

Empirical evidence

Historically, studies relating to comparability of test forms have taken the view that it should be possible for tests to be used interchangeably, with a focus on attempting to demonstrate that candidates obtain the same scores on CB and PB forms of the same examination, subject to measurement error. For example, Paek (2005) reviewed a large body of comparability research focused primarily on the American schools sector, concluding that ‘in general, computer and paper versions of traditional multiple-choice tests are comparable across grades and academic subjects’ (Paek 2005:17). Two particular areas where there tends to be less clarity are where tests involve extensive reading passages or composition of long essay-type responses. Paek (ibid) notes that CB tests with extended reading passages have tended to appear more difficult than their PB counterparts and she suggests that this may be related to differences in the reading comprehension strategies that candidates employ. For example, on paper, candidates may highlight certain relevant lines of text or use visual awareness techniques to remember where particular information is located on the page. It is argued that the use of scrolling (as opposed to pagination) to navigate electronic text inhibits the use of this ‘visual learning’ because the sense of placement on the page is lost (Paek 2005:18). To some extent, this is likely to be true, but as people become more used to navigating electronic text on the Internet by scrolling, perhaps reading strategies will adapt accordingly and this issue will become redundant. In fact, scrolling was found to be a more popular method of navigating through text than pagination in trials of the computer-based version of PET (Hacket 2005). Paek concedes that, particularly with new tools such as electronic highlighters being introduced in order to facilitate interaction with electronic reading passages which more closely mimics the way candidates interact with text on paper, there is further promise in this area.

For Writing assessment there is concern, not only for the way that candidates may interact differently with the test on computer, but also for the way examiner marking may be influenced by the presentation of scripts. It has been argued that the increased legibility of typed scripts over handwritten scripts has effects on examiners’ perception of the writing, though there is a lack of consensus over the
direction of this difference: some have argued that untidy, illegible responses which have obvious crossings-out and editing are marked more severely, while others have found that handwritten scripts are marked more leniently, and have suggested that this could be because errors are easier to locate in typed scripts (see Shaw 2003 for a more detailed account). Examiner training and standardisation clearly has an important role to play in ensuring that such effects are minimised as far as possible, so as not to impact unfairly on candidate scores.

Many studies of the effects of test mode on Writing assessment have attempted to account for preference in test format and computer familiarity, by comparing scores of candidates taking PB and CB assessments while controlling for some general measure of ability. Studies of this nature conducted by Russell and his colleagues (Russell 1999, Russell et al 2003, Russell and Haney 1997, Russell and Plati 2002) provided evidence that tests requiring candidates to complete assessments comprising open-ended and essay type responses on paper tended to underestimate the performance of those more used to writing on computer. They recommended that candidates be given the option of writing in the medium of their choice, arguing that test conditions would then better reflect real-world practice, making for a more authentic and valid assessment.

    In essence, the notion of allowing each examinee to customise their testing experience, such that they are able to demonstrate their best performance, is consistent with the practice of providing accommodations to students with special needs. To be clear, we are not advocating that examinees be provided with access to any tool or test format such that the accommodation itself leads to a higher test score. Rather, we suggest that examinees should be able to customise the environment in which they perform the test so that the influence of factors irrelevant to the construct being measured is reduced. (Russell et al 2003:289)

handwriting or typing their responses to the Writing component.

Cambridge ESOL have carried out a number of comparability studies in relation to the introduction of CB IELTS (Green and Maycock 2004, Maycock and Green 2005, Blackhurst 2005) and the computer adaptive CB BULATS test (Jones 2000). However, if we acknowledge the view that we should not expect candidates to obtain the same score regardless of test format, but accept that they are likely to maximise their performance in the format which suits them best, then studies of this kind become limited in the amount of information they contribute. Studies similar to those conducted by Russell et al (2003), comparing candidates taking PB and CB assessments while controlling for some measure of general ability, may prove more useful. This has recently been attempted with data from live CB IELTS administrations, and the evidence appears to support the view that performance is not affected by test format, once ability is accounted for. Band scores for candidates taking the PB and CB forms of the test were compared using Multivariate Analysis of Covariance (MANCOVA): no significant differences were found between the performance of the PB and CB groups on the Reading, Listening and Writing components, once Speaking score was controlled for (Speaking is the same for PB and CB tests, and was included in the analysis in an attempt to control for general proficiency). A separate Analysis of Covariance (ANCOVA) was then carried out on CB candidates only, to assess the effect of opting to complete the Writing component on paper or computer. Again, there were no significant differences in Writing score between those who typed and handwrote their essays, once Reading, Listening and Speaking band scores were controlled for. It seems that this may be a useful approach to adopt for comparability work on other CB products.
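The logic of the analyses just described – testing for a format effect once general ability is controlled for – can be sketched as a regression with a CB/PB indicator. This is an illustration only: the data below are synthetic, and the actual study used MANCOVA/ANCOVA on live IELTS band scores.

```python
import numpy as np

def format_effect(scores, covariates, is_cb):
    """Regress the score of interest on the control covariates plus a
    CB/PB indicator; the indicator's coefficient estimates the format
    effect once ability (the covariates) is accounted for - the same
    logic as the ANCOVA described above, without the F-test."""
    X = np.column_stack([np.ones(len(scores)), covariates, is_cb])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return beta[-1]  # coefficient on the CB indicator

# Invented data: writing band depends on speaking band, not on format
rng = np.random.default_rng(0)
speaking = rng.uniform(4, 9, size=200)
cb = rng.integers(0, 2, size=200)          # 1 = computer-based
writing = speaking + rng.normal(0, 0.3, size=200)  # no true format effect

effect = format_effect(writing, speaking.reshape(-1, 1), cb)
```

With no true format effect built into the data, the estimated coefficient on the CB indicator should be close to zero, mirroring the null result reported for the live CB IELTS comparison.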
References and further reading

Bennett, R E (1998) Reinventing assessment: Speculations on the future of large-scale educational testing, New Jersey: Educational Testing Service Policy Information Center, retrieved from www.ets.org/Media/Research/pdf/PICREINVENT.pdf

Bennett, R E (2003) Online Assessment and the Comparability of Score Meaning, Educational Testing Service Research Memorandum RM-03-05, retrieved from www.ets.org/Media/Research/pdf/RM-03-05-Bennett.pdf

Blackhurst, A (2005) Listening, Reading and Writing on computer-based and paper-based versions of IELTS, Research Notes 21, 14–17.

ETS (2005) TOEFL Internet-based Test Score Comparison Tables, retrieved from www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_iBT_Score_Comparison_Tables.pdf

Green, T and Maycock, L (2004) Computer-based IELTS and paper-based versions of IELTS, Research Notes 18, 3–6.

Hacket, E (2005) The development of a computer-based version of PET, Research Notes 22, 9–13.

Horkay, N, Bennett, R E, Allen, N, Kaplan, B and Yan, F (2006) Does it matter if I take my writing test on computer? An empirical study of mode effects in NAEP, Journal of Technology, Learning and Assessment 5 (2), retrieved from www.jtla.org

Jones, N (2000) BULATS: A case study comparing computer based and paper-and-pencil tests, Research Notes 3, 10–13.

Jones, N (2006) Assessment for learning: the challenge for an examination board, in Excellence in Assessment 1, retrieved from www.assessnet.org.uk/

Maycock, L and Green, T (2005) The effects on performance of computer familiarity and attitudes towards CB IELTS, Research Notes 20, 3–8.

Mislevy, R J (1992) Linking Educational Assessments: Concepts, Issues, Methods, and Prospects, Princeton: ETS.

Paek, P (2005) Recent Trends in Comparability Studies, PEM Research Report 05-05, Pearson Educational Measurement, August 2005, retrieved from www.pearsonedmeasurement.com/research/research.htm

Pellegrino, J W, Chodowsky, N and Glaser, R (Eds) (2001) Knowing what students know: The science and design of educational assessment, Washington, DC: National Academy Press.

Russell, M (1999) Testing on computers: A follow-up study comparing performance on computer and on paper, Education Policy Analysis Archives 7 (20), retrieved from http://epaa.asu.edu/epaa/v7n20

Russell, M, Goldberg, A and O’Connor, K (2003) Computer-based Testing and Validity: a look back into the future, Assessment in Education 10 (3).

Russell, M and Haney, W (1997) Testing writing on computers: an experiment comparing student performance on tests conducted via computer and via paper-and-pencil, Education Policy Analysis Archives 5 (3), retrieved from http://epaa.asu.edu/epaa/v5n3

Russell, M and Plati, T (2002) Does it matter what I write? Comparing performance on paper, computer and portable writing devices, Current Issues in Education [Online] 5 (4), retrieved from http://cie.ed.asu.edu/volume5/number4

Seddon, P (2005) An overview of computer-based testing, Research Notes 22, 8–9.

Shaw, S D (2003) Legibility and the rating of second language writing: the effect on examiners when assessing handwritten and word-processed scripts, Research Notes 11, 7–10.
writing should allow Cambridge ESOL to investigate and address the behaviour of the individual examiner when confronted with particular candidate responses or particular writing tasks.

This article describes efforts to conceptualise the Cambridge ESOL Writing assessment setting as a workflow in terms of the facets of the setting within an ESM environment.

Writing assessment as a ‘marking model workflow’

A ‘marking model workflow’ is a procedural approach for processing a candidate’s response. At its very simplest, a candidate in response to an input task in the testing instrument provides a specified output and the output is the mark awarded to the candidate as an estimate of the ability which is being tested. So, the input to a marking model is a candidate’s work (e.g. a script) and the intended output is marks of known, consistent accuracy and precision that are independent of the marker who marked the script (‘script’ is the traditional Cambridge ESOL word although the term being used for ESM is ‘response set’, covering CBT responses, audio recordings, etc.). Various marking and quality assurance processes take place between the input and output stages. This process constitutes a basic workflow which can be built into the organisation’s standard procedures (and can be quality-assured).

In workflow terms, a work item is an instance of a workflow (e.g. the processing of a particular response such as a complete script, single question or part question) on which the actions of the workflow will be carried out. These actions or work steps are carried out/initiated by a user (a designated class of user such as Principal Examiner (PE), Team Leader (TL) or Assistant Examiner (AE)). A role defines a class of user that has access rights and other properties affecting how users in that class can interact with the workflow system. Work introduction can be thought of as the initial state for a work item in the workflow system. Responses are imported into the workflow system and located in a marking work queue (a place where a work item is stored awaiting a work step action to be carried out). Workflow routing is the logical linkage between two or more work steps such that the completion of one work step causes one or more other work steps to become enabled and the work items to be loaded into the work queue associated with the enabled work step(s). Figure 1 shows the relationship between these elements.

The potential for electronic data capture within a workflow in an ESM environment is of particular value in the following areas:

• workflow conceptualisation

• identification and manipulation of facets which constitute the data in the workflow

• subsequent movement and reconciliation of data within the workflow.

The marking model workflow: basic building blocks

Marking models can, for example, be built from basic types of marking: single; double; multiple; review; and gold standard. Figures 2–6 show the basic work and data flows (the types of marking can be integrated into full marking models).

With single marking, one rater records his or her marks and annotations, which are loaded into a data store for subsequent processing. With double marking, a response is independently marked by two raters, who both record their marks and any annotations. Both sets of marks and annotations are loaded into the data store for subsequent processing. A variant on the double marking model is the multiple marking model, i.e. multiple observations (n) of the same sample of performance.

With review marking, a candidate’s response and the original rater’s marks and annotations are presented to the reviewer – such as a TL – who enters their own marks and annotations for loading into the data store. Both sets of marks and annotations are stored for subsequent processing, though generally the reviewer’s mark will take precedence. With gold standard seeding, responses with pre-agreed but secret marks are introduced into a rater’s work queue at certain intervals. A gold standard script is a clean copy of a script previously marked by a group of PEs. A Principal Examiner would determine, in consensus with a small group of other PEs or senior TLs and in advance of marking, what the score should be on a sample number of scripts. These scripts would then be introduced on a periodic basis throughout the marking period for marking by AEs. The rater is unable to distinguish these gold standard responses from other responses and marks them as normal, his or her marks and annotations being loaded

[Figure 1: Elements of a marking model workflow – workflow routing links Work Step 1 to Work Step n; responses await work steps in the workflow queue; scores are loaded into the data store]
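The workflow elements defined above (work item, work step, work queue, routing) can be sketched in code. The class and step names below are illustrative only and are not taken from Cambridge ESOL's ESM system.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class WorkItem:
    """An instance of a workflow, e.g. a script or single question."""
    response_id: str
    marks: list = field(default_factory=list)

class Workflow:
    def __init__(self, routing):
        # routing: work step -> next work step enabled on completion
        # (None marks the end of the workflow)
        self.routing = routing
        self.queues = {step: deque() for step in routing}

    def introduce(self, item, first_step):
        """Work introduction: the initial state of a work item."""
        self.queues[first_step].append(item)

    def complete_step(self, step, mark):
        """A user (AE, TL, PE...) carries out the step; routing then
        loads the item into the queue of the enabled next step."""
        item = self.queues[step].popleft()
        item.marks.append(mark)
        next_step = self.routing[step]
        if next_step is not None:
            self.queues[next_step].append(item)
        return item

# A two-step review-marking workflow (hypothetical step names)
wf = Workflow({"first_marking": "review", "review": None})
wf.introduce(WorkItem("script-001"), "first_marking")
wf.complete_step("first_marking", 17)
item = wf.complete_step("review", 18)  # reviewer's mark takes precedence
```

Both marks are retained on the work item, matching the review model above in which the reviewer's marks are stored together with the first marks.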
[Figure 3: Double marking model – a response item is marked independently by Rater 1 and Rater 2; both marks are loaded into the data store]

[Figure 4: Multiple marking model – a response item is marked independently by Raters 1 to n; all marks are loaded into the data store]

[Figure 5: Review marking model – the reviewer re-marks the candidate response; the reviewer’s marks and annotations are stored together with the first marks]

[Figure 6: Gold standard seeding marking model – gold standard (GS) responses are introduced into the rater’s marking; the rater’s GS marking is compared with the agreed marks; if within tolerance, marking continues; if not, the supervisor intervenes]
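The gold standard seeding loop of Figure 6 can be sketched as follows. The tolerance value and seeding interval here are invented for illustration.

```python
def seed_gold_standard(queue, gs_responses, interval):
    """Insert GS responses into an ordinary marking queue at (roughly)
    regular intervals; the rater cannot tell them apart from other
    responses and marks them as normal."""
    seeded = list(queue)
    for k, gs in enumerate(gs_responses, start=1):
        seeded.insert(min(k * interval, len(seeded)), gs)
    return seeded

def within_tolerance(rater_marks, gs_marks, tolerance=1):
    """Compare the rater's marks on GS responses with the pre-agreed
    marks: True means marking continues; False triggers supervisor
    intervention, as in Figure 6."""
    return all(abs(r - g) <= tolerance for r, g in zip(rater_marks, gs_marks))
```

Because the agreed marks come from a consensus of PEs in advance, this check is independent of any particular TL or supervisor, which is the point the article makes about gold standard comparison.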
into the data store for subsequent comparison with the gold standard marks. In this way a rater’s marks may be compared with a gold standard that is independent of any particular TL or supervisor.

Conceptualising the writing assessment setting in terms of ‘facets’

Writing tests are highly complex events involving multiple ‘facets’, some of which may be difficult to predict or control, which co-operate to produce a test score. Consideration of the various facets in writing assessment can reveal interactions between different facets of the assessment context which have a systematic influence on scores. The principal facets of the assessment context can be categorised into three groups: candidate (ability), task (difficulty) and rater (severity/leniency). Clearly, there are systematic facets of rater behaviour when confronted with particular candidates, particular tasks or particular test formats.

The modelling of the assessment characteristics – made possible by a faceted approach – has three primary functions:

1. A practical function – estimates of candidate ability may legislate for the features of both the rater and task thereby generating comparable candidate abilities which can be generalised across a universe of raters and tasks.

2. A planning and developmental function – for systems development, for example.

3. A research function – raising a host of research questions relating to facets of the rating assessment context.

A clearly articulated research agenda would be required in order to investigate the significance for specific measurement variables that any proposed model may suggest are likely to be of some importance. Examples of research questions relating to the facet of scoring might include:

• In what ways do raters differ? Is there a gender effect? (Facets of Rater Status, Rater Profile and Rater Behaviour.)

• Is it possible to identify distinct rater types and certain patterns of rater behaviour? (Facet of Rating Behaviour.)

• What amount of training/re-training is required? Can training improve raters’ self-consistency? (Facets of Rater Behaviour and Rating Training.)

• How does assessment differ when marking electronically as opposed to paper-based marking? (Facets of Rater Profile and Rater Behaviour.)

Conceptualising writing assessment in terms of facets offers the potential for garnering complex data on the influence of rater and task characteristics on candidate ability estimates in the assessment setting (McNamara 1996). For example, it is conceivable that certain raters variably respond to candidates of particular L1 backgrounds or that gender effects may exist – where the gender of rater and candidate may influence scores (facet of Rater Behaviour or Rater Profile). It may be that the physical setting (which provides a context for the assessment) has an influence (facet of Rater Setting). In fact, any or all of the facets may exert a possible influence on the outcome of a test score. It is thus possible to collect information on the impact of any one of these facets (or any specific combination of them).

Each facet (or group of facets), assembled in a variety of ways, represents a potential source of data collection in the assessment context. If, for example, it is necessary to investigate finer-tuned aspects of the interaction of particular facets then the key facets must first be identified. These will constitute a focus for subsequent analyses. Facets of scoring validity, for example, can be identified and constructed for a particular assessment scenario, i.e. a particular kind of rating for a particular type of rater on a particular rating occasion. Suitable mechanisms for data collection and storage can be built into the workflow systems and it would be necessary to ensure that adequate data is both collected and stored for retrieval. Decisions as to whether data is required in real time for grading purposes or whether it is needed for subsequent validation purposes will need to be taken.

A facet approach enables the researcher to deconstruct any assessment setting into relevant constituent facets in order to address specific research questions relating to facets of the rating assessment context. In this way, facets can be assembled/re-assembled in a variety of different ways offering a number of key benefits:

• score matching through tasks to best reflect both the knowledge and ability of candidates, i.e. an effective scoring/procedural system

• knowledge and ability of candidates mediated through people (e.g. raters) and systems (e.g. scaling)

• introduction of stable and consistent scores

• ability to demonstrate an optimum marking model for score dependability

• greater control for assessment interactions

• introduction of control mechanisms through data collection.

An argument can be made for conceptualising facets of the assessment setting in terms of the various constituent validity parts of Weir’s Socio-Cognitive Validation Framework (2005) described in Research Notes 21 (Weir and Shaw 2005). The framework has been developed with Cambridge ESOL and offers a perspective on the validity of Cambridge ESOL Writing tests (Shaw and Weir forthcoming). Of particular interest here are the a priori validation components of context and cognitive validity and the a posteriori component of scoring validity (which together constitute what is frequently referred to as construct validity). Cambridge ESOL follows this socio-cognitive approach in relation to the Main Suite examinations where attention is paid to both context validity and to cognitive validity in terms of the cognitive processing and resources that are activated by test tasks. The ‘superordinate’ facets of context, scoring validity and the test taker can thus be deconstructed into sub-facets illustrated in Figures 7, 8 and 9.
©UCLES 2007 – The contents of this publication may not be reproduced without the written permission of the copyright holder.
18 | C A M B R I D G E E S O L : R E S E A R C H N OT E S : I SS U E 2 7 / F E B R U A R Y 2 0 0 7
Figure 7: Facets of scoring validity

• Facet of Rating: Automatic; Semi-automatic; Clerical; Semi-skilled subject expert; Expert
• Facet of Rating Timing: Time of day; Session; Period of year; Period into marking episode
• Facet of Rating Status: Expert (AE/TL/PE); Experienced; Inexperienced; New/Novice; Subject expert; Clerical
• Facet of Rating Training: Recruitment; Screening; Induction; Training; Qualification; Monitoring; Grade review; Grade (last session)
• Facet of Rating Setting: At home; On-site; Individual; Group
• Facet of Rating Behaviour: Severity; Fit statistics; Mean absolute difference from GS; Skew; Correlation; Duration
• Facet of Rating Profile: Gender; Experience; Age; Background; Qualifications
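The Rating Behaviour indicators listed in Figure 7 (severity, mean absolute difference from the gold standard (GS), skew, correlation) are routine rater-monitoring statistics. As an illustration only, the sketch below computes four of them for a single rater against GS marks; the function name and data are invented, not Cambridge ESOL's operational procedure, and fit statistics would come from a measurement model rather than from raw marks.

```python
from statistics import mean

def rater_behaviour_stats(rater_marks, gold_marks):
    """Summary indicators for one rater against gold-standard (GS) marks.

    Returns the rater's severity (mean signed difference from GS:
    negative = harsh), mean absolute difference from GS, skew of the
    differences, and the Pearson correlation with the GS marks.
    """
    diffs = [r - g for r, g in zip(rater_marks, gold_marks)]
    n = len(diffs)
    severity = mean(diffs)
    mad = mean(abs(d) for d in diffs)
    # Skew of the rater-GS differences (0 for a symmetric error pattern).
    m = mean(diffs)
    sd = (sum((d - m) ** 2 for d in diffs) / n) ** 0.5
    skew = 0.0 if sd == 0 else mean(((d - m) / sd) ** 3 for d in diffs)
    # Pearson correlation between the rater's marks and the GS marks.
    mr, mg = mean(rater_marks), mean(gold_marks)
    cov = sum((r - mr) * (g - mg) for r, g in zip(rater_marks, gold_marks))
    sr = sum((r - mr) ** 2 for r in rater_marks) ** 0.5
    sg = sum((g - mg) ** 2 for g in gold_marks) ** 0.5
    corr = cov / (sr * sg)
    return {"severity": severity, "mad": mad, "skew": skew, "correlation": corr}

# Hypothetical marks for one rater on five scripts, against GS marks.
stats = rater_behaviour_stats([3, 4, 2, 5, 3], [4, 4, 3, 5, 4])
```

On this invented data the rater shows mild severity (marks average 0.6 below the gold standard) but a high correlation with the GS ranking, the kind of profile such monitoring is designed to surface.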
Figure 8: Facets of context validity

• Task Data Facet: Dichotomous; Polytomous (without or ignoring judge mediation); Polytomous (taking judge mediation into account)
• Response Facet: Word completion; Open cloze; Multiple choice; Information transfer; True/False; Sentence transformations; Short Answer Question; Extended Answer; OCR/CIE task types
• Output Facet: Transactional letter; Essay; Report; Discursive composition; Short story; Article; Proposal; Review
• Input Facet: Instructions; Rubric; Task information
• Channel Facet: Receptive; Productive
• Delivery Facet: On-demand; Fixed date; Paper-and-pencil; Paper-based answer booklet; Computer-based; Screen
Figure 9: Facets of the test taker

• Candidate Skills: Knowledge; Experience; Ability; Use of computers; World knowledge; Cultural background
• Candidate Profile: Gender; Age; Nationality; L1; School type; Years of study; Exam preparation; Exams taken; Reasons for taking exam
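One common way to operationalise candidate, task and rater facets jointly is the many-facet Rasch tradition associated with McNamara (1996), in which the log-odds of a candidate gaining credit are modelled additively across the facets. The toy sketch below, with invented parameter values in logits, illustrates the basic idea only; operational analysis would use dedicated estimation software, not this hand calculation.

```python
import math

def p_success(ability, task_difficulty, rater_severity):
    """Additive facets model: logit(p) = ability - difficulty - severity."""
    logit = ability - task_difficulty - rater_severity
    return 1.0 / (1.0 + math.exp(-logit))

# Invented parameters: the same candidate and task, scored by two raters.
lenient = p_success(ability=1.0, task_difficulty=0.5, rater_severity=-0.5)
severe = p_success(ability=1.0, task_difficulty=0.5, rater_severity=1.0)
# A more severe rater lowers the expected score for an identical performance,
# which is why severity must be estimated and adjusted for.
```

The point of such a model is that candidate ability estimates can be freed from the particular rater and task encountered, exactly the fairness concern raised above.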
modelling. In this sense, an iterative process of research and development is advocated, one that starts with a simple model and subsequently adds complexity until the business is satisfied with the balance it has achieved. Such a model will probably explicitly model candidate, task and rater facets. There would be a need for the model to collect rich, robust data that facilitate the investigation and (if necessary) ongoing operationalisation of any issue considered relevant to understanding the nature of the assessment and promoting fairness. These data will need to embrace:

• details of the assessment (e.g. task parameters, administration, linguistic demands)
• candidate characteristics (e.g. gender, age, nationality, L1, years of study, knowledge, experience)
• examiner characteristics (e.g. qualifications, experience)
• rating details (clerical or 'expert' examiners, at-home or residential rating, examiner behaviour, aspects of scoring).

Much of what is needed can already be found in existing examiner databases, test banks, candidate information sheet data stores and elsewhere. These sources would need to be related to test performances to facilitate investigations. As the scoring model is refined, some of these data might come to inform the operational management and calculation of test scores. The system would need to be designed with this level of flexibility in mind.

Capitalising fully on ESM would require a research agenda designed to:

1. Explore the significance for measurement of facets that the model may suggest are of importance (e.g. the effect of the characteristics of the rater on test scores).
2. Focus on the nature of interactions between the facets of assessment (e.g. the candidate/rater and rater/task interactions), especially given the interactional nature of performance assessment.
3. Ascertain what it is appropriate and realistic both to consider and to assess in any given assessment context (position/stance adopted and a supporting rationale, feasibility and practicality of assessment proposals, etc.).

To meet the requirements of the Cambridge Assessment organisation as a whole, the outcomes of ESM would need to include a practical, feasible and manageable business system which builds on existing structures, but supports sufficient flexibility to accommodate the needs set out above and can also manage interactions between constantly changing systems (e.g. technical considerations, issues related to examiner payment and contracts). The harnessing of technology and corpus approaches to obtaining, storing and rating candidate performance is a major challenge that lies ahead for Cambridge ESOL.

References and further reading

Bachman, L F (2004) Statistical Analyses for Language Assessment, Cambridge: Cambridge University Press.

McNamara, T F (1996) Measuring Second Language Performance, Harlow: Longman.

Milanovic, M and Saville, N (1996) Performance Testing, Cognition and Assessment: Selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem (Studies in Language Testing volume 3), Cambridge: UCLES/Cambridge University Press.

Shaw, S D and Weir, C J (forthcoming) Examining Writing: Research and practice in assessing second language writing (Studies in Language Testing volume 26), Cambridge: Cambridge ESOL/Cambridge University Press.

Weir, C J (2005) Language Testing and Validation: An evidence-based approach, Basingstoke: Palgrave Macmillan.

Weir, C J and Shaw, S D (2005) Establishing the validity of Cambridge ESOL Writing tests: towards the implementation of a socio-cognitive model for test validation, Research Notes 21, 10–14.
of exam material in terms of the cultural context it could be considered to exemplify; the commissioning of ESOL consultants to author and conduct awareness-raising activities for writers of our exam materials; and an evaluative stage, which aimed to gather data describing any impact the initiatives were felt to have had on the materials, from the perspective of the item writers.

This report begins by summarising the context and method of producing examination material at Cambridge ESOL. It goes on to describe the rationale, categorisation, training and impact of the initiatives, and concludes with a brief comment on possible future directions.

The context and method of examination production at Cambridge ESOL

Cambridge ESOL offers a range of ESOL examinations to a worldwide candidature of around 1.8 million candidates in over 135 countries. Test material passes through four main stages in order to ensure a rigorous and quality-driven approach to the production of examinations: commissioning and editing; pretesting; test construction; and question paper production. This final stage consists of the external vetting, checking and printing of the constructed papers or tasks. Commissioning/editing, pretesting and test construction are the key stages during which the choice, consideration and editing of materials for assessment purposes is made. In terms of the commissioning of material, small teams of item writers, who may work on one or more of the components of the exams, are commissioned to source materials and write tasks according to the individual paper specifications. When these materials are submitted, they are initially evaluated at a pre-editing stage where they are accepted for editing or are rejected. Following editing, material is constructed into pretests to enable trialling on a representative sample of the international candidature; this ensures as far as possible that material and tasks are appropriate for the level and for the intended candidates. After completing the pretests, the candidates and teachers also have the opportunity to give written feedback on the topics and texts contained in them. When the materials are reviewed, candidate/teacher feedback on the materials is carefully considered alongside the statistical data gathered as a measure of their linguistic difficulty. Feedback indicating that material may not be suitable for any part of our international candidature informs the decision at this stage as to whether material progresses beyond this point into an actual examination paper. Material which meets the criteria, both statistically and in terms of appropriacy for the candidature, becomes available for test construction. At test construction, the constructed papers are carefully considered to ensure that, as well as fulfilling the statistical criteria at the level for which they are designed, each paper is accessible and appropriate for the situations, needs and wants of the target international candidature. Finally, there is an overview stage, at which the tests are considered alongside one another; this provides a global check on the overall appropriacy, quality and suitability of the examinations.

Rationale for broadening the cultural context

For some time, there had been a growing awareness that English is increasingly used as an international language, so that the target language use of candidates may be in a local or an international context, and that language testers had a responsibility to ensure this was appropriately reflected in testing materials. This perception was reinforced by data gathered by Cambridge ESOL on the purposes for which candidates were taking exams; such data indicated that candidates undertaking English-language examinations often do so for reasons other than work or study in the United Kingdom. There was some evidence that English-language test takers in our contemporary context may have no common need to engage with, visit or work in the culture from which their English-language test originated. Further impetus for the Broadening the Cultural Context initiatives arose partly out of a current focus in testing theory which stresses the importance of the appropriacy of examination materials for the target language use of the candidates, and partly from study of the data gathered by Cambridge ESOL's ongoing investigations into the needs and wants of the contemporary international candidature. All these considerations underpinned the rationale which led Cambridge ESOL to develop and implement a number of initiatives aimed at ensuring the continuing cultural accessibility of examination materials, and at raising awareness of the issues among the item writer cadre. Additionally, and by no means a minor point when considering the cultural context of examination materials, it is an important shared assumption at Cambridge ESOL that, in order to elicit the best performance from candidates in terms of their linguistic ability, examination materials ought, as far as possible, to be accessible, relevant and interesting to the target international candidature.

Preliminary considerations in broadening the cultural context

As a preliminary initiative to investigate the issues, Cambridge ESOL established a working group of key external consultants and internal staff who had roles in managing the production of examination materials. This group had the primary goal of identifying what principles or objectives could be said to underpin any subsequent initiatives in this context which might be developed and implemented within the examination production process. Following a series of consultations, this group identified, among others, the following objectives:

• It should not be assumed that candidates are knowledgeable about, or interested in, British culture.
• Texts and other task input material should not assume that candidates enjoy the lifestyles of particular income groups or nationalities.
• Exam materials should be drawn from as wide a range of sources as possible, including examples of non-British English.
• Good exam materials are often drawn from sources that have the widest possible original target audience.
• Where cultural assumptions might impede understanding, materials should be edited to gloss or, if necessary, remove cultural allusions or references.
• Assumptions should not be made about a candidate's location, cultural background and/or lifestyle.

With these objectives defined and agreed, two further initiatives were undertaken by Cambridge ESOL: the training of item writers, and the categorisation of materials.

Training of item writers

It was agreed that training should be rolled out to all item writers, partly as an awareness-raising initiative regarding the cultural accessibility of exam materials, and partly to ensure that the issues were communicated and discussed as widely as possible. A number of experienced writers and consultants, including some of those from the original working group, were commissioned by Cambridge ESOL to author skill- and exam-specific training materials for the widest possible dissemination and use. In order to ensure these training materials were appropriate and useful for each of the contexts in which Cambridge ESOL offers international examinations, the training packs were tailored to particular exam contexts, including Business, Young Learner and General English. At their core, all training materials contained the exemplification of the fundamental principles for broadening the cultural context, which were identified and developed by the initial working group, and which it had been agreed should underpin any subsequent initiatives.

To assist the item writers, two sets of guidelines were developed using external and internal resources. These guidelines were designed to be companion documents to the Broadening the Cultural Context training, as a source of further reference for item writers. One document arose out of requests from item writers for assistance with technical issues associated with using the internet as a materials resource. The second document was written to offer writers assistance and guidance in their search for materials with a broader cultural context. This document draws on both published materials and the internet as potential sources, discusses some of the potential sources available, and contains sample texts which are commented on by experienced item writers and chairs in terms of their cultural perspective and suitability for use as examination material.

Categorisation of materials

Following a process of consultation and consideration, certain broad categories under which material could be classified were agreed; these would be used for a monitoring process to ensure an appropriate variety of material was progressing onto the examinations. It was considered important that the criteria for categorising materials should not be superficial, i.e. simply the names used, or the topic or location dealt with in a text; instead, the classification should be made at a fundamental level which considered the cultural context or overall perspective exemplified by any particular piece of examination material. In practice, it was decided that each item writer, prior to the submission of material to Cambridge ESOL, should consider and identify, according to the agreed broad criteria, the cultural context exemplified in each case. When material is submitted for consideration, according to ongoing Cambridge ESOL policy, material which is felt to exhibit cultural bias or too narrow a cultural context does not proceed beyond the pre-editing stage. At the other end of the production cycle, it was agreed that material should be checked at the overview stage, which reviews the complete exams prior to their going forward to external vetting and printing. These categorisations aim to ensure that all new material is classified according to the cultural context it exemplifies, and that all existing material which currently goes into tests is subject to the same scrutiny.

Impact of training and categorisation

In terms of the impact of the categorisation initiatives at the pre-editing and overview stages, a positive observation has been that item writers' awareness of issues such as the cultural accessibility of materials has been raised; the materials submitted to Cambridge ESOL at the pre-editing stage are observed to exhibit a broad range of widely accessible cultural contexts. The implications for future tests may also be positive, in that Cambridge ESOL can feel confident that its item writers are generating materials which will be, as far as possible, accessible and appropriate for the wide variety of international contexts in which its tests are taken.

In order to assess the impact of the training initiatives from the perspective of item writers, a survey was designed and administered by email to a representative sample of Cambridge ESOL's item-writing cadre. The representative sample was obtained by a systematic method of selecting every third member of each item-writing team. The responses from the sample group indicated the following:

• The objectives for Broadening the Cultural Context were perceived by the item writers to be an appropriate and useful articulation of the frame of reference.
• The training and supporting documents were felt to be, on the whole, effective, adequate and useful.
• The training, which aimed to raise awareness of issues of the cultural context of examination materials, has been perceived positively by the item writers and is felt to have had a positive impact on the materials.
• The training was felt to have been applied consistently and logically across the areas, such as General and Business English, in which Cambridge ESOL offers examinations.

The comments with which some of the sample group supported their answers were, in general, quite positive, for example: 'I think the training helped to dispel some myths and made the whole area more objective and concrete'. Finally, in the majority view of the sample group, it was considered that the training initiatives were appropriate to the contexts of the current candidature, and as a corollary that the training initiatives aimed at raising awareness of issues of cultural accessibility and fairness were having, and should continue to have, a positive impact on Cambridge ESOL's international examinations.
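The systematic sampling method described above, selecting every third member of each item-writing team, can be sketched as follows. The team rosters and writer codes here are invented for illustration, and the starting index is an assumption, since the article does not specify where counting began.

```python
def systematic_sample(team, k=3, start=0):
    """Select every k-th member of a team roster, beginning at `start`."""
    return team[start::k]

# Hypothetical item-writing teams; in practice the starting point could be
# varied (or randomised) per team to avoid always surveying the same people.
teams = {
    "Reading": ["W01", "W02", "W03", "W04", "W05", "W06", "W07"],
    "Writing": ["W08", "W09", "W10", "W11", "W12"],
}
sample = [w for team in teams.values() for w in systematic_sample(team)]
```

Sampling within each team, rather than from the pooled cadre, keeps every team represented in the survey, which matters when impact may differ by exam component.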
Future directions for broadening the cultural context

Currently, although it is felt by Cambridge ESOL that the Broadening the Cultural Context initiatives have had a positive impact in increasing the cultural accessibility, relevance and so, it is hoped, the appeal of the examination materials, efforts will continue in order to ensure that the content of the exams remains appropriate and relevant with regard to the international candidature. As new item writers receive training in the importance of the cultural accessibility and relevance of materials, efforts are continuing at all stages of the production of examination materials to ensure their cultural accessibility and appropriacy. The overall aim of the approach is to ensure that each test has a balanced variety of materials, thus enabling the international candidature to come to all the tests as equally as possible in terms of content and cultural perspective. Over time, further initiatives in this and other areas may be identified and developed should the nature of the candidature change.
Conference reports

Language Testing Forum – November 2006

This year's Language Testing Forum was held at the University of Reading and was hosted by the Centre for Applied Language Studies (CALS). The introductory lecture, entitled Spoken fluency – theory and practice, was given by Prof Mike McCarthy (Nottingham University). Common perceptions of fluency were considered and questioned in light of corpus evidence which has shown that native speakers are often dysfluent and uneven in their spoken performance. After listening to a dialogue between a native and a non-native speaker, the group discussed the fluency and relative contribution of each speaker. The idea of conversations, rather than speakers, being fluent was suggested, in that speakers communicate strategically and cooperatively and are jointly responsible for the continuation of the conversation. The implications for oral assessment were discussed, particularly the role of the interlocutor.

Two papers were presented by members of Cambridge ESOL. Martin Robinson described the role of content specialists in both the development and production of ESP tests, in particular the International Legal English Certificate (ILEC) and the International Certificate of Financial English (ICFE). In order to deliver high-quality testing instruments, it is vital that there is a full understanding of the language use domain and the degree of task authenticity. When testing for specific purposes or contexts, experts may be enlisted to provide some of this information. Traditionally, content specialists have been used only at the needs analysis stage of test development; however, in the development of ILEC and ICFE these specialists have been used throughout the design and development process in an integrated approach. In addition, this collaboration is continuing with ILEC throughout the item and test production process. Martin outlined the issues involved with this approach and described the nature of working with specialist groups.

Ardeshir Geranpayeh detailed the use of Structural Equation Modelling (SEM) to inform the revision of FCE. Ardeshir began with a history of FCE test development and revision before outlining some of the empirical procedures used in the current revision. When revising any test it is essential that any changes in format do not significantly alter the underlying test constructs; Ardeshir described two studies in which the constructs of the FCE exam were explored. The first involved the production of a number of construct models, based on data from one major FCE examination session. The viability of each model was then tested using SEM and confirmatory factor analysis (CFA). The model which best described the underlying constructs was chosen, and then verified in a second study using data from two further FCE sessions. This model, which is based on a componential view of language testing, is now being used to influence the current revision process.

There were a variety of interesting papers and posters given by delegates from a number of institutions including Lancaster University, the University of Bristol, Roehampton University, the University of Essex and the University of Reading. Topics covered included: language support in EAL classrooms, the effect of task on group speaking tests, reading to summarise, and the TOEFL impact study. In addition, Prof Alan Davies (University of Edinburgh) led an open discussion on the ILTA draft Code of Practice (CoP). In this workshop participants were invited to discuss whether it is necessary for organisations such as ILTA to have a CoP in addition to a Code of Ethics (CoE), and to comment on the draft CoP which has recently been developed. Issues discussed included whether it would be more appropriate to have an expanded CoE, whether a CoP should be mandatory or voluntary, and who the intended audience would be.
Call for Entries for IELTS Masters Award 2007

Each year the IELTS partners sponsor an award of £1000 for the Masters-level dissertation or thesis which makes the most significant contribution to the field of language testing. The entry procedures and timetable for the 2007 award are given below.

October/November: Meeting of IELTS Research Committee.
November/December: Announcement of award.

Please note that submission details may change from year to year, and it is therefore important that the most current procedures are consulted. Details of the application process for the IELTS Masters Award 2007 can also be found on the IELTS website: www.ielts.org