Intro To Corpus - Feb2012


Introduction to the Cambridge English Corpus

February 2012
What is a corpus?
What can you do with a corpus?
What’s in the Cambridge Corpus?
Why use corpora in ELT publications?
How has the Corpus previously been used?
What other support is there?
What is a corpus?
A corpus is a large collection of
written and/or spoken language
that can be used for research into
how a language is used.
What can you do with a corpus?
Create market-specific versions of your products

Test your intuitions

Get inspiration for creating exercises and tasks

Build vocabulary lists

Examine the language that your target market finds difficult


Authentic language
- Corpus results vs. “textbook” English.

SHOP ASSISTANT: What can I do for you Madam?
JANAKI: We are new to this store, can you tell us where the groceries are?
SHOP ASSISTANT: For the groceries turn right, madam.
JANAKI: Thank you very much for your help.
(Spoken English - A Self Learning Guide To Conversation Practice, 2007:27)

ALAN: Er <pause> I want, I want to make a casserole have you got either a
neck of lamb or an ox tail or something like that?
IDA: I've got some ox tail, erm neck of lamb <pause> Philip have you got one?
PHILLIP: No, nope, neck of lamb's off, sorry
ALAN: Well I'll have an ox tail then <laugh>
IDA: Ok, well that's one thirty for the whole one, is that too much for you?
ALAN: No no, I'll use that I think
(CEC – docID: 35925)
Using the Corpus, we can find patterns in language
We can target specific groups of learners by looking at the most
frequent errors they make in their language use.
What’s in the Cambridge English Corpus?
The Cambridge Corpus has two parts:

• Native speaker corpus: British and American written and spoken material
• Cambridge Learner Corpus: learner exam scripts (written material)
More than 1.8 billion words in total!
• More than 1 billion words of written data
• 75m words of spoken data
• 40m word Learner Corpus
Where does the Cambridge English Corpus come from?
[Pie chart: Breakdown of Cambridge Corpus Resources (million words). Categories: British Written, American Written, British Spoken, American Spoken, Other corpora (e.g. business, academic), Learner corpus (coded), Learner corpus (uncoded). Figures shown, in the order extracted: 20.8, 21.3, 358.2, 823.91, 56.32, 18.1, 503.5.]
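As a quick arithmetic check, the seven figures in the breakdown can be added up and compared with the headline claim of more than 1.8 billion words. This is a minimal sketch: the variable names are illustrative, and it does not assume which figure belongs to which category.

```python
# Figures (in millions of words) as they appear in the breakdown chart.
# The slide text does not make clear which figure belongs to which category,
# so they are kept here as an unlabeled list (an assumption of this sketch).
chart_figures_million = [20.8, 21.3, 358.2, 823.91, 56.32, 18.1, 503.5]

# Round to two decimals to avoid floating-point noise in the sum.
total_million = round(sum(chart_figures_million), 2)
total_billion = total_million / 1000

print(f"Total: {total_million} million words (about {total_billion:.2f} billion)")
```

The sum comes to roughly 1,802 million words, which is consistent with the "more than 1.8 billion words" figure quoted earlier.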
Specialist corpora:

• Cambridge Business English Corpus

• Cambridge Financial English Corpus

• Cambridge Legal English Corpus

• Cambridge Academic English Corpus


The Cambridge Learner Corpus
The Cambridge Learner Corpus (CLC)
The CLC is made up of student answers to Cambridge ESOL exams.

There are currently over 40m words in the corpus, and this is growing each year.

Data from over 180,000 students from 173 countries.

Around 20m words of the Learner Corpus have been error coded.
What do we know about the candidates and their scripts?

• First language
• Nationality
• Exam
• CEFR level
• Year of taking exam
• Educational level
• Age
• Years of English study
• Gender
• Pass or fail
Learner Error Coding System
Error Codes have been developed by the Press to add further
information about the types of errors students make.

There are around 100 different codes.

The codes can mark features such as agreement errors, spelling errors, and missing words.

Error codes are added manually.

We can use the error codes to look at what students have problems with, and what they do well.
Without error coding
(i.e. exactly what the student wrote)

‘After not very promising start, his play became stronger and more confident . He played beatifully with his eyes closed and with smile on his face.’

C2 level, CPE, Poland, 2000


With error coding

After not <#MD> | a </#MD> very promising start, his <#FV> play | playing </#FV> became stronger and more confident . He played <#S> beatifully | beautifully </#S><#MP> | , </#MP> with his eyes closed and with <#MD> | a </#MD> smile on his face .

C2 level, CPE, Poland, 2000
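
The coded sample above can also be read by a program. The sketch below is not the Press's own tooling: it assumes, from this one sample only, that every error is marked as `<#CODE> original | correction </#CODE>` and that tags are not nested. The function names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample: <#CODE> original | correction </#CODE>
# (an assumption; nested tags are not handled by this sketch).
ERROR_TAG = re.compile(r"<#(\w+)>(.*?)\|(.*?)</#\1>", re.DOTALL)

SAMPLE = (
    "After not <#MD> | a </#MD> very promising start, his <#FV> play | playing </#FV> "
    "became stronger and more confident . He played <#S> beatifully | beautifully </#S>"
    "<#MP> | , </#MP> with his eyes closed and with <#MD> | a </#MD> smile on his face ."
)


def extract_errors(coded_text):
    """Return a (code, original, correction) tuple for every tagged error."""
    return [
        (m.group(1), m.group(2).strip(), m.group(3).strip())
        for m in ERROR_TAG.finditer(coded_text)
    ]


def _normalise(text):
    """Collapse runs of whitespace left behind after tags are removed."""
    return " ".join(text.split())


def original_text(coded_text):
    """Rebuild what the student wrote by keeping the left side of each tag."""
    return _normalise(ERROR_TAG.sub(lambda m: m.group(2).strip(), coded_text))


def corrected_text(coded_text):
    """Rebuild the corrected version by keeping the right side of each tag."""
    return _normalise(ERROR_TAG.sub(lambda m: m.group(3).strip(), coded_text))


errors = extract_errors(SAMPLE)
code_counts = Counter(code for code, _, _ in errors)

for code, wrong, right in errors:
    print(f"{code}: {wrong!r} -> {right!r}")
print(code_counts.most_common())
```

An empty left side, as in the `MD` and `MP` tags, means something was missing from the student's text, so the corrected version gains a word or punctuation mark while the original loses nothing.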


Using the Cambridge Learner Corpus we can look at, for example…

The most frequent errors made by Hindi speakers

Which students most frequently miss out ‘the’ or ‘a’.

Verb errors made by students taking particular exams

…and much much more!
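
Questions like these combine the error codes with the information held about each candidate. The sketch below shows the idea only: the records are invented toy data, not real Corpus data, and the field names and the `top_errors` helper are hypothetical. Real queries would be run in the Corpus software rather than in a script like this.

```python
from collections import Counter

# Invented toy records for illustration only; NOT real Corpus data.
# Field names are hypothetical; the codes are those seen in the coded sample.
scripts = [
    {"first_language": "Hindi", "exam": "FCE", "error_codes": ["MD", "S", "MD"]},
    {"first_language": "Hindi", "exam": "CPE", "error_codes": ["FV", "MD"]},
    {"first_language": "Polish", "exam": "CPE", "error_codes": ["MD", "MP", "S"]},
    {"first_language": "Spanish", "exam": "FCE", "error_codes": ["S", "S", "FV"]},
]


def top_errors(records, n=3, **filters):
    """Count error codes over the records matching every filter, most frequent first."""
    counts = Counter()
    for record in records:
        if all(record.get(field) == value for field, value in filters.items()):
            counts.update(record["error_codes"])
    return counts.most_common(n)


# Most frequent error codes for one first language, then for one exam.
print(top_errors(scripts, first_language="Hindi"))
print(top_errors(scripts, exam="CPE"))
```

Filtering on keyword arguments keeps the same helper usable for any of the candidate details listed earlier, such as nationality or CEFR level.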


Why use the Corpus in ELT publications?
The Corpus makes Cambridge books:

• market-specific
• focused on the areas that students find difficult
• up to date
• accurate
• relevant and lively

STAND OUT FROM THE COMPETITION


Using the Corpus we can publish…

– course materials designed specifically for learner needs
e.g. based on their level of learning, first language, nationality

– materials that are genre- or register-specific
e.g. Business English, spoken English, English for Academic Purposes
Using the Corpus to increase sales

The International versions of face2face and Kid’s Box were adapted for the Spanish markets using data from the Learner Corpus.

The ESS versions showed a 72% increase in expected sales.


How has the Corpus been used?
Common learner errors taken directly from the Corpus and used in CALD.
Error information from the Learner Corpus used to formulate exercises in Objective FCE.
Use of the Corpus as the “touchstone” of Touchstone!
Frequency and genre information used in the Cambridge Grammar of English.
How is the Cambridge Corpus better than the competition?
• Bigger, so we can find examples even of rarer words, constructions, etc.

• A wider range of source data

• Learner Corpus is bigger: more exam scripts

• Learner Corpus is error coded: NO OTHER publisher has this
What other support is there?
• Software support
– Corpus software (Sketch Engine) training, user guides and support.

• Research reports
– Commission tailor-made research to suit your brief or budget.

• Top Error lists
– Request further information to work from, e.g. lists of top errors.

• Corpus Bulletin
– Corpus users will automatically receive our quarterly mailshot with Corpus news and tips.

• Connections
– Find useful documents, tips and discussions on our Connections community.

• Intranet
– Find useful information on our Intranet page.

• Skype surgery
– Get Corpus help via Skype every other Thursday at 3pm (GMT). Add: cambridge_claire.dembry.
In summary…
• The Cambridge Corpus is currently the biggest corpus used and promoted for ELT materials.

• The Cambridge Learner Corpus (CLC) is the biggest learner corpus in the world.

• The error coding system is unique: NO other publisher has this.

Using the corpus improves publications by making them interesting, targeted and relevant to students.
Join our Connections groups: search for Corpus and also Sketch Engine

Visit our website: www.cambridge.org/corpus

Find resources on the Intranet: http://intranet1.cup.cam.ac.uk/information/groups/publishing/elt/corpus/
