WK 3 Key Issues For Corpora Selection

Key Issues For Corpora Selection
BBI5411 AP. DR AFIDA MOHAMAMAD ALI
Authenticity and validity – corpus text must be genuine
Representativeness
Sampling
Balance
Questions on ethics
(Tognini-Bonelli, 2001; McEnery and Wilson 1996)
Representative of what?
‘it is not easy to be confident that a sample of texts can be thoroughly representative of all possible genres or even of a particular genre or subject field or topic’ Any attempt at corpus creation is therefore a compromise between the hoped for and the achievable.
(Kennedy 1998: 62).
A linguistic corpus should provide material for research which allows
for impartial description of language use – corpus-based research is
not prescriptive in nature.
As a corpus user, you need to know WHAT’S in the corpus before you
start looking for and interpreting linguistic data.
An example: emails in the British National Corpus (BNC)
At the time of compilation, email was a marginal means of communication (academic and government contexts).
However, the BNC contains 7 email files totalling 214,018 words.
• Issue 1: the formality of those 7 files as against today’s informal style in emails.
• Issue 2: the word scum occurs 342 times in the emails, out of 540 instances in the whole 100-million-word corpus (the 342 are included in that number). How is that possible?
• All the emails were taken from the Leeds United mailing list, and the ‘scum’ refers to Manchester United.
• Consequence – the BNC is not representative of email communication.
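To make the skew concrete, the figures quoted above can be converted into normalised frequencies. The following is only an illustrative sketch in Python; the helper name `per_million` and the variable names are assumptions introduced here, and the corpus size is the rounded 100 million words given on the slide.

```python
# Figures quoted on the slide for the BNC.
EMAIL_WORDS = 214_018        # words in the 7 email files
CORPUS_WORDS = 100_000_000   # rounded size of the whole BNC
SCUM_IN_EMAILS = 342
SCUM_IN_CORPUS = 540         # includes the 342 email instances


def per_million(hits: int, words: int) -> float:
    """Normalised frequency: occurrences per million words."""
    return hits / words * 1_000_000


email_rate = per_million(SCUM_IN_EMAILS, EMAIL_WORDS)
rest_rate = per_million(SCUM_IN_CORPUS - SCUM_IN_EMAILS,
                        CORPUS_WORDS - EMAIL_WORDS)
share_of_hits = SCUM_IN_EMAILS / SCUM_IN_CORPUS
share_of_words = EMAIL_WORDS / CORPUS_WORDS

print(f"emails: {email_rate:.0f} per million words")
print(f"rest of corpus: {rest_rate:.1f} per million words")
print(f"{share_of_hits:.0%} of all hits come from {share_of_words:.2%} of the corpus")
```

In other words, well over half of the occurrences come from a fraction of one percent of the corpus, which is why a raw count of 540 says little about general usage.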
• Example 1 (McEnery & Hardie):
If we want to study the language of service interactions in shops in the UK in the late 1990s, the sampling frame is clear – we would only accept data into our corpus which represents service interactions in UK shops in the 1990s.
However, if we only collected data gathered in coffee shops, we would not get a balanced set of data for that population.
Relatively context-specific lexis, such as latte and frappuccino, would be likely to occur much more frequently than it does in service interactions in general.
Phrases which are typical of other kinds of service interactions, such as Should I wrap that for you?, might not occur at all.
A corpus is representative if
…the findings based on its contents can be generalized to the said language variety (Leech, 1991);
…its samples include the full range of variability in a population (Biber 1993)
Corpus Representation
• Limit the population by sampling frame – e.g. language of service interactions in shops in Malaysia.
• Demographic sampling – narrower – e.g. female writers.
• Defining the genres or mode – e.g. written, romantic, medical, legal.
• Determine sample size in length and number of texts.
Representativeness
It changes over time (Hunston 2002): if a corpus is not regularly updated, it rapidly becomes unrepresentative.
e.g. Bank of English (Uni. of Birmingham, 1980s) is a monitor (dynamic) corpus that has been continually expanded since it was created.
Representativeness
Criteria to select texts for a corpus:
• External criteria (Biber’s situational perspective): defined situationally, e.g. genres, registers, text types, etc. What kind of texts? Number of texts?
• Internal criteria (Biber’s linguistic perspective): defined linguistically, taking into account the distribution of linguistic features.
• CIRCULAR – because a corpus is typically designed to study linguistic distribution, so there is no point in analysing a corpus where distribution of linguistic features is predetermined (planned and fixed).
Representativeness
2 main types (for the range of text categories represented):
• General corpora – a basis for an overall description of a language (variety); their representativeness depends on the sampling from a broad range of genres.
• Specialized corpora – domain- or genre-specific corpora; their representativeness can be measured by the degree of closure or saturation (lexical richness).
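One common way of approximating closure or saturation at the lexical level is to cut the corpus into equal-sized segments and count how many new word types each successive segment adds; when the number of new types levels off, the corpus is approaching saturation. The sketch below assumes that procedure. The function name, the lower-casing step, and the toy data are illustrative assumptions, not part of the slide.

```python
from typing import List


def new_types_per_segment(tokens: List[str], segment_size: int) -> List[int]:
    """Count how many previously unseen word types each equal-sized segment adds."""
    seen = set()
    growth = []
    # Only full segments are used so that every segment has the same size.
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        fresh = {tok.lower() for tok in segment} - seen
        growth.append(len(fresh))
        seen |= fresh
    return growth


# Toy data: a tiny, repetitive "specialized" text (invented for illustration).
toy = ("the patient was given the drug . the patient was given a dose . "
       "the dose was given . the drug was given .").split()
print(new_types_per_segment(toy, segment_size=6))
```

A falling series of counts suggests that further sampling from the same domain adds little new vocabulary; a general corpus would typically keep adding new types for much longer.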
Balance
The range of text categories included in the corpus:
The acceptable balance is determined by the intended uses.
A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration.
Example 1 again:
• For balance, we have to characterise the range of shops whose language we want to sample, and collect data evenly from across that range.
• We must make sure that the shops sampled are typical, and that we gather data from them in such a way as to avoid introducing skew into our dataset.
• Let’s say we include bookshops: we must not choose just 1 kind of bookshop (e.g. one that sells only antiquarian books).
• We must ensure the proportions of data in our corpus reflect, in some way, the numbers of each type of interaction of interest that actually occur.
• Locations of the shops should also be balanced (cover many parts of Malaysia).
The Case of the Published Materials Corpus (PMC; Nelson 2000)
A representative sample of Business English materials was needed.
7 major distributors of EFL materials were contacted and asked to provide a list of their best-selling (popularity rank) Business English titles of 1996.
Thirty-eight books were obtained for the final list.
Further parameters included type of book, gender of author, and whether the books focused on one or more of the ‘four skills’.
Balance
There is no scientific measure for balance.
It is more important for sample corpora (static) than for monitor corpora (dynamic).
LOB is a sample (static) corpus (1 million words)
• Corpora which seek balance and representativeness within a given sampling frame are snapshot corpora.
• LOB represents a ‘snapshot’ of the standard written form of modern British English in the early 1960s.
• For each category, samples of data were gathered, with each sample being of roughly similar length (2,000 words).
• Span of 30 years.
• Counterpart is Brown (American English, 1961).
• The range of texts and linguistic distributions are interdependent:
if the text range is not representative, the distribution will fail in
representativeness too.
DIACHRONIC STUDY FOR LOB (LEECH, 2004, Baker 2009)
The development and evolution of language through history.
Historical linguistics is typically a diachronic study.
SYNCHRONIC STUDY OF LOB AND BROWN CORPORA
Investigate differences between 2 language varieties in the same period.
Sampling
A corpus is a sample of a given population.
A sample is representative if what we find for the sample holds for the general population.
Samples are scaled-down versions of a larger population.
Sampling
Sampling unit: for written text, a sampling unit could be a book, periodical or newspaper.
Population: the assembly of all sampling units; it can be defined in terms of language production, reception (demographic, sex, age, etc.) or language as a product (category, genre of language data). It is the notional space within which language is sampled.
Sampling frame: the list of sampling units

Sampling
Sampling techniques:
• Simple random sampling: all sampling units within the sampling frame are numbered and the sample is chosen by use of a table of random numbers; rare features may not be accounted for.
• Stratified random sampling: the population is divided into relatively homogeneous groups, i.e. the strata, and these are then sampled at random; never less representative than simple random sampling.
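The two techniques above can be contrasted in a short Python sketch. Everything in it is hypothetical: the sampling frame, the genre labels, and the function names are invented for illustration, and a pseudo-random generator stands in for the table of random numbers.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical sampling frame: (text id, genre) pairs, invented for illustration.
FRAME: List[Tuple[str, str]] = (
    [(f"news_{i}", "news") for i in range(80)]
    + [(f"fiction_{i}", "fiction") for i in range(15)]
    + [(f"legal_{i}", "legal") for i in range(5)]
)


def simple_random_sample(frame, n, seed=0):
    """Draw n sampling units from the whole frame, ignoring genre."""
    return random.Random(seed).sample(frame, n)


def stratified_random_sample(frame, per_stratum, seed=0):
    """Group units by genre (the strata), then sample randomly within each group."""
    rng = random.Random(seed)
    strata: Dict[str, list] = defaultdict(list)
    for unit in frame:
        strata[unit[1]].append(unit)
    sample = []
    for genre, units in strata.items():
        sample.extend(rng.sample(units, min(per_stratum[genre], len(units))))
    return sample


srs = simple_random_sample(FRAME, 10)
strat = stratified_random_sample(FRAME, {"news": 8, "fiction": 1, "legal": 1})
print(sorted({genre for _, genre in srs}))
print(sorted({genre for _, genre in strat}))
```

The stratified draw is guaranteed to contain a text from the rare "legal" stratum, whereas the simple random draw may miss it entirely, which is the sense in which rare features may go unaccounted for.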
Sampling
Sample size:
• Full texts = no balance; the peculiarities of individual texts may show through.
• Text chunks are sufficient (e.g. 2,000 running words): frequent linguistic features are stable in their distribution, and hence short text chunks are sufficient for their study (Biber 1993). Text-initial, middle and end samples must be balanced.
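A minimal sketch of how such chunks might be cut, assuming each text is already tokenised into a list of running words. The function names and the simple rotation through positions are assumptions made for illustration; a real project might also cut at sentence or paragraph boundaries.

```python
from typing import List


def extract_chunk(tokens: List[str], position: str, size: int = 2000) -> List[str]:
    """Return a continuous chunk of `size` running words from the start, middle or end."""
    if len(tokens) <= size:
        return list(tokens)  # text is shorter than the chunk: keep it whole
    if position == "initial":
        start = 0
    elif position == "middle":
        start = (len(tokens) - size) // 2
    elif position == "end":
        start = len(tokens) - size
    else:
        raise ValueError(f"unknown position: {position!r}")
    return tokens[start:start + size]


def balanced_positions(n_texts: int) -> List[str]:
    """Rotate through the three positions so each is used about equally often."""
    cycle = ["initial", "middle", "end"]
    return [cycle[i % 3] for i in range(n_texts)]


# Toy text of 10,000 numbered "words" (invented for illustration).
text = [f"w{i}" for i in range(10_000)]
for pos in balanced_positions(3):
    chunk = extract_chunk(text, pos)
    print(pos, chunk[0], chunk[-1], len(chunk))
```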
Sampling
Proportion and number of samples:
The number of samples across text categories should be proportional to their frequencies and/or weights in the target population in order for the resulting corpus to be considered as representative.
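Proportional allocation can be sketched as follows. The category weights are hypothetical percentages invented for illustration, and the largest-remainder rounding is one possible choice for making the counts add up, not a method prescribed by the slide.

```python
from typing import Dict


def allocate_samples(weights: Dict[str, float], total: int) -> Dict[str, int]:
    """Split `total` samples across categories in proportion to their weights.

    Uses the largest-remainder method so the counts always add up to `total`.
    """
    weight_sum = sum(weights.values())
    exact = {cat: total * w / weight_sum for cat, w in weights.items()}
    counts = {cat: int(share) for cat, share in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining samples to the categories with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda cat: exact[cat] - counts[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        counts[cat] += 1
    return counts


# Hypothetical weights (percentages) of each text category in the target population.
population_weights = {"news": 50, "fiction": 30, "academic": 15, "legal": 5}
print(allocate_samples(population_weights, total=500))
```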
Size
‘In order to study the behaviour of words in texts, we need to have available quite a large number of occurrences’ (Sinclair 1991: 18).
‘If you are involved in language teaching rather than lexicography, single word lists from small selective corpora can be seriously useful’ (Tribble 1997).
1960 - 1980s
• ‘three generations’ of Leech (1991)
• hundred thousand words to several hundred million (British National Corpus [BNC], Bank of English, Cambridge International Corpus [CIC], which stands at one billion words).
1990s
• value of smaller corpora and stressed their pedagogical purpose.
• ‘balanced’ and ‘representative’ picture of a specific area of the language.
Size
• Corpus size is usually represented as:
  • Overall number of words in the corpus (e.g. the BNC is a 100-million-word corpus of present-day British English)
  • Overall number of texts in the corpus (e.g. the BNC features more than 4,000 texts; 90 % of the corpus is written language and 10 % of its size goes to spoken language samples)
Hoffman et al. (2008)
Size
• Issues of size also include
  • the number of text types in text categories,
  • the number of samples within each text type and
  • the number of words per individual sample.
• E.g. if there are too few texts in a category, then one single text can influence the results (cf. the occurrence of SCUM in the Leeds United email list in the BNC).
• The number of samples from each text is also important.
• E.g. if you are researching the characteristics of academic research articles, then you would need samples from various parts of the article (Introduction, Methods, Results and Discussion), as they all feature different language patterns.
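A simple check for the single-text problem is to ask what share of a word's occurrences in a category comes from its most productive text. The sketch below is an illustration under stated assumptions: the function name, the toy texts and the idea of reporting one dominant share are invented here, and established dispersion measures exist that are more refined.

```python
from collections import Counter
from typing import Dict, List, Tuple


def dominant_text(word: str, texts: Dict[str, List[str]]) -> Tuple[str, float]:
    """Return the text contributing most occurrences of `word` and its share of all hits."""
    hits = Counter({name: tokens.count(word) for name, tokens in texts.items()})
    total = sum(hits.values())
    if total == 0:
        raise ValueError(f"{word!r} does not occur in the category")
    name, count = hits.most_common(1)[0]
    return name, count / total


# Toy category with three texts (invented for illustration).
category = {
    "mailing_list": "scum scum scum lost again to the scum".split(),
    "office_memo": "please attend the meeting".split(),
    "newsletter": "pond scum is harmless".split(),
}
print(dominant_text("scum", category))
```

A share close to 1.0 signals that the count reflects one source rather than the category as a whole.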
Is there an ideal sample size?
• Oostdijk (1991) and Kennedy (1998) – ‘A sample size of 20,000 words would yield samples that are large enough to be representative of a given variety’.
• Based on heuristics (rule of thumb) (McEnery and Hardie 2012)
• In the BNC, target sample sizes of 40,000 words were used.
Size
• Some research shows (Biber 1990) that ten texts per category (e.g. LOB Corpus) are representative enough. Still, many corpora feature more texts per category.
• The number of words per sample should provide a stable and reliable count of (grammatical or other) features in a text.
• Usually a 1,000-word sample provides a stable count of the majority of common features (Biber 1990).
• However, in lexicographic research in particular, some lexemes are so rare that much larger samples are needed.
Size
• Biber has pointed out that there is considerable variation within genres: for some genres, 20,000 words would provide an adequate sample size; for others it would not.
• E.g. in creating the Business English Corpus (BEC):
• For approximately 20,000 words, 114 faxes were collected from different sources.
• However, in the category of ‘business books’, 20,000 words would not cover even one book. For this reason, a larger sample size of 50,000 words was used for books, taking five 10,000-word samples from five different books.
Use of text chunks or extracts
• The BNC, with 40,000 word extracts, did not use full text (partially
for copyright reasons).
• They used continuous text within a whole, cutting the sample at a
logical point such as at the end of a chapter. This approach is well
suited to study of general language.
• In the BEC written section, which was concerned with the specialist
language of business, whole texts were used wherever possible.
3 main criteria for size in written corpora (Nelson, 2010)
PRAGMATIC – resources must always be weighed against projected corpus size.
HISTORICAL – older or ‘first generation’ corpora used the magic number of the one-million-word mark (Leech 1991). But later, specialized corpora were in the …
PEDAGOGICAL – smaller corpora enable easier access to the data found in them. This in turn leads to easier transferral of results to the classroom.
“the overall size of a corpus can be secondary
to the need for adequate sampling”
(Nelson, 2010)
Ethics
When contacting potential sources of texts, it is essential to ensure
both that the data you collect is treated according to the laws of
copyright and also that you observe the privacy of the authors, if the
texts come from the private domain.
Draw up a contract on the usage of the data that you receive from respondents. Once all the data have been gathered, the next step is to store them and make them easily available for retrieval.
(Nelson 2010)
Biber’s criteria for text sampling
• Channel – written, spoken, electronic

• Published or unpublished
• Institutional or non-institutional
• Demographics of writer/ speaker
• Factual or fictional
• Purpose – persuasive, informative, etc.
• Topic.
To sum up :
• The appropriate design for a corpus depends upon what part, how much and what phenomenon/a in language it is meant to represent.
• The representativeness of the corpus determines:
  • The kind of research questions to be addressed
  • Generalizability of the results of the research
• We do not know the full extent of language variability and/or variety; therefore, no corpus can be ideally representative.
• However, a certain degree of representativeness must be provided.
TASK FOR TODAY (GROUP DISCUSSION)
• Imagine that you are supposed to compile a representative corpus of Present-day English advertisements.
• What texts/pieces of language would you include in your corpus?
• How much of each text would be ‘just right’ for its representation?
• Are there any texts that you would deliberately omit from the corpus? Why?
• Think of the explicit criteria you would use.
