
Key Issues For Corpora Selection

BBI5411 AP. DR AFIDA MOHAMAMAD ALI

Authenticity and validity – corpus text must be genuine (Tognini-Bonelli, 2001; McEnery and Wilson 1996)

Representativeness

Sampling

Balance

Questions on ethics

Representative of what?

‘it is not easy to be confident that a sample of texts can be thoroughly representative of all possible genres or even of a particular genre or subject field or topic’ Any attempt at corpus creation is therefore a compromise between the hoped for and the achievable.

(Kennedy 1998: 62).

A linguistic corpus should provide material for research which allows
for impartial description of language use – corpus-based research is
not prescriptive in nature.

As a corpus user, you need to know WHAT’S in the corpus before you
start looking for and interpreting linguistic data.
An example: emails in the British National Corpus (BNC)

At the time of compilation, email was still a marginal means of communication (academic and government contexts).
However, the BNC contains 7 files totalling 214,018 words.
• Issue 1: the formality of those emails as against today’s informal style in emails.
• Issue 2: the word scum occurs 342 times in the emails, against 540 times in the whole 100-million-word corpus (the 342 are included in that number). How is that possible?
• All the emails were taken from the Leeds United mailing list, and ‘scum’ refers to Manchester United.
• Consequence: the BNC is not representative of email communication.
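The skew in the scum figures is easier to see once raw counts are normalised. The sketch below uses only the numbers quoted above; the corpus total is the rounded "100 million" (an assumption, not an exact word count), and the variable and function names are illustrative.

```python
# Figures quoted on the slide. CORPUS_WORDS is the rounded "100 million",
# so every result below is approximate.
EMAIL_WORDS = 214_018
CORPUS_WORDS = 100_000_000   # assumption: rounded total
SCUM_IN_EMAILS = 342
SCUM_IN_CORPUS = 540         # this figure includes the 342 email hits


def per_million(hits: int, words: int) -> float:
    """Normalised frequency: occurrences per million words."""
    return hits / words * 1_000_000


# How big is the email section, and how many of the hits does it hold?
email_share_of_corpus = EMAIL_WORDS / CORPUS_WORDS
email_share_of_hits = SCUM_IN_EMAILS / SCUM_IN_CORPUS

# Normalised frequency inside the emails versus the rest of the corpus.
freq_emails = per_million(SCUM_IN_EMAILS, EMAIL_WORDS)
freq_rest = per_million(SCUM_IN_CORPUS - SCUM_IN_EMAILS,
                        CORPUS_WORDS - EMAIL_WORDS)

print(f"Emails make up {email_share_of_corpus:.2%} of the corpus")
print(f"but hold {email_share_of_hits:.1%} of the hits for 'scum'")
print(f"{freq_emails:.0f} per million in emails vs {freq_rest:.2f} elsewhere")
```

A section that is a fraction of one percent of the corpus supplies well over half the hits, which is the signature of a single source dominating a text category.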
• Example 1 (McEnery & Hardie):

Suppose we want to study the language of service interactions in shops in the UK in the late 1990s.

Here the sampling frame is clear – we would only accept data into our corpus which represents service interactions in UK shops in the 1990s.

However, if we only collected data gathered in coffee shops, we would not get a balanced set
of data for that population.

Relatively context-specific lexis, such as latte and frappuccino, would be likely to occur much more frequently than it does in service interactions in general.

Phrases which are typical of other kinds of service interactions, such as Should I wrap that for
you?, might not occur at all.

A corpus is representative if

• …the findings based on its contents can be generalized to the said language variety (Leech, 1991);
• …its samples include the full range of variability in a population (Biber 1993).
Corpus Representation

• Limit the population by sampling frame – e.g. language of service interactions in shops in Malaysia.
• Demographic sampling – narrower – e.g. female writers.
• Defining the genres or mode – e.g. written, romantic, medical, legal.
• Determine sample size in length and number of texts.
Representativeness

• It changes over time (Hunston 2002): if a corpus is not regularly updated, it rapidly becomes unrepresentative.
• E.g. the Bank of English (University of Birmingham, 1980s) is a monitor (dynamic) corpus that has been continually expanded since it was created.

Representativeness
Criteria to select texts for a corpus:

External criteria
• (Biber’s situational perspective): defined situationally, e.g. genres, registers, text types, etc. What kind of texts? Number of texts?

Internal criteria
• (Biber’s linguistic perspective): defined linguistically, taking into account the distribution of linguistic features.
• CIRCULAR – because a corpus is typically designed to study linguistic distribution, so there is no point in analysing a corpus where distribution of linguistic features is predetermined (planned and fixed).
Representativeness
2 main types (for the range of text categories represented):

General corpora
• – a basis for an overall description of a language (variety); their representativeness depends on the sampling from a broad range of genres.

Specialized corpora
• – domain- or genre-specific corpora; their representativeness can be measured by the degree of closure or saturation (lexical richness).
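One way to make closure or saturation concrete is to cut the corpus into equal-sized segments and count how many new word types each segment adds. The sketch below is a minimal illustration under that assumption; the function names, the 10% tolerance and the toy token list are hypothetical and not taken from any published procedure.

```python
def new_types_per_segment(tokens, segment_size):
    """Count how many previously unseen word types each segment adds."""
    seen = set()
    growth = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        new = {t.lower() for t in segment} - seen
        growth.append(len(new))
        seen |= new
    return growth


def is_saturating(growth, tolerance=0.1):
    """Hypothetical rule of thumb: the last two segments add roughly the
    same number of new types (relative difference within `tolerance`)."""
    if len(growth) < 2:
        return False
    prev, last = growth[-2], growth[-1]
    return abs(prev - last) <= tolerance * max(prev, 1)


# Toy "specialized corpus": three 4-word segments of advertising language.
tokens = "buy now pay later buy now save more buy now pay less".split()
growth = new_types_per_segment(tokens, segment_size=4)
print(growth)
print(is_saturating(growth))
```

In a real specialized corpus the segments would be thousands of words long; a curve that keeps adding many new types with every segment suggests the domain is not yet adequately covered.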

Balance
The range of text categories included in the corpus:

• The acceptable balance is determined by the intended uses.
• A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration.
Example 1 again:

• For balance, we have to characterise the range of shops whose language we want to sample, and collect data evenly from across that range.

• We must ensure that the shops sampled are typical, and that we gather data from them in such a way as to avoid introducing skew into our dataset.

• Say we include bookshops: we must not choose just one kind of bookshop (e.g. one that sells only antiquarian books).

• We must ensure the proportions of data in our corpus reflect, in some way, the numbers of each type of interaction of interest that actually occur.

• Locations of the shops should also be balanced (covering many parts of Malaysia).

The Case of Published Materials Corpus (PMC; Nelson 2000)

• A representative sample of Business English materials was needed.
• 7 major distributors of EFL materials were contacted and asked to provide a list of their best-selling (popularity rank) Business English titles of 1996.
• Thirty-eight books were obtained for the final list.
• Further parameters included type of book, gender of author and whether the books focused on one or more of the ‘four skills’.
Balance
• There is no scientific measure for balance.
• It is more important for sample corpora (static) than for monitor corpora (dynamic).
LOB is a sample (static) corpus (1 million words)
• Corpora which seek balance and representativeness within a given sampling frame are snapshot corpora.
• LOB represents a ‘snapshot’ of the standard written form of modern British English in the early 1960s.
• For each category, samples of data were gathered, with each sample being of roughly similar length (2,000 words).
• Span of 30 years.
• Counterpart is Brown (American English, 1961).

• The range of texts and linguistic distributions are interdependent:
if the text range is not representative, the distribution will fail in
representativeness too.

DIACHRONIC STUDY FOR LOB (LEECH, 2004, Baker 2009)
The development and evolution of language through history. Historical linguistics is typically a diachronic study.

SYNCHRONIC STUDY OF LOB AND BROWN CORPORA
Investigate differences between 2 language varieties in the same period.

Sampling
• A corpus is a sample of a given population.
• A sample is representative if what we find for the sample holds for the general population.
• Samples are scaled-down versions of a larger population.
Sampling
• Sampling unit: for written text, a sampling unit could be a book, periodical or newspaper.
• Population: the assembly of all sampling units; it can be defined in terms of language production, reception (demographic, sex, age, etc.) or language as a product (category, genre of language data). It is the notional space within which language is sampled.
• Sampling frame: the list of sampling units.


Sampling
Sampling techniques:
• Simple random sampling: all sampling units within the sampling frame are numbered and the sample is chosen by use of a table of random numbers; rare features may not be accounted for.
• Stratified random sampling: the population is divided into relatively homogeneous groups, i.e. the strata, and these are then sampled at random; never less representative than the former method.
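The two techniques can be sketched as follows. This is an illustration only: Python's random module stands in for the table of random numbers, and the sampling frame, genre labels and function names are hypothetical.

```python
import random
from collections import defaultdict


def simple_random_sample(frame, k, seed=0):
    """Draw k sampling units from the whole frame, ignoring categories."""
    return random.Random(seed).sample(frame, k)


def stratified_random_sample(frame, stratum_of, k_per_stratum, seed=0):
    """Group units into strata, then sample at random inside each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):
        units = strata[name]
        sample.extend(rng.sample(units, min(k_per_stratum, len(units))))
    return sample


# Hypothetical sampling frame of (title, genre) pairs; fiction dominates.
frame = [(f"novel_{i}", "fiction") for i in range(90)]
frame += [(f"statute_{i}", "legal") for i in range(5)]
frame += [(f"casenote_{i}", "medical") for i in range(5)]

simple = simple_random_sample(frame, 6)
stratified = stratified_random_sample(frame, lambda unit: unit[1], 2)

# Genres reached by each method; the stratified draw always covers all three.
print(sorted({genre for _, genre in simple}))
print(sorted({genre for _, genre in stratified}))
```

With a frame this lopsided, a simple random draw of six units can easily miss the small legal and medical strata altogether, which is the sense in which rare features may go unaccounted for.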
Sampling
Sample size:

• Full texts = no balance; peculiarity of individual texts may show through.
• Text chunks are sufficient (e.g. 2000 running words): frequent linguistic features are stable in their distribution and hence short text chunks are sufficient for their study (Biber 1993). Text-initial, middle and end samples must be balanced.
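A minimal sketch of chunk extraction, assuming texts are already tokenised into lists of running words. The rotation through initial, middle and end positions is one simple way to keep the three balanced; the function names and toy texts are hypothetical.

```python
def extract_chunk(tokens, size=2000, position="initial"):
    """Return a continuous chunk of `size` running words from one region."""
    if len(tokens) <= size:
        return list(tokens)          # text is shorter than the chunk size
    if position == "initial":
        start = 0
    elif position == "middle":
        start = (len(tokens) - size) // 2
    elif position == "end":
        start = len(tokens) - size
    else:
        raise ValueError(f"unknown position: {position!r}")
    return tokens[start:start + size]


def balanced_positions(n_texts):
    """Rotate through initial/middle/end so each is used about equally."""
    cycle = ["initial", "middle", "end"]
    return [cycle[i % 3] for i in range(n_texts)]


# Six toy texts of 5,000 "words" each, labelled w0 ... w4999.
texts = [[f"w{i}" for i in range(5000)] for _ in range(6)]
positions = balanced_positions(len(texts))
chunks = [extract_chunk(t, 2000, p) for t, p in zip(texts, positions)]

for position, chunk in zip(positions, chunks):
    print(position, chunk[0], chunk[-1], len(chunk))
```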
Sampling
Proportion and number of samples:

• The number of samples across text categories should be proportional to their frequencies and/or weights in the target population in order for the resulting corpus to be considered as representative.
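Proportional allocation can be worked through in a few lines. The category weights below are invented for illustration and do not describe any real population; largest-remainder rounding is one possible choice for making the counts add up exactly.

```python
def allocate_samples(weights, total_samples):
    """Split total_samples across categories in proportion to their weights,
    using largest-remainder rounding so the counts add up exactly."""
    total_weight = sum(weights.values())
    exact = {c: total_samples * w / total_weight for c, w in weights.items()}
    counts = {c: int(x) for c, x in exact.items()}   # round down first
    leftover = total_samples - sum(counts.values())
    # Hand the remaining samples to the categories with the largest remainders.
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for category in by_remainder[:leftover]:
        counts[category] += 1
    return counts


# Hypothetical weights: each category's share of the target population.
weights = {"newspapers": 50, "fiction": 30, "academic": 15, "letters": 5}
plan = allocate_samples(weights, 40)
print(plan)
```

If each sample is a chunk of fixed length, the same proportions carry over to the number of words per category.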
Size

‘In order to study the behaviour of words in texts, we need to have available quite a large number of occurrences’ (Sinclair 1991: 18).

‘If you are involved in language teaching rather than lexicography, single word lists from small selective corpora can be seriously useful’ (Tribble 1997).

1960 – 1980s
• ‘three generations’ of corpora (Leech 1991)
• hundred thousand words to several hundred million (British National Corpus [BNC], Bank of English, Cambridge International Corpus [CIC], which stands at one billion words).

1990s
• value of smaller corpora recognised, and their pedagogical purpose stressed.
• ‘balanced’ and ‘representative’ picture of a specific area of the language.
Size
• Corpus size is usually represented as:
  • Overall number of words in the corpus (e.g. the BNC is a 100-million-word corpus of present-day British English)
  • Overall number of texts in the corpus (e.g. the BNC features more than 4,000 texts; 90% of the corpus is written language and 10% of its size goes to the spoken-language sample)

Hoffman et al. (2008)

Size
• Issues of size also include
  • the number of text types in text categories,
  • the number of samples within each text type and
  • the number of words per individual sample.
• E.g. if there are too few texts in a category, then one single text can influence the results (cf. the occurrence of SCUM in the Leeds United email list in the BNC).
• The number of samples from each text is also important.
• E.g. if you are researching the characteristics of academic research articles, then you would need samples from various parts of the article (Introduction, Methods, Results and Discussion), as they all feature different language patterns.

Is there an ideal sample size?

• Oostdijk (1991) and Kennedy (1998) – ‘A sample size of 20,000 words would yield samples that are large enough to be representative of a given variety’.
• Based on heuristics (rule of thumb) (McEnery and Hardie 2012).
• In the BNC, target sample sizes of 40,000 words have been used.

Size
• Some research shows (Biber 1990) that ten texts per category (e.g. LOB Corpus) are representative enough. Still, many corpora feature more texts per category.
• The number of words per sample should provide a stable and reliable count of (grammatical or other) features in a text.
• Usually a 1,000-word sample provides a stable count of the majority of usual features (Biber 1990).
• However, in lexicographic research in particular, some lexemes are so rare that much larger samples are needed.

Size
• Biber has pointed out that there is considerable variation within genre, in that for some genres 20,000 words would provide an adequate sample size; for others it would not.

• E.g. in creating the Business English Corpus (BEC):

• For approximately 20,000 words, 114 faxes were collected from different sources.

• However, in the category of ‘business books’, 20,000 words would not cover even one book. For this reason, a larger sample size of 50,000 words was used for books, taking five 10,000-word samples from five different books.

Use of text chunks or extracts

• The BNC, with 40,000-word extracts, did not use full texts (partially for copyright reasons).
• The compilers used continuous text from within a whole text, cutting the sample at a logical point such as the end of a chapter. This approach is well suited to the study of general language.

• In the BEC written section, which was concerned with the specialist
language of business, whole texts were used wherever possible.

3 main criteria for size in written corpora (Nelson, 2010)

• PRAGMATIC – resources must always be weighed against projected corpus size.
• HISTORICAL – Older or ‘first generation’ corpora used the magic number of the one-million-word mark (Leech 1991). But later, specialized corpora were in the …
• PEDAGOGICAL – Smaller corpora enable easier access to the data found in them. This in turn leads to easier transferral of results to the classroom.

“the overall size of a corpus can be secondary
to the need for adequate sampling”

(Nelson, 2010)

Ethics
When contacting potential sources of texts, it is essential to ensure
both that the data you collect is treated according to the laws of
copyright and also that you observe the privacy of the authors, if the
texts come from the private domain.

Draw up a contract on the usage of the data that you receive from respondents. Once all the data have been gathered, the next step is to store them and make them easily available for retrieval. (Nelson 2010)

Biber’s criteria for text sampling

• Channel – written, spoken, electronic
• Published or unpublished
• Institutional or non-institutional
• Demographics of writer/ speaker
• Factual or fictional
• Purpose – persuasive, informative, etc.
• Topic.
To sum up:

• The appropriate design for a corpus depends upon what part, how much and what phenomenon/a in language it is meant to represent.
• The representativeness of the corpus determines:
  • The kind of research questions to be addressed
  • Generalizability of the results of the research
• We do not know the full extent of language variability and/or variety; therefore, no corpus can be ideally representative.
• However, a certain degree of representativeness must be provided.

TASK FOR TODAY (GROUP DISCUSSION)

• Imagine that you are supposed to compile a representative corpus of present-day English advertisements.
• What texts/pieces of language would you include in your corpus?
• How much of each text would be ‘just right’ for its representation?
• Are there any texts that you would deliberately omit from the corpus? Why?
• Think of the explicit criteria you would use.
