
Corpus Linguistics:
Method, Analysis, Interpretation

What is corpus linguistics? Which software packages are available, what can
they do?

The corpus approach harnesses the power of computers, allowing analysts to
produce machine-aided analyses of large bodies of language data - so-called
corpora. Computers allow us to do this on a scale and with a depth that would
typically defy analysis by hand and eye alone. In doing so, we gain
unprecedented insights into the use and manipulation of language in society.
What is Corpus Linguistics?
Corpus linguistics, broadly, is a collection of methods for studying language. It begins with
collecting a large set of language data – a corpus - which is made usable by computers.
Corpora (the plural of corpus) are usually so large that it would be impossible to analyze
them by hand, so software packages (often called concordancers) are used in order to
study them. It is also important that a corpus is built from data well matched
to the research question it is designed to investigate. To investigate language
use in an academic context, for
example, it would be appropriate for one to collect data from academic contexts such as
academic journals or lectures. Collecting data from the sports pages of a tabloid
newspaper would make much less sense.

Software:
A number of software packages are available with varying functionalities and price tags.
Some pieces of software can be downloaded and used for free; others cost money
or are available only online but have built-in reference corpora. This table
gives an idea of the variety of software currently available:
Glossary
Use this glossary as a handy reference when you come across any terminology
on the course that you do not understand.

Produced by: The ESRC Centre for Corpus Approaches to Social Science (CASS),
Lancaster University, UK

Annotation
Codes used within a corpus that add information about, for example,
grammatical category. Also refers to the process of adding such information to a corpus.

Balance
A property of a corpus (or, more precisely, of a sampling frame). A corpus is
said to be balanced if the relative sizes of each of its subsections have been
chosen with the aim of adequately representing the range of language that
exists in the population of texts being sampled (see also: sample).

Colligation
Most generally, colligation is co-occurrence between grammatical categories
(e.g. verbs colligate with adverbs), but it can also mean a co-occurrence
relationship between a word and a grammatical category.

Collocation
A co-occurrence relationship between words or phrases; words are said to collocate with
one another if one is more likely to occur in the presence of the other than elsewhere.
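To make the idea concrete, a simple co-occurrence count could be sketched as
follows. This is an illustration only, not code from any concordancer; the
window size and the whitespace tokenisation are assumptions.

```python
from collections import Counter

def collocates(tokens, node, window=3):
    """Count words occurring within `window` positions of each hit of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat and the cat slept".split()
print(collocates(tokens, "cat", window=2))
```

Real collocation tools go further, using statistical measures to decide which
co-occurrences are stronger than chance.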

Comparability
Two corpora or sub-corpora are said to be comparable if their sampling frames are similar
or identical.

Concordance
A display of every instance of a specified word or other search term in a corpus, together
with a given amount of preceding and following context for each result or ‘hit’.
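A minimal concordance display could be sketched like this. It is an
illustrative Python fragment only; real concordancers offer far richer
display and sorting options, and the context size here is an assumption.

```python
def concordance(tokens, node, context=4):
    """Return simple KWIC lines: left context | node | right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left:>25} | {node} | {right}")
    return lines

tokens = "i came i saw i concordanced".split()
for line in concordance(tokens, "i", context=2):
    print(line)
```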
Concordancer
A computer program that can produce a concordance from a specified text or corpus;
modern concordance software can also facilitate more advanced analyses.
Corpus
From the Latin for ‘body’ (plural corpora), a corpus is a body of language representative
of a particular variety of language or genre which is collected and stored in electronic
form for analysis using concordance software.

Corpus construction
The process of designing a corpus, collecting texts, encoding the corpus, assembling and
storing the metadata, marking up (see markup) the texts where necessary and possibly
adding linguistic annotation.

Corpus-based
Where corpora are used to test preformed hypotheses or exemplify existing linguistic
theories. Can mean either:
(a) Any approach to language that uses corpus data and methods.
(b) An approach to linguistics that uses corpus methods but does not subscribe to corpus-
driven principles.

Corpus-driven
An inductive process where corpora are investigated from the bottom up and patterns
found therein are used to explain linguistic regularities and exceptions of the language
variety/genre exemplified by those corpora.

Diachronic
Diachronic corpora sample (see sampling frame) texts across a span of time or from
different periods in time in order to study the changes in the use of language over time.
Compare: synchronic.

Encoding
The process of representing the structure of a text using markup language and
annotation

Frequency list
A list of all the items of a given type in a corpus (e.g. all words, all nouns, all four-word
sequences) together with a count of how often each occurs.
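For example, a word frequency list can be sketched with Python's standard
library. The whitespace tokenisation and the toy text are assumed
simplifications; a concordancer's wordlist tool handles these details for you.

```python
from collections import Counter

# A toy 'corpus'; whitespace tokenisation is a simplification
text = "the cat sat on the mat the cat slept"
freq = Counter(text.split())

# Print the list in descending frequency order, as a wordlist tool would
for word, count in freq.most_common():
    print(f"{word}\t{count}")
```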
Key word in context (KWIC)
A way of displaying a node word or search term in relation to its context within a text;
this usually means the node is displayed centrally in a table with co-text displayed in
columns to its left and right. Here, ‘key word’ means ‘search term’ and is
distinct from keyword in its statistical sense.

Keyword
A word that is more frequent in a text or corpus under study than it is in some (larger)
reference corpus. For a word to count as a keyword, the difference in its
frequency between the corpus under study and the reference corpus must be
statistically significant (see: statistical significance).

Lemma
A group of words related to the same base word differing only by inflection. For example,
walked, walking, and walks are all part of the verb lemma WALK.

Lemmatisation
A form of annotation where every token is labelled to indicate its lemma.

Lexis
The words and other meaningful units (such as idioms) in a language; the lexis or
vocabulary of a language is usually viewed as being stored in a kind of mental dictionary,
the lexicon.

Markup
Codes inserted into a corpus file to indicate features of the original text other than the
actual words of the text. In a written text, for example, markup might include paragraph
breaks, omitted pictures, and other aspects of layout.

Markup language
A system or standard for incorporating markup (and, sometimes, annotation and
metadata) into a file of machine-readable text; the standard markup language today is
XML.

Metadata
The texts that make up a corpus are the data. Metadata is data about that data - it gives
information about things such as the author, publication date, and title for a written text.

Monitor corpus
A corpus that grows continually, with new texts being added over time so that the dataset
continues to represent the most recent state of the language as well as earlier periods.
Node
In the study of collocation - and when looking at a key word in context (KWIC) - the node
word is the word whose co-occurrence patterns are being studied.

Reference corpus
A corpus which, rather than being representative of a particular language variety,
attempts to represent the general nature of a language by using a sampling frame
emphasising representativeness.

Representativeness
A representative corpus is one sampled (see, sample) in such a way that it contains all the
types of text, in the correct proportions, that are needed to make the contents of the
corpus an accurate reflection of the whole of the language or variety of language that it
samples (also see: balance).

Sample
A single text, or extract of a text, collected for the purpose of adding it to a corpus. The
word sample may also be used in its statistical sense by corpus linguists. In this latter
sense, it means groups of cases taken from a population that will, hopefully, represent
that population such that findings from the sample can be generalised to the population.
In yet another sense, a corpus is itself a sample of language.

Sample corpus
A corpus that aims for balance and representativeness within a specified sampling frame.

Sampling frame
A definition, or set of instructions, for the samples (see: sample) to be included in a
corpus. A sampling frame specifies how samples are to be chosen from the population of
text, what types of texts are to be chosen, the time they come from and other such
features. The number and length of the samples may also be specified.

Significance test
A mathematical procedure to determine the statistical significance of a result.

Statistical significance
A quantitative result is considered statistically significant if there is a low probability
(usually lower than 5%) that the figures extracted from the data are simply the result of
chance. A variety of statistical procedures can be used to test statistical significance.
Synchronic
Relating to the study of language or languages as they exist at a particular moment in
time, without reference to how they might change over time (compare: diachronic). A
synchronic corpus contains texts drawn from a single period - typically the present or very
recent past.
Tagging
An informal term for annotation, especially forms of annotation that assign an analysis to
every word in a corpus (such as part-of-speech or semantic tagging).

Text
As a count noun: a text is any artefact containing language usage - typically a written
document or a recorded and/or transcribed spoken text. As a non-count noun: collected
discourse, on any scale.

Token
Any single, particular instance of an individual word in a text or corpus.
Compare: lemma, type.

Type
(a) A single particular wordform. Any difference of form (e.g. spelling) makes a word a
different type. All tokens consisting of the same characters are considered to be examples
of the same type.
(b) Can also be used when discussing text types.

Type-token ratio
A measure of vocabulary diversity in a corpus, equal to the total number of types divided
by the total number of tokens; the closer the ratio is to 1 (or 100%), the more varied the
vocabulary is. This statistic is not comparable between corpora of different sizes.
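The calculation itself is straightforward. A sketch, assuming the tokens have
already been extracted from the corpus:

```python
def type_token_ratio(tokens):
    """Distinct wordforms (types) divided by total running words (tokens)."""
    return len(set(tokens)) / len(tokens)

tokens = "i came i saw i concordanced".split()
print(type_token_ratio(tokens))  # 4 types / 6 tokens ≈ 0.667
```

The caveat in the definition matters: because common words repeat more and
more as a corpus grows, this raw ratio falls with corpus size, which is why it
should not be compared across corpora of different sizes.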

XML
A markup language which is the contemporary standard for use in corpora as well as for
a range of data-transmission purposes on the Internet. In XML, tags are indicated by
<angle> <brackets>.
Part One:
An Introduction to
Corpus Linguistics
 Introduction to this part's activities
 Warm up activity
 Part 1: why use a corpus?
 Part 2: annotation and mark-up
 Part 3: types of corpora
 Part 4: Frequency Data, Concordances and Collocation
 Part 5: Corpora and Language Teaching
 Test your Knowledge (Quiz)
 Why do I need special software?
 Brown and LOB
 Downloads
 Introduction to AntConc
 AntConc - concordancing
 AntConc - using advanced search to explore the Brown corpus
 AntConc - creating and using a wordlist
 Practical activity - a question
 Further Reading
 Discussion question for Part 1
Introduction to this part’s activities
In this part, we begin by looking at the background to corpus linguistics –
the types of things you can do using a corpus and some of the technical
details of how corpora are built.

In the ‘how to’ section of this part, we introduce you to the concordance
package available free with this course – AntConc, authored by Laurence
Anthony of Waseda University.

Take notes as you go and use the ‘pop quiz’ to test your comprehension.
Undertake the readings for the part and contribute to the discussion.
Warm up activity
A quick activity to get started

Think of something you would like to find out about language. As you attend
the lecture, reflect back on your own interests – what types of corpora might
help you and what type of design issues would you have to consider if you
were to put together your own corpus to investigate language as you would
wish?
Part 1: why use a corpus?
The lecturer gives a brief review of why you might want to use a corpus and
decisions to make when building a corpus.

Please see:
Week 1 Lectures (part 1)
Week 1 Slides (Part 1)
Week 1 Videos (Part 1)
Part 2: annotation and mark-up
The Lecturer gives a brief overview of how corpus texts may be enriched
with additional information to ease analysis.

Note that this type of additional information may be called ‘mark up’,
‘annotation’, or ‘tagging’. All three terms are near synonyms. Annotation
usually refers to linguistic information encoded in a corpus - however, the
encoding is achieved using a mark-up language. Similarly, the annotation
itself is usually undertaken by inserting so-called tags - short codes that
indicate some linguistic feature - into a text. Hence, while the terms can be
separated, they can also be used interchangeably!

One final note - the slash in an XML closing tag is a forward slash (as in
</s>), not a backslash.

Please see:
Week 1 Lectures (part 2)
Week 1 Slides (Part 2)
Week 1 Videos (Part 2)
Part 3: Types of Corpora
The Lecturer looks at a range of different types of corpora.

Please see:
Week 1 Lectures (part 3)
Week 1 Slides (Part 3)
Week 1 Videos (Part 3)

Part 4: Frequency Data,
Concordances and Collocation
The Lecturer explores the value of frequency data in corpus linguistics and
takes a first look at a key concept in corpus linguistics - collocation.

This lecture mentions the idea of normalised frequencies per million. What
are these? Imagine you have two corpora, one of two million words and
another of three million words. You look in each for the word ‘dalek’ and find
20 examples in the first and 30 examples in the second. That does not mean
that the word is more frequent in the second corpus - remember it is bigger.

One way of making this issue apparent, and making the numbers more
comparable, is to normalise the frequencies. To normalise per million, you
are in essence asking: ‘if my corpus were only one million words, how many
examples would I expect to find?’

Our first corpus is two million words - so to normalise the frequency of ‘dalek’
to one million words, we would divide by two, giving us 20/2=10.

The second corpus is three million words, so to normalise per million we
divide the result from the second corpus by three, giving 30/3=10. This shows
clearly that we have no reason to claim that the word ‘dalek’ is more frequent
in one corpus than in the other.
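The arithmetic above amounts to a simple formula - raw count divided by corpus
size, scaled up to one million. This is a sketch of that general formula, not
code from the course:

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count / corpus_size * 1_000_000

# The 'dalek' figures from the example above
print(per_million(20, 2_000_000))  # 10.0
print(per_million(30, 3_000_000))  # 10.0
```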

Please see:
Week 1 Lectures (part 4)
Week 1 Slides (Part 4)
Week 1 Videos (Part 4)
Part 5: Corpora and Language
Teaching
The Lecturer takes a brief look at a major application area for corpus
linguistics - language teaching.

The video concludes by considering some of the limitations of corpus
linguistics.

After the video, don’t forget to update your journal! Keep a record of what
you are learning. You will find it really helps as the course proceeds if you
keep clear, structured notes of what you have learnt.
Test your Knowledge (Quiz)
What is a corpus?

A theory of language
A collection of texts stored on a computer
An electronic database similar to a dictionary
Any large collection of words such as a collection of books, newspapers or
magazines

What is the main reason for using corpora?

Other methods of language analysis are not reliable
Computers can confirm our intuitions about language
Computers can help us discover interesting patterns in language which would be
difficult to spot otherwise
With corpora we can answer all research questions about language

What is corpus annotation?

Adding an extra layer of information to the text to allow for more sophisticated
searches
Separating text into sentences
Manual coding of text for parts of speech
Adding critical comments to a text

What is a specialised corpus?

A corpus that is used for historical language investigations
A corpus that is composed of a large variety of genres
A corpus that is used by language specialists
A corpus that focuses on e.g. one type of genre, one period, one place etc
Which of these is NOT a type of corpus?

Multilingual corpus
Learner corpus
Diachronic corpus
Observer corpus

What is the BNC?

A large general corpus of British English
A corpus of different genres of English writing
A large spoken corpus of British English
A specialised corpus representing the language of newspapers

Which of these statements is NOT true about a monitor corpus?

It is frequently updated
The Bank of English is an example of a monitor corpus
The BNC is an example of a monitor corpus
It is used to monitor rapid change in language

What is a concordance?

Information about word frequencies normalised per million words
Listing of examples of a word searched in a corpus with some context on the right
and some context on the left
An alphabetical list of words that appear in a text
A list of words and their frequencies that can be used for identifying important
words in a text

What is collocation?
The tendency of speakers to talk over each other
The tendency of words to co-occur with one another
The tendency of words to appear in unique, different contexts each time
The tendency of sentences to create meaning

What is a frequency distribution in a corpus?

Information about how frequent a word is in a corpus
Information about the frequency of use of a term across a number of different
texts, corpus sections, speakers etc
Information about how frequent a word is per million words
Sociolinguistic information about the gender of the speakers that are represented
in a corpus

Why do I need special software?


Some of the things you can do with a program like AntConc will be familiar
to you from word processing. For example, you can search for a word in a
word processor and see the context around each use of that word. So why
bother with corpus browsing software?

As you will discover, software like AntConc allows you to do so much more
than a word processor does. Even for something as simple as searching for a
word, it presents the results in a format that is more suitable for those
interested in studying language; the standard concordance view of one
example per line with left and right context allows you to rapidly browse data
looking for patterns of usage.

Yet beyond this the software allows you to do a number of things that no
word processor does, such as undertaking keyword analyses and looking for
collocations. By the time you have finished learning to use AntConc, you will
have developed a full appreciation of the need to use such software to study
language in use.
Brown and LOB
These corpora are sometimes referred to as ‘snapshot’ corpora - their
design is such that they try to represent a broad range of genres of
published, professionally authored, English. Their goal is to capture the
language at one moment in time, hence the term ‘snapshot’.

Of course, as with any snapshot there are things you see and things you do
not see. So, in this case, we are looking at professionally authored written
English - not speech and not writing of a more informal variety. We are also
only looking at certain genres. As with any snapshot, it was taken at a certain
point of time in a certain place - Brown is America in the early 1960s, LOB is
the UK in the early 1960s. Such corpora are often used to compare and
contrast varieties of a language - in this case two varieties of English. They
can also be looked at on their own to explore either variety of English in its
own right.

The Brown corpus is so named because it was developed at Brown University
in the US. LOB is an acronym standing for Lancaster-Oslo-Bergen, the three
universities that collaborated to build that corpus.

Back to the snapshot metaphor! The two corpora can be compared because
they are composed in the same way - the subject is the same, if you like. They
look at broadly the same genres. Those genres are represented by chunks of data
that are similar in size and number. Also, of course, the data was gathered
in roughly the same time period.

The genres covered in the two corpora are outlined below. Note the letter
code for each genre - that is important, as it shows you which genre is
associated with which file in the corpus. Following the letter code is a
description of the type of data in the category, followed by two numbers in
parentheses - the first is the number of chunks of data in that category in
Brown, the second is the number of chunks of data in that category in LOB.
There are five hundred chunks of data in each corpus. Each chunk is
approximately 2,000 words in size, giving a rough overall corpus size of
1,000,000 words each.
A Press: reportage (44, 44)

B Press: editorial (27, 27)

C Press: reviews (17, 17)

D Religion (17, 17)

E Skills, trades and hobbies (36, 38)

F Popular lore (48, 44)

G Belles lettres, biography, essays (75, 77)

H Miscellaneous (documents, reports, etc.) (30, 30)

J Learned and scientific writings (80, 80)

K General fiction (29, 29)

L Mystery and detective fiction (24, 24)

M Science fiction (6, 6)

N Adventure and western fiction (29, 29)

P Romance and love story (29, 29)

R Humour (9, 9)

Downloads
(The instructor will provide students with the different software
packages and corpora)
Instructions on how to download AntConc and the Brown and LOB corpora
for analysis

How to download AntConc


The latest versions (3.4.3w, 3.4.3m, 3.4.3u) of AntConc are available for
download from Laurence Anthony’s website
at: http://www.antlab.sci.waseda.ac.jp/software.html

Choose the version you want to run (i.e. for Windows, Mac or Linux) and
click the link for version 3.4.3

If you are using a Windows computer, you will download a single executable
(.exe) file. Put this on your desktop or in some other area that is easy for you
to access. Double click to start.

If you are using a Linux computer, you will download a tar.gz folder that you
need to decompress first. Inside the folder, you will find the AntConc
executable file, an icon, and a simple setup guide. Set the permissions of the
executable file and double click to start.

If you are using a Macintosh computer, you will download a zip file that you
need to unzip first. Put the unzipped AntConc application on your desktop or
in some other area which is easy for you to access. Double click to start. (At
this point, you may get one or two security warnings. AntConc is completely
virus free, so you can ignore these warnings or, if necessary, disable them via
the System Preferences.)

How to download Brown and LOB corpora


Important Note
The Brown and LOB corpora are made available only to learners of the
FutureLearn course Corpus Linguistics: Method, Analysis and Interpretation.
They should not be re-distributed or re-published. The LOB corpus is made
available to you by ICAME.

Click this link to download a zip file containing the two corpora.

To use the corpora, first, unzip the file (see below), and then drag the two
folders inside (“brown_corpus_untagged” and “lob_corpus_untagged”) to
a convenient place on your computer. We suggest you place them in a new
folder called “corpora”. You can then delete the original zip file if you want.

If you are using a Windows computer, you can unzip the file by right clicking
on the file name and selecting “Extract All”. The unzipped file will open in a
new window where you can see the two corpora.

If you are using a Macintosh computer, you can unzip the file by simply
double-clicking on it. You can then open the unzipped file and see the two
corpora inside.

If you are using a Linux computer, unzip the file using your preferred zip
program. On most systems you can simply double click the file and then
move the two corpora inside to a convenient place.

If you are experiencing problems downloading or have other technical issues,
please post a question on this page. If anyone has resolved issues, please feel
free to post your solutions.
Introduction to AntConc
Part one of an introduction to the AntConc program. In this video Laurence
Anthony tells you how to download and install AntConc, how to load
corpus files into the program and introduces some of the first steps you can
take in analysing corpus data.

This includes showing you how to build a wordlist from a corpus. As part of
this, you will hear the terms type and token. A token is any single running
word in the corpus. A type is a unique wordform, so the number of types is the
number of distinct wordforms present in a corpus.

Imagine your corpus is the sentence “I came, I saw, I concordanced”. This
sentence contains six running words - hence there are six tokens in the
sentence. However, there are only four unique wordforms - the word ‘I’
occurs three times. So the types in the corpus are ‘I’, ‘came’, ‘saw’ and
‘concordanced’. Thus the sentence has six tokens and four types.

Note that we can, of course, quibble about the definition of a word! Consider
the word ‘gonna’ - some may argue this is two words, others one.
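The counting described above can be sketched in Python. The
punctuation-stripping, lowercasing tokenisation here is one assumption among
many possible - as the ‘gonna’ example shows, tokenisation choices are not
neutral.

```python
sentence = "I came, I saw, I concordanced"
# Strip punctuation and lowercase before splitting; a real concordancer's
# tokeniser may make different choices (e.g. for forms like 'gonna')
tokens = [w.strip(",.").lower() for w in sentence.split()]
types = set(tokens)
print(len(tokens), len(types))  # 6 4
```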

Please see:
AntConc Videos (1)
AntConc Transcript (1)

AntConc - concordancing
Laurence Anthony looks at some of the basic features of the AntConc
concordance tool.

Topics covered include how to load a corpus, how to search for words in a
corpus, how to order the results of a search and how to search for parts of
words.

Please see:
AntConc Videos (2)
AntConc Transcript (2)
AntConc - using advanced search to explore
the Brown corpus
Laurence Anthony looks at some of the advanced features of the AntConc
program.

Laurence works with a subset of the Brown corpus, demonstrating the
functions of the concordance window in AntConc, including the use of the
advanced search box.

Please see:
AntConc Videos (3)
AntConc Transcript (3)
AntConc - creating and using a wordlist
Laurence Anthony shows you how to build a frequency wordlist from a
corpus.

In addition, he covers some related issues such as sorting the list and
searching it.

Download the lemma list (Ask the instructor)

Please see:
AntConc Videos (4)
AntConc Transcript (4)
Practical activity - a question
Take the LOB corpus and build a word list. Look at the top thirty words. How
would you characterise these words? Do the same with the Brown corpus. Is
it similar? Are there any differences between LOB and Brown? Feel free to
concordance the words to inform your analysis.

If you have the time, do the same with the subsections of LOB and Brown.
Might wordlists help to determine genre?
Further Reading:
Our readings this week come to us courtesy of Edinburgh University Press
and Routledge

Our first reading is taken from: (Week 1 PDF 1)

McEnery, T. and Wilson, A. (2001) Corpus Linguistics, Edinburgh University
Press, Edinburgh.

It is chapter one of this book. It will help you broaden your understanding of
the background to corpus linguistics and will place in historical context the
move away from, and return to, corpus data in linguistics.

The second reading is chapter one of: (Week 1 PDF 2)

Garside, R., Leech, G. and McEnery, T. (1997) Corpus Annotation, Longman,
Harlow.

This book will be of great assistance to you throughout this course. Each time
you hear or see a type of annotation discussed, you should be able to use
this book as a useful reference guide to find out what that type of annotation
is and how it is undertaken. While published in 1997, this book is still a good
reference guide. For this week, read chapter 1 of the book - Leech’s outline
of the principles of corpus annotation is as relevant today as it was when
it was written.
Discussion question for Part 1
When you have completed the lecture and the associated readings, consider
and discuss the following statement:

“Noam Chomsky is one of the most influential figures in corpus linguistics. His
ideas have shaped corpus linguistics while also, paradoxically, seeking to
deny its value”.

Given what you have read, discuss this somewhat deliberately provocative
statement!

Reflect back on the warm up activity and your readings this week. Think
about what you would like to use corpora for and consider the types of
corpora you would need to use.

Discuss the design aspects of your proposed work. For example, what type
of corpus would you have to use? How large do you think it would have to
be? Would annotation help you and if so what sort?

Discuss these and any other questions related to your proposed use of
corpus data.
