Introduction to Corpora and Corpus Linguistics: General Introduction

Introduction to Corpora and Corpus Linguistics

COGS 523 - Lecture 1
General Introduction

10.11.23 COGS 523 - Bilge Say 1


Related Readings
Course Pack:
 Meyer (2002). Corpus Analysis and Linguistic Theory. Ch 1
 Abney (1996) Statistical Methods and Linguistics
Extra Material: (Entirely optional; part of the presentation draws
on this material)
 McEnery and Wilson (2001) Ch1
 McEnery et al. (2006) A1 and B2
 Tognini-Bonelli (2001). Corpus Linguistics at Work. Ch 3
 Corpora Discussion List Archives: Corpora: Chomsky/Harris
Discussion, April 2001
 Borsley & Ingham vs Stubbs discussion, Lingua 112 (2002)
 Schönefeld (1999) Corpus Linguistics and Cognitivism,
International Journal of Corpus Linguistics 4(1)

What is a Corpus?
Turkish: Derlem (alt. Bütünce)

[Diagram labels: Language; Written/Spoken; Design Criteria; Text/Speech/Video + Annotation; Digital media]

Questions of the Week
 Is working with corpora a methodology
within linguistics or a distinctive subfield
(corpus linguistics)?
 What potential is there for empirical
analysis of corpora to contribute to
linguistic theory?
 What are the dangers involved in corpus-
based linguistics? How can these dangers
be reduced?

What is a Corpus, again?
 A body of written text or transcribed
speech which can serve as a basis
for linguistic analysis or description,
designed or required for a particular
“representative” function.
 An electronic collection of texts in a
uniform representation
 Corpus vs text archive vs database

Sinclair’s definition
 A corpus is a collection of pieces of
language that are selected and
ordered according to explicit
linguistic criteria in order to be used
as a sample of the language.

Should a Corpus Necessarily Be
 Large?
 Authentic?
 Compiled for linguistic analysis?
 Saturated in terms of lexical growth?
 Representative?
 Machine readable?
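
One way to probe the "saturated in terms of lexical growth" criterion is to track how many distinct word types have been seen as more tokens are read; a curve that flattens suggests the vocabulary is approaching saturation. The following is only a minimal sketch, assuming whitespace tokenization and case-folding. The function name `vocabulary_growth` and the toy sentence are hypothetical and do not come from the course material.

```python
def vocabulary_growth(tokens, step=1000):
    """Return (tokens_seen, types_seen) pairs, sampled every `step` tokens
    and once more at the very end of the token list."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())  # case-folding is an assumption of this sketch
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


# Toy data (hypothetical); a real corpus would have millions of tokens.
toy = "the cat sat on the mat and the dog sat on the log".split()
curve = vocabulary_growth(toy, step=4)
for n_tokens, n_types in curve:
    print(n_tokens, n_types)
```

On a real corpus one would plot types against tokens and look at how quickly the slope drops; the step size and the tokenization scheme both affect the shape of the curve.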

A History of Corpora
 Pre-computer era (before the 1960s)
 Transition era (1960s to the early 1990s)
 Maturation era (1990s onwards)
 What did technology bring?
 Increased accuracy, speed, accountability, and replicability, plus large volumes of better-annotated data.

[Diagram: Linguistics and related fields
 Levels of analysis: Phonology, Morphology, Lexicon, Syntax, Semantics, Discourse, Pragmatics
 Methods: Introspection, Experimental Methods, Formal Linguistic Analysis, Computational Modeling, Corpus-Based Methods?
 Fields: Linguistics, Computational Linguistics, Psycholinguistics, Sociolinguistics, Historical Linguistics, Applied Linguistics
 Open question: Corpus Linguistics?]

Corpus Linguistics
 The term emerged in the 1980s, although the use of corpora has a
long history.
 Modern perspectives contain a
number of opposing positions.

Linguistic Subdisciplines
with a tradition for corpora
 Historical Linguistics
 Phonetics
 Language Acquisition
 Statistical Natural Language
Processing/Language
Engineering/Computational
Linguistics

Corpus Linguistics: a Methodology,
Theory, or Subfield of Linguistics?
 Rationalism vs Empiricism
 Formalists vs Functionalists
 Competence vs Performance
 Core vs Periphery
 Applied Linguistics vs Theoretical
Linguistics
 Corpus-Based vs Corpus-Driven
Approaches (Tognini-Bonelli)

False Assumptions
 That all corpus linguists are descriptivists, interested only in counting and categorizing occurrences in a corpus, and that all generative grammarians are theoreticians unconcerned with the data on which their theories are based.
 That the complexity of linguistic structure is of no interest to corpus linguists.
(Meyer, 2002)

Evaluating Linguistic
Theories
 Observational vs explanatory vs
descriptive adequacy
 Falsifiability, Completeness,
Simplicity, Objectivity, etc.

Chomskyan quotes:
 “The corpus could never be a useful tool for the linguist,
as the linguist must seek to model language”
 “Corpus Linguistics does not exist.”
 “Any natural corpus will be skewed and incomplete. Some
sentences won’t occur, because they are obvious, others
because they are false, still others because they are
impolite. The corpus, if natural, will be so wildly skewed
that the description would be no more than a mere list.”
Indeed, Chomsky contributed to the modern view of corpus
linguistics, both by improving language technology and by
helping to overcome the structuralist-behaviourist view of language
as something that could be enumerated, by way of formal
language theory.

Why Do Statistics Help?
(Abney)
 Language Acquisition
 Language Change
 Language Variation
 Grammaticality - Ambiguity - Computation
 Modularity is not in isolation

Grammaticality Judgements
*He shines Tony books.
He gives Tony books.
If intuitions will do, why bother with corpus analysis?
 Artificial data is artificial and creates another kind of skew.
 “Yes, I could say that, but I never would”: gradedness in grammaticality judgements
 Intuitions are perceptions...

Alternative Views
 Leech (1992)
 “Computer Corpus Linguistics” is a new research enterprise, a new philosophical approach that
• Concentrates on linguistic performance
• Leads to a more empirical view of scientific inquiry
• Exploits qualitative as well as quantitative methodology to produce quantitatively oriented language models, such as Bayesian language models.
 Not everyone agrees!

Further Remarks
 Corpus Linguistics contributed to
blurring the distinction between
grammar and lexicon.
 Sinclair’s open choice vs idiom principle
 Cognitive linguists can accommodate data and facts revealed by corpus-linguistic analysis.

Corpus Linguistics vs
Corpus-Based Linguistics
There is no inherent incompatibility between
theoretical generative linguistics and corpus
linguistics (Seegmiller)
Generative and corpus linguistics are two
approaches to the same problem, and must
meet somewhere. Generative theories should
match or be backed up by real data. (Schiffrin)
What is possible and what is probable? Corpus
linguistics offers a way of describing things that
we *do* regularly and frequently with greater
confidence and reliability than by using
introspection alone. (Krishnamurthy)

Corpus-Based Linguistics vs
Corpus-Driven Linguistics
Corpus-based:
 Take existing theory as a starting point and correct and revise the theory in light of corpus evidence.
Corpus-driven:
 Favour very large, full-text corpora, with the idea of cumulative representativeness and no annotation, to be able to free oneself of preconceived theories.
 e.g. collocations rather than colligations
 Without a corpus, there is no meaningful work to be done (attributed to Sinclair, Stubbs; but see their own writings)
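
To make the idea of a collocation concrete, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), one of several association measures used for this purpose; the slides themselves do not name a measure, so PMI is an assumption here. The function name `bigram_pmi`, the `min_count` threshold, and the toy text are all hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2).

    Pairs seen fewer than `min_count` times are skipped, because PMI is
    unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy text (hypothetical), far too small for meaningful statistics.
toy = "strong tea and strong coffee but powerful computer and strong tea".split()
for pair, score in bigram_pmi(toy):
    print(pair, round(score, 2))
```

A positive score means the pair co-occurs more often than the individual word frequencies would predict. Note that this works on raw word forms with no annotation, which is in the spirit of the corpus-driven position described above.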

Reconciling Views
 Corpora are excellent resources for
verifying the falsifiability, completeness,
simplicity, strength, and objectivity of
linguistic hypotheses (Meyer, 2002).
 They can provide additional linguistic
perspectives which improve our
knowledge of language and our ability to
use it (a weaker position)

The Rise of Corpora
Years        No. of corpus-based studies
To 1965      10
1966-1970    20
1971-1975    30
1976-1980    80
1981-1985    160
1986-1991    320
(McEnery and Wilson, 2001)

Range of Activities in
Corpus-based Linguistics
1. Corpus Design, Compilation and
Annotation
2. Developing Tools for (1) or
Analysis of Corpora
3. Linguistic Studies or Applications
using corpora developed in (1)
using tools developed in (2)
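
As an illustration of the kind of analysis tool meant in (2), the sketch below builds a very small keyword-in-context (KWIC) concordance. It is a minimal sketch under stated assumptions: the text is already tokenized, matching is case-insensitive, and the function name `kwic` and the example sentence are hypothetical rather than taken from any particular corpus tool.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) triples for each hit.

    `width` is the number of tokens of context kept on each side.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy sentence (hypothetical), pre-tokenized on whitespace.
toy = "He gives Tony books and she gives him tea".split()
for left, key, right in kwic(toy, "gives", width=2):
    print(f"{left:>10} | {key} | {right}")
```

Real concordancers add sorting by left or right context, regular-expression queries, and access to annotation layers, but the core operation is the windowed lookup shown here.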

Types of Corpora
 General (typically balanced and made available for general linguistic use) vs Specialized (dialect corpora, language acquisition corpora, learner corpora)
 Core Corpora
 Written vs Spoken Corpora
 Full-text vs Sample-text Corpora

More Typology
 Finite-size (Static) vs Dynamic/Monitor
Corpora
 Monolingual vs Multilingual Corpora
(Parallel corpora, Comparable Corpora)
 Rather Graded Distinctions:
 Raw vs Annotated,
 Balanced vs Pyramidal vs Opportunistic
Corpora
 Synchronic vs Diachronic

Some Examples of Corpora
 Pre-electronic corpora
 Biblical and Literary Studies
 Lexicographical
 Dialect Studies
 Language Education
 Grammatical
• Quirk’s Survey of English Usage Corpus (later
computerized) had 200 samples of 5000 words
each, half spoken, half written, tagged manually
with 65 grammatical features.

More Examples
 Major Electronic Corpora
 Brown Corpus (Francis and Kučera, 1965): Brown University Standard Corpus of Present-Day American English; 1 million words, 1961-64, 500 samples of 2000 words each
 Lancaster-Oslo-Bergen Corpus (LOB corpus): a comparable corpus of British English (fewer westerns exist, though!)
 Frown and FLOB: comparable corpora of the 1990s

Major Electronic Corpora
 Also modeled after Brown:
 Kolhapur Corpus of Indian English
 Wellington Corpus of New Zealand English...
 London-Lund Corpus (1975): 100 samples of 5000 words each of spoken data; the major spoken corpus until the mid-1990s; predominantly highly educated adult speakers
 Lancaster/IBM Spoken English Corpus (SEC): better balance, 11 categories, detailed prosodic annotation
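
The sample-based designs quoted in these slides imply a nominal corpus size of (number of samples) x (words per sample). The short sketch below just carries out that arithmetic using the figures stated in the slides; the dictionary and function names are hypothetical, and actual word counts in the released corpora may differ slightly from these nominal totals.

```python
# (number of samples, words per sample), as quoted in the slides
SAMPLING_DESIGNS = {
    "Survey of English Usage": (200, 5000),
    "Brown": (500, 2000),
    "London-Lund": (100, 5000),
}


def nominal_size(n_samples, words_per_sample):
    """Nominal corpus size implied by a fixed-length sampling design."""
    return n_samples * words_per_sample


sizes = {name: nominal_size(*design) for name, design in SAMPLING_DESIGNS.items()}
for name, size in sizes.items():
    print(f"{name}: {size:,} words")
```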

Major Electronic Corpora
 Longman Dictionary of Contemporary English (LDOCE); COBUILD Project (Bank of English): 524 million words as of 2004.
 International Corpus of English
 International Corpus of Learner English: 2M words, 500-word essays, different English backgrounds
 Longman Learner’s Corpus, HKUST Learner’s Corpus
 CHILDES Child Language Data Exchange System
 European Corpus Initiative – ECI – 93 million words
 Many corpora are available from LDC and ELDA/ELRA.

Major Natural Language
Processing Corpora
 Penn Treebank (1993): 4.9 million words, tagged and parsed, not balanced (optional paper in course pack)
 TIPSTER corpus: AP Newswire and Wall Street Journal; mainly used for Information Retrieval
 More variety is offered by national corpora and dependency treebanks

National Corpora
 British National Corpus (BNC)
 100 million words, 90% written, 10% spoken; BNC Baby: 2 million word sampler; SARA and Xaira: its own corpus query tools; wholly tagged by the CLAWS tagger
 American National Corpus (ANC)
 In progress, preliminary releases available
 Czech National Corpus (optional paper in course pack)
 12 full-time staff working for 5 years in a specialized institute
 100 million words
 Partially tagged and parsed in the Prague Dependency School tradition
 See METU Online links

Lecture 2
 Corpus Design Issues
 Readings:

• Tognini-Bonelli (2001) Corpus Issues. Ch 3
• McEnery et al. (2006) Units A7-A9, B1 (all appear to be one article in the course pack)
• Meyer (2002) Planning the Construction of a Corpus. Ch 2.
