Corpus Approaches in Discourse Analysis

Republic of the Philippines
NUEVA VIZCAYA STATE UNIVERSITY

Bayombong, Nueva Vizcaya
GRADUATE SCHOOL
Subject: Discourse Analysis Professor: William D. Magday Jr., PhD.

Time: 11:00-2:00 Student: Frelita O. Bartolome (PHD-MLE)
CORPUS APPROACHES IN DISCOURSE ANALYSIS

Chapter 7
Meaning of Corpus
Corpus is derived from the Latin word “corpus”, which means “body”.
Corpus – body (collection) of texts
Corpora – plural form of corpus
A corpus is a principled and large collection (body) of authentic texts
that are stored in a computer and analyzed using software designed for
corpus analysis.
PRINCIPLED: Collecting the texts for the corpus is a planned operation, not
a random one. Researchers normally have a research question in mind before
they start compiling the corpus.
LARGE COLLECTION OF AUTHENTIC TEXTS: Naturally occurring examples of
language (written or spoken). Genuine communication of people going about
their normal business (Sinclair 1996), usually with millions of texts
depending on the corpus/research objectives.
STORED IN A COMPUTER: Storing texts in a computer makes corpus
analysis easier because a computer allows for fast and accurate analysis of a
large amount of data. A computer can give insights into patterns not easily
detectable by humans. However, it also has its limitations.
SOFTWARE: Software applications for data analysis, such as LancsBox and
AntConc.
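To make the idea of software-based corpus analysis concrete, the following is a minimal sketch of two basic operations that concordance programs typically offer: a word-frequency count and a key-word-in-context (KWIC) listing. It is not how LancsBox or AntConc are implemented. The function names `tokenize` and `kwic`, the window size and the sample sentence are all assumptions made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, node, window=3):
    """Return key-word-in-context lines: up to `window` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Invented sample text (not taken from any real corpus)
sample = "It is sort of late. I sort of like it, and it is kind of you."
tokens = tokenize(sample)
freq = Counter(tokens)            # word-frequency list
concordance = kwic(tokens, "of")  # every occurrence of "of" with its context

for line in concordance:
    print(line)
```

A real corpus would be read from files and would hold far more text, but the principle is the same: the computer finds every instance of a search word and shows what surrounds it.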
Kinds of Corpora
1. General or Reference Corpora (Reppen and Simpson 2002: 95)
A general corpus aims to represent language in its broadest sense
and to serve as a widely available resource for baseline or comparative
studies of general linguistic features. A general corpus provides sample
data from which we can make generalizations about spoken and written
discourse as a whole, and about the frequencies of occurrence and
co-occurrence of particular aspects of language in discourse.
e.g.
to what extent hedges such as sort of and kind of are typical of
English in general, compared with what words these hedges
typically collocate with in spoken academic discourse (Poos and
Simpson 2002).
2. Specialized corpora
A specialized corpus, as Hunston (2002: 14) explains, is “a corpus of texts
of a particular type, such as newspaper editorials, geography textbooks,
academic articles in a particular subject, lectures, casual conversations,
essays written by students etc. It aims to be representative of a given
type of text. It is used to investigate a particular type of text.” Specialized
corpora are required when the research question relates to the use of
spoken or written discourse in particular kinds of text or in particular
situations (e.g. the use of hedges in casual conversation, or the ways in
which people signal a change of topic in an academic presentation), or when
the study looks at discourse features in a particular academic genre.
2.1 The Michigan Corpus of Academic Spoken English (MICASE)

This is an example of a specialized corpus that was designed with a
particular research project in mind. MICASE is an open-access
corpus and is available without charge to people who wish to use it
(www.isra.umich.edu/eli/micase/index.htm). It has data from a wide
range of spoken academic genres, as well as information on speaker
attributes and characteristics of the speech events contained in the data.
Findings from MICASE projects have been integrated into training
courses for international teaching assistants and for the teaching of oral
presentations (Reinhart 2002). The MICASE data has also been used in
the development of English language tests (MICASE online).
2.2 The British Academic Spoken English (BASE)
The British Academic Spoken English (BASE) corpus
(www.rdg.ac.uk/Acadepts/II/basecorpus/) has been developed at the
University of Warwick and the University of Reading, UK, as a similar
spoken corpus to the Michigan corpus. The study of this corpus has
important implications for the development of English for academic
purposes courses which aim to prepare students to study in English-medium
universities. An example of this is the study based on the
British corpus which looks at the relationship between lexical density
and speed in academic lectures (Nesi 2001).
2.3 The British Academic Written English (BAWE)
The British Academic Written English (BAWE) corpus is a specialized corpus
that is based on written discourse alone, developed at the University of
Warwick, the University of Reading and Oxford Brookes University in
the UK (www.Warwick.ac.uk/fac/soc/celte/bawe/). This corpus
examines students’ written assignments at different levels of study and in
a range of disciplines, with the goal of providing a database for the use of
researchers and teachers to enable them to identify and describe academic
writing requirements in British university settings. The BAWE corpus
includes contextual information on the students’ writing, such as the
gender and year of study of the student, details of the course the
assignment was set for, and the grade that was awarded to the piece of work,
so as to be able to consider the relationships between these variables and
the nature of the students’ written academic discourse.
2.4 The TOEFL spoken and written academic language corpus
A specialized corpus which includes both spoken and written discourse is
the TOEFL 2000 Spoken and Written Academic Language Corpus. The
TOEFL corpus was made up of 2.7 million words and aimed to represent
the spoken and written genres that university students in the US have
to participate in, or read, such as class sessions, office hour
conversations, study group discussions, on-campus service encounters,
textbooks, reading packs, university catalogues and brochures.
A key observation of the TOEFL study was that spoken genres in US
university settings are fundamentally different from written genres. The
study found, however, that classroom teaching in the US was similar in
many ways to conversational genres. It found that language use varied
in the textbooks of different disciplines, but not in classroom teaching in
different disciplines.
Design and Construction of Corpora
Although there are already established corpora, the data needed for your
study of a particular genre in a particular setting may not be available.
In that case, you can create your own corpus for the study.
E.g.
Hyland’s study of the use of the personal pronouns I, me, we and us
in Hong Kong students’ academic writing. He examined
the extent to which student writers use self-mention to strengthen
their arguments and gain recognition for their claims. His findings
were then compared with an existing corpus related to discourse and
identity.
Issues to Consider in Constructing a Corpus
1. Authenticity, representativeness and validity of the corpus
Authenticity, representativeness and validity are also issues in
corpus construction, as well as whether the corpus should present a
static or a dynamic picture of the discourse under examination; that
is, whether it should be a sample of discourse use at a particular point
in time (a static, or sample corpus) or whether it should give more of a
'moving picture' view of the discourse that shows change in language
use over a period of time (a dynamic, or monitor corpus) (Kennedy
1998; Reppen and Simpson 2002).
2. Kinds of texts to include in the corpus
A key issue is what kind of texts the corpus should contain. This
decision may be based on what the corpus is designed for, but it may
also be constrained by what texts are available. Another issue is the
permanence of the corpus; that is, whether it will be regularly
updated so that it doesn't become unrepresentative, or whether it will
remain as an example of the use of discourse at a particular point in
time (Hunston 2002).
3. Size of the texts in the corpus
The size of texts in the corpus is also a consideration. Some corpora
aim for an even sample size of individual texts. If, for example, the
corpus aims to represent a particular genre, and instances of the
genre are typically long, or short, this needs to be reflected in the
collection of texts that make up the corpus.
4. Sampling and representativeness of the corpus
Sampling is also an issue in corpus design. The key issue here is
defining the target population that the corpus is wishing to represent.
The representativeness of the corpus further depends on the extent
to which it includes the range of linguistic distribution in the
population. That is, different linguistic features are differently
distributed (within texts, across texts, across text types), and a
representative corpus must enable analysis of these various
distributions.
As Reppen and Simpson (2002: 97) explain, 'no corpus can be
everything to everyone'. Any corpus in the end 'is a compromise
between the desirable and the feasible' (Stubbs 2004: 113).
The Longman Spoken and Written English Corpus
The Longman Spoken and Written English (LSWE) Corpus is an
important example of a corpus study. The LSWE was used as the
basis for the Longman Grammar of Spoken and Written English. The
LSWE corpus is made up of 40 million words, representing four major
discourse types: conversation, fiction, news and academic prose, with
two additional categories, non-conversational speech (such as lectures
and public meetings) and general written non-fiction prose.
The main source of the conversational data in the corpus was British
English, although a smaller sample of conversational American
English data was added for comparison. The news data contained an
almost equivalent amount of British English and American English
data. The fiction sample drew on British English and American
English, as did the academic prose. The non-conversational speech
was all British English data and the general prose contained both
British English and American English data.
The study was designed to contain about 5 million words of text in
each discourse category. Most of the texts in the corpus were
produced after 1980 so the sample is mostly of contemporary British
and American English usage. The corpus was made up of 37,244
texts and approximately 40,026,000 words. The texts in the corpus
varied, however, in length. The newspaper texts tended to be the
shortest while fiction and academic prose were the longest.
The LSWE corpus aimed to provide a representative sampling of texts
across the discourse types it contained. The conversational data in
the corpus was collected in real-life settings and is many times larger
than most other collections of conversational data. Both the British
and American conversational data were collected from representative
samples of the British and US populations. The conversational data in
the corpus aimed to represent a range of English speakers in terms of
age, sex, social and regional groupings (Biber et al. 1999).
Conversational discourse has many features which are not typical
of more formal kinds of spoken discourse, or of written discourse.
Because conversation takes place in a shared context and in real time,
there is often less specification of meaning than there is in other
spoken and written genres. Also, because conversation takes place
between people who usually know each other, it is less influenced by
the traditional views of accuracy and correctness that are associated with
more publicly available texts.
Discourse Characteristics of Conversational English
1. Non-clausal units
What is a clause?
* A clause is a fundamental unit in the process of communication
because it is the minimal unit which can stand alone as constituting
a complete message, e.g. Go! Stop! and Run!
What is a non-clausal unit?
● These are utterances which do not contain an explicit subject and
verb
● These are units which are independent and self-standing
● They have no grammatical connection with what immediately
precedes or follows them
● Conversation is very interactive and avoids elaboration (Biber
et al. 1999)
* In written discourse, by contrast, we are very particular about
organization, cohesion and other norms of writing
* Why is non-clausal language more common in
conversation than in writing?
- Non-clausal units reflect the simplicity of grammatical constructions
resulting from real-time production in conversation, e.g. Poor kids,
Good for you.
- Many questions in conversation occur as noun phrases or as verbless
structures beginning with a wh-word, e.g. More sauce?, How about
your wife?
- Non-clausal units can also be related to ellipsis. For example,
Perfect! as a response is equivalent to the clause That’s perfect with
the subject and verb omitted.
2. Personal pronouns and ellipsis (you, I, we…)
Personal pronouns are widely used because, in a conversation, the
source and the receiver are expected to be in the shared context
where the conversation occurs, so they know who is being referred to
as I and you.
3. Situational ellipsis
This is ellipsis in which the situation or context makes the missing
element clear. It is informal and mostly used in conversation.
Example:
'Would you like a cup of tea?' can easily become 'Tea?' if you are
waving a mug at someone, or even just sitting in the kitchen.
4. Non-clausal units as elliptic responses
These often occur in conversation.
Ex:
Ryan: I’m gonna have to get Paul to come over too.
Marie: Why?
Marie simply says Why? (that is, Why do you have to get Paul to come
over?). In the shared social situation where the conversation is taking
place, both speakers know what they are talking about even without
completing the whole sentence.
5. Repetition
This is done to give emphasis to a point.
Ex:
Marie: It’s more drama living in this house than out of it.
John: (Quietly) I don’t know why.
Marie: (Loudly) I don’t know why.
6. Lexical Bundles – “bundles of words”
Lexical bundles are combinations of three or more words which are
identified empirically in a corpus of natural language. According to
research, these are more frequently used in conversational discourse.
Examples of lexical bundles are expressions such as:
take a look at, I don't know, on the other hand, as a result of and
all I am saying is, among many others.
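Because lexical bundles are identified empirically, they can be found by counting recurring word sequences. The sketch below is one simple way to do this, assuming a bundle is any three-word sequence that occurs at least twice. The function names, the threshold `min_count` and the two sample texts are invented for illustration, and published studies use much larger corpora and stricter frequency and distribution criteria. For simplicity, the sketch also lets sequences run across sentence boundaries.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def lexical_bundles(texts, n=3, min_count=2):
    """Count n-word sequences across texts; keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return {bundle: c for bundle, c in counts.items() if c >= min_count}

# Invented conversational snippets (hypothetical data)
texts = [
    "I don't know why. I don't know what he does.",
    "Well, I don't know. Take a look at this.",
]
bundles = lexical_bundles(texts, n=3, min_count=2)
print(bundles)
```

In this toy data only "i don't know" recurs, so it is the only sequence that passes the threshold; "take a look" occurs once and is filtered out.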
Performance Phenomena in Conversational Discourse

1. Silent and filled pauses in conversation
Performance phenomena that are characteristic of conversational discourse
include silent and filled pauses in the middle of a sentence or a
grammatical unit.
Ex:
Marie: You are being… a sixteen-year-old – twit. Sit down and
write down your guests.
2. Utterance launchers and filled pauses
Filled pauses at transition points in conversational discourse
typically use utterance launchers such as “and”, “well” and “right”
as the speaker prepares what they will say.
Ex:
Ryan: And… can I have a DJ, is that OK?
3. Attention signals in conversation
Speakers use another person’s name as an attention signal to make it
clear who they are speaking to, as in:
Marie: John?
John: What?
4. Response elicitors in conversation
There are a number of typical ways of eliciting a response in
conversational discourse. A question tag or a single item,
for example, can function as a response elicitor, as in:
Marie: We’ll keep an orderly party for Saturday night… All right?
5. Non-clausal items as response forms
Non-clausal items such as uh huh, mm, yeah and OK often
operate as response forms in conversation, as in:
Marie: …the DJ, why d’ you have a DJ? What does he do? Just play
records all night?
Ryan: Yeah.
6. Extended co-ordination of clauses
Conversational discourse often includes long extended turns.
These turns may be extended by co-ordination, where one clausal
unit is added to another, and then another, with items such as and
and but, or by the direct juxtaposition of clauses, as in:
Example:
Ryan: We’ll leave the gate open. We’ll leave the pontoon there, and
you will see, just see. You… you think I’m so stupid but if you…
you look around and open your eye, you will see.
Constructional Principles of Conversational Discourse

1. Conversational Prefaces – give the speaker time to plan what
to say next, as in the use of utterance launchers (and…. For
you to…)
2. Tags - Speakers add tags in many ways as an afterthought to
a grammatical unit in conversational discourse. They do this, for
example, by using a question tag at the end of a sentence. The effect of
this is to turn a statement into a question, and to reinforce what
has just been said.
Ex: You are going to the party. Right, John?
CONCLUSION:
Conversational discourse, then, has many features which are not
typical of more formal kinds of spoken discourse, or of written
discourse. Because conversation takes place in a shared context
and in real time, there is often less specification of meaning than
there is in other spoken and written genres. Also, because
conversation takes place between people who usually know each
other, it is less influenced by traditional views of accuracy and
correctness.
Corpus Studies of the Social Nature of Discourse
1. Spoken language in academic settings (Swales 2003)
Findings: Academic speaking across the university tended to be informal
and conversational. He found that the vocabulary of spoken discourse is
unpretentious, and therefore concluded that there were fewer barriers
to cross-disciplinary oral communication than there might be in
written academic communication, because of the convergence of
spoken discourse styles (e.g. the use of um, uh).
2. Dissertation acknowledgements (Hyland 2004)
He analyzed the social role of this part of the
discourse.
Findings: These simple texts bridge the personal and the public, the
social and the professional, and the academic and the moral.
Collocation and corpus studies
Collocation - words that often go together
Ex: light sleeper, early riser, make the bed
1. Dissertation acknowledgements (Hyland and Tse 2004)
A corpus approach was used to examine collocation in written and spoken
discourse. It was found that “special thanks” was the most common
way dissertation writers expressed gratitude, followed by “sincere thanks”
and “deep thanks”. They found this by searching their corpus
for the items that typically occur on the left side of the word
“thanks”.
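The search described above, finding what occurs immediately to the left of a word, can be sketched in a few lines. This is only an illustration of the general technique and not the procedure or software Hyland and Tse used. The function name `left_collocates` and the sample acknowledgement sentences are made up and are not drawn from their corpus.

```python
import re
from collections import Counter

def left_collocates(texts, node):
    """Count the word that appears immediately to the left of each occurrence of node."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for prev, word in zip(tokens, tokens[1:]):
            if word == node:
                counts[prev] += 1
    return counts

# Invented acknowledgement sentences (hypothetical data)
acknowledgements = [
    "Special thanks go to my supervisor.",
    "My sincere thanks to the participants, and special thanks to my family.",
    "Deep thanks are due to my colleagues.",
]
collocates = left_collocates(acknowledgements, "thanks")

# List the left-hand collocates from most to least frequent
for word, count in collocates.most_common():
    print(word, count)
```

Ranking the counts shows at a glance which modifier most often accompanies the search word in the sample.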
2. Personal ads (Ooi 2001)
Ooi used the concordance program WordSmith Tools to examine word
frequency and lexical and grammatical collocations in his example texts.
His interest was in how people in different cultures communicate on the
internet on the same topic and in the same genre, as well as what
gender differences there might be in the ways that they do this. He found,
for example, that many US writers used the terms “attractive” and
“great” as descriptive devices, whereas the Singaporean writers largely did
not. Where writers used the item “old”, many more men preceded this with
a specification of age (as in “39 years old”) than did women. The verb
“looking for” predominated in the data and commonly collocated with an
item which represented the writer’s “hope or dream”, as in “someone
special”, “that special woman” and “a discreet relationship”. The study
was looking for features of the language of romance, dating, intimacy
and desire.
Bruthiaux (1994) found in his study that writers frequently used personal
chaining and hyphenated items in personal advertisements, that is, strings
of adjectives and nouns, as well as conventionalized abbreviations for
collocations such as SAM for Single Asian Male and SWF for Single White
Female. The genre of personal ads, further, continuously uses linguistic
simplification and an economy of language that is characteristic of other
discourse types, such as newspaper headlines, academic note-taking
and conversational discourse.
Criticisms of corpus studies

1. The computer-based orientation of corpus studies leads to atomized,
bottom-up investigation of language use.
2. Corpus studies do not take account of the contextual aspects of
texts.
However, Tribble (2002) addressed these criticisms in detail in his
framework of contextual and linguistic features:
Contextual                      Linguistic
Social context                  Lexico-grammatical features
Communicative purpose           Text structure
Roles of readers and writers    Textual patterns
Shared cultural values
Knowledge of other texts
