CHAP 6

What's the Point of a Small Corpus?

Introduction
Corpora, which are collections of written or spoken texts used for language research, come in
different sizes. There are two main trends in corpus compilation:
1. Large Corpora: These contain hundreds of millions of words, like the Bank of English
and the Cambridge International Corpus.
2. Small, Specialized Corpora: These focus on specific types of language or genres.
Why Use a Small Corpus?
While large corpora allow for extensive analysis of language patterns, small corpora have
benefits and purposes of their own. Their limitations are considered first, then their advantages.
Limitations of Small Corpora
John Sinclair, a pioneer in Corpus Linguistics, argued that small corpora have limitations:
 Limited Data: Small corpora may not provide enough data to identify rare phrases or
words.
 Example: Sinclair searched for the phrase "fit into place" in corpora of different sizes.
He only found a few examples in a 200-million-word corpus, but none in smaller ones.
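The arithmetic behind Sinclair's point can be sketched roughly as follows. The figure of 4 hits is a hypothetical stand-in for "a few examples" (the notes give no exact count), and the function name `expected_hits` is invented for illustration. The sketch also assumes the phrase is spread evenly through the language, which is a simplification.

```python
def expected_hits(hits_in_large: int, large_size: int, small_size: int) -> float:
    """Scale an observed count to a corpus of a different size.

    Assumes the phrase occurs at the same rate everywhere (a simplification).
    """
    rate = hits_in_large / large_size  # occurrences per word
    return rate * small_size


# Hypothetical: 4 hits for a rare phrase in a 200-million-word corpus,
# scaled down to a 250,000-word small corpus.
estimate = expected_hits(4, 200_000_000, 250_000)
print(f"Expected hits in 250,000 words: {estimate:.3f}")
```

An expected count far below one means the phrase will most likely not appear at all in the small corpus, which is why rare items call for large corpora.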
Advantages of Small Corpora
However, small corpora can be useful in certain contexts:
1. Frequent Grammatical Items:
o Grammatical Study: Pronouns, prepositions, and auxiliary verbs are very
common, so even a small corpus contains enough instances to study them.
o Example: The use of the word "the" can be studied effectively in a small
corpus.
2. Contextual Analysis:
o Context Preservation: Small corpora often keep the context of the language use,
which is lost in large corpora.
o Example: In large corpora, it’s hard to determine the specific context of each
text. Small corpora, however, maintain this context, making it easier to
understand how language is used in specific situations.
3. Specialized Research:
o Targeted Studies: Small, specialized corpora are designed for specific research
purposes. This is particularly useful in fields like English for Specific Purposes
(ESP) and English for Academic Purposes (EAP).
o Example: Teachers and learners in ESP/EAP benefit more from focused data
relevant to their specific needs rather than vast amounts of general language data.
4. Practicality:
o Ease of Compilation: For individual researchers or teachers, compiling a small
corpus is more practical. Collecting and transcribing data for a small corpus is
manageable.
Conclusion
Small corpora are not just smaller versions of large corpora; they serve different purposes and
provide unique insights, especially when studying frequent grammatical items, maintaining
context, and focusing on specific language uses. They are particularly useful in educational
settings and specialized language research, providing targeted and practical data for specific
learning needs.
How Small and How Specialized Can a Corpus Be?
Understanding Small Corpora
1. What is a Corpus? A corpus is a collection of written or spoken material that researchers use
to study language. It serves as a database for analyzing various linguistic elements such as
grammar, vocabulary, and usage patterns.
2. What Defines a Small Corpus? The size of a corpus is relative and depends on its content
(written or spoken) and its purpose.
a. Differences Between Spoken and Written Corpora:
 Spoken Corpus: Smaller in size because collecting and transcribing spoken data is time-consuming.
 Written Corpus: Generally larger because written texts are easier to compile and digitize.
b. Size Examples:
 Spoken Corpus: A corpus with over a million words of speech is considered large.
 Written Corpus: A corpus with fewer than five million words of text is seen as small.
 General Agreement: Many researchers agree that small corpora contain up to 250,000 words.
c. Use Cases of Small Corpora:
 Even small corpora can be highly effective if they are well-constructed and representative of the
language or topic being studied.
 Examples:
o The 52,000-word POTTI Corpus used for studying post-observation teacher trainee
interaction.
o The 34,000-word Corpus of American and British Office Talk (ABOT).
3. Importance of Design and Representativeness:
 The quality and representativeness of a corpus are more critical than its size.
 A small, well-designed corpus can yield significant insights if it accurately represents the
language usage it aims to study.
Exploring Specialization in Corpora
1. Specialization Parameters: A corpus can be specialized in various ways, depending on the
research focus. Here are some parameters for specialization:
a. Purpose:
 Designed to investigate specific grammatical or lexical items.
b. Context:
 Focused on particular settings, participants, and communicative purposes.
c. Genre:
 Includes specific types of texts like grant proposals or sales letters.
d. Text Type:
 Consists of particular text types, such as biology textbooks or casual conversations.
e. Subject Matter:
 Concentrates on specific topics, such as economics or environmental studies.
f. Variety of English:
 Focuses on a particular variety of English, such as Learner English or a regional dialect.
2. Examples of Specialized Corpora:
 Limerick Corpus of Irish English: Focuses on Irish English.
 International Corpus of Learner English (ICLE): Contains writings by English learners from
various backgrounds.
 Michigan Corpus of Spoken Academic English (MICASE): Collects spoken academic
English.
3. Size and Specialization:
 Specialized corpora can be large. For instance:
o ICLE: Contains three million words.
o Cambridge Learner Corpus: Over thirty million words.
4. Degrees of Specialization:
 General Specialization: Represents a broad language variety and includes various genres.
 Focused Sub-Corpora: Specific sub-sections within a corpus. For example, the Hong Kong
Corpus of Spoken English has sub-corpora for conversation, business discourse, academic
discourse, and public discourse.
5. ESP/EAP Corpora:
 ESP (English for Specific Purposes) and EAP (English for Academic Purposes) corpora are
highly specialized for particular research or teaching needs.
 Examples:
o Corpus of Environmental Impact Assessment (EIA): Contains 250,000 words from
summary reports commissioned by the Hong Kong Environmental Protection
Department.
o Indianapolis Business Learner Corpus (IBLC): Contains 200 application letters
written by business communication students from three different countries.
Importance of Size and Specialization in Corpora
1. Targeted and Reliable Results:
 Specialized corpora, even if small, can provide reliable and insightful results because they are
designed to accurately represent a specific register or genre.
2. Pattern and Distribution:
 Specialized lexis (vocabulary) and structures (grammar) tend to show consistent patterns in
specialized corpora, making them effective for targeted research.
Practical Guidelines for Building a Small Specialized Corpus
Key Considerations
1. Purpose and Research Questions:
o Clearly define the purpose of your corpus and the research questions you aim to answer.
2. Representativeness:
o Ensure your corpus represents the language, genre, or context you are studying.
3. Data Collection:
o Choose appropriate methods for collecting data, whether written or spoken, to meet your
research needs.
4. Ethics and Permissions:
o Obtain necessary permissions and ensure ethical guidelines are followed during data
collection.
5. Data Organization:
o Organize your data systematically for easy analysis and retrieval.
6. Annotation and Tagging:
o Annotate and tag your data for specific linguistic features relevant to your study.
7. Analysis Tools:
o Use appropriate software tools for analyzing your corpus data.
By following these guidelines, you can build a small specialized corpus that is effective for your
specific linguistic research needs.
Important Considerations in Designing a Small Specialised Corpus
1. Representativeness
Definition: Representativeness is about ensuring that the corpus accurately reflects the range of
language use in the population it is meant to represent.
Types of Variability:
 Situational Variability: This refers to the different contexts or settings in which
language is used. For example, in a business context, this could include emails, meetings,
reports, etc.
 Linguistic Variability: This includes the variety of language features, such as
vocabulary, grammar, and sentence structures used in those contexts.
Example: If you're studying customer service interactions, your corpus should include samples
from phone calls, emails, chat interactions, and face-to-face conversations across different
companies and industries.
2. Establishing Situational Representativeness First
Importance: Before you can claim that your corpus is linguistically representative, you need to
ensure it covers the full range of situations or contexts in which the language is used.
Example: If your focus is on technical manuals, you need to include manuals from various fields
like electronics, software, automotive, and medical equipment. This ensures that you capture the
different ways technical language is used across these fields.
3. Sampling from Typical Situations
Practical Limitations: It's often impractical to collect samples from every possible situation,
especially for a small corpus. For instance, getting emails from every single business sector is
unrealistic.
Solution: Focus on a range of typical or common situations within the genre. This approach
helps in capturing the diversity of language use without needing exhaustive sampling.
Example: In the ABOT Corpus, which studied face-to-face office interactions, data were
collected from various organizations and sectors such as education, publishing, and retail. This
ensured that the corpus wasn’t biased towards any single setting.
4. Linguistic Representativeness
Link to Situational Representativeness: Once you have a variety of situational samples, you
can then assess linguistic variability. This step ensures that the language features in your corpus
reflect those in the broader population.
Sample Size and Quantity:
 Text Length: Ideally, each text sample should be around 1,000 words. This length is
often sufficient to capture the linguistic features of the text.
 Number of Samples: Aim for at least five, and preferably ten, samples per genre or
register. This helps ensure that the findings are reliable and not skewed by anomalies in a
few texts.
Example: Biber's research showed that common linguistic features are relatively stable across
1,000-word samples. Therefore, even a small corpus with 1,000-word samples can provide
reliable insights into language use.
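The sampling guidelines above (about 1,000 words per text, at least five samples per genre) can be turned into a quick design check. This is a minimal sketch: the function `check_design`, the genre names, and the word counts are all made up for illustration. Short texts are only flagged for information, not rejected, since complete texts shorter than 1,000 words can still belong in the corpus.

```python
def check_design(samples: dict[str, list[int]],
                 min_samples: int = 5,
                 target_len: int = 1000) -> dict[str, list[str]]:
    """Flag genres that fall short of the sampling guidelines.

    `samples` maps a genre name to the word count of each text in it.
    """
    warnings: dict[str, list[str]] = {}
    for genre, lengths in samples.items():
        notes = []
        if len(lengths) < min_samples:
            notes.append(
                f"only {len(lengths)} samples (guideline: at least {min_samples})"
            )
        short = sum(1 for n in lengths if n < target_len)
        if short:
            notes.append(f"{short} texts under {target_len} words")
        if notes:
            warnings[genre] = notes
    return warnings


# Hypothetical corpus design: word counts per text, grouped by genre.
design = {
    "decision-making": [1200, 1000, 1500, 1100, 1300],
    "reporting": [900, 1000],
}
for genre, notes in check_design(design).items():
    print(genre, "->", "; ".join(notes))
```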
5. Challenges with Small Corpora
Short Texts: In some cases, texts may be shorter than 1,000 words. For instance, many spoken
interactions (like customer service calls) are brief and don’t reach 1,000 words.
Complete Texts: It’s more important to collect complete texts or interactions, even if they are
shorter than the ideal length. This approach maintains the context and ensures the text is
representative of its genre.
Example: In workplace emails, many messages may be brief. Instead of forcing a length
requirement, it’s better to include complete emails to accurately reflect email communication in
the workplace.
6. Dealing with Under-represented Genres
Local Densities: Be cautious of local densities, where certain words or phrases appear frequently
in one text but are not common across the genre. This can skew your results if not identified.
Comparison: To avoid skewed results, compare broader categories or macro-genres rather than
focusing on individual genres with few samples.
Example: In the ABOT Corpus, comparing decision-making and procedural discourse (the most
frequent genres) provided more reliable results than comparing less common genres like
reporting, which had fewer samples.
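One way to spot a local density is to look at how a word's occurrences are spread across texts, not only at its total frequency. The sketch below is an assumption about how this might be done, not a method described in the notes. The function name, the text names, and the sample sentences are invented.

```python
from collections import Counter


def local_density_report(texts: dict[str, list[str]], word: str) -> dict:
    """Summarise how a word's occurrences are spread over the texts.

    Returns the total count, the number of texts containing the word
    (its range), and the share of all hits found in the single densest text.
    """
    per_text = {name: Counter(tokens)[word] for name, tokens in texts.items()}
    total = sum(per_text.values())
    texts_with_word = sum(1 for n in per_text.values() if n > 0)
    top_share = max(per_text.values()) / total if total else 0.0
    return {"total": total, "range": texts_with_word, "top_share": top_share}


# Invented mini-corpus: "budget" is concentrated in one meeting.
texts = {
    "meeting_1": "the budget budget budget is the budget".split(),
    "meeting_2": "we need the report".split(),
    "meeting_3": "the budget is fine".split(),
}
print(local_density_report(texts, "budget"))
```

A high `top_share` combined with a low `range` suggests that the word's frequency reflects one text and not the genre as a whole.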
7. Purpose of the Corpus
Research Focus: The corpus should be designed to answer specific research questions. This
focus will guide the selection of texts and ensure the corpus is fit for its intended purpose.
Example: The Indianapolis Business Learner Corpus (IBLC) aimed to study language use and
genre acquisition among business students. It included application letters written by students
from different countries, allowing researchers to analyze cross-cultural differences in business
communication.
Summary
When designing a small, specialized corpus, follow these detailed steps:
1. Ensure Representativeness:
o Include a wide range of situational and linguistic variability.
o Start with situational representativeness.
2. Establish Situational Representativeness:
o Cover different settings and contexts within the target genre.
3. Sample from Typical Situations:
o Collect samples from a range of typical scenarios.
o Avoid bias by not focusing on a single setting.
4. Achieve Linguistic Representativeness:
o Use text samples of around 1,000 words.
o Aim for at least five, and preferably ten, samples per genre.
5. Address Challenges with Small Corpora:
o Collect complete texts even if they are shorter.
o Maintain context for accurate representation.
6. Handle Under-represented Genres:
o Be aware of local densities.
o Compare broader categories to avoid skewed results.
7. Design for Specific Research Purposes:
o Tailor the corpus to fit the research questions.
o Ensure the data collected aligns with the study’s goals.
By adhering to these detailed considerations, you can create a small, specialized corpus that is
both representative and useful for your specific research needs.
Compiling and Transcribing a Small Specialized Spoken Corpus: A Detailed Guide
1. Understanding the Limitations and Context
 Small Corpus Challenges: A small corpus has limitations, but understanding the context
can help overcome these.
 Importance of Context: For specialized corpora, knowing the setting where the data was
collected is crucial. This background information is necessary to make sense of
specialized discourse.
2. Example: The Hong Kong Corpus of Spoken English (HKCSE)
 Observation Period: Before collecting data, researchers observed the organization. This
helped them choose recording sites that represented various functions within the
organization.
 Essential for Interpretation: This initial observation period was crucial for later
interpreting the data.
3. Ethnographic Methods in Corpus Studies
 Combining Methods: Ethnographic methods like observation, note-taking, and
interviews can be used alongside corpus studies.
 Value of Contextual Information: For small specialized corpora, contextual information
is invaluable for interpreting data and qualitatively analyzing results.
4. Untranscribed Data and Its Uses
 Recording More Than Needed: Typically, more data is recorded than transcribed.
Untranscribed data can still provide useful background information.
 Example: ABOT Corpus: Only a small portion of the 30 hours of recordings was
transcribed.
5. Consulting Participants
 Aiding Transcription and Compilation: Sometimes, it is necessary to consult
participants or organization representatives to help with transcription or corpus
compilation.
 Example from HKCSE: When compilers couldn't assign a particular encounter to any
genre, they created a new category after consulting an employee.
6. Incorporating Background Information in Corpus Design
 Linking Linguistic Practices to Context: Detailed information about speakers, the goals
of interactions, and the setting can link linguistic practices to specific contexts.
 Collecting Speaker Information: Information such as place of birth, gender, occupation,
and educational background can be collected and included in the corpus database.
7. Example: Cambridge and Nottingham Business English Corpus (CANBEC)
 Categories of Collected Information:
1. Relationship between speakers: Peer, manager-subordinate, colleagues from the
same or different departments.
2. Topic: Sales, marketing, production.
3. Purpose of the meeting: Internal/external, reviewing, planning.
4. Speaker information: Age, title, department, level in the company.
5. Company type and size.
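The five categories above could be stored as structured metadata alongside each transcript. The sketch below is purely illustrative: the class names, field names, and sample values are assumptions, and nothing here describes how the CANBEC database is actually organised.

```python
from dataclasses import dataclass, field


@dataclass
class Speaker:
    # Hypothetical fields mirroring "speaker information" above.
    age: int
    title: str
    department: str
    level: str


@dataclass
class MeetingRecord:
    # Hypothetical fields mirroring the other four categories.
    relationship: str      # e.g. "peer", "manager-subordinate"
    topic: str             # e.g. "sales", "marketing", "production"
    purpose: str           # e.g. "planning", "reviewing"
    company_type: str
    company_size: int      # assumed here to be a number of employees
    speakers: list[Speaker] = field(default_factory=list)


def filter_by(records: list[MeetingRecord], **criteria) -> list[MeetingRecord]:
    """Return the records whose attributes match every given criterion."""
    return [
        r for r in records
        if all(getattr(r, key) == value for key, value in criteria.items())
    ]


# Invented example records.
records = [
    MeetingRecord("peer", "sales", "planning", "manufacturing", 250,
                  [Speaker(41, "Sales Manager", "Sales", "middle")]),
    MeetingRecord("manager-subordinate", "marketing", "reviewing", "retail", 40),
]
peer_meetings = filter_by(records, relationship="peer")
print(len(peer_meetings), peer_meetings[0].topic)
```

Keeping this information in a queryable form is what makes it possible to compare language use across relationships, topics, or meeting purposes later on.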
8. Transcribing Data
 Level of Detail: The level of detail in transcription depends on the project's aim. More
detailed transcriptions are closer to the original interaction and provide more features for
analysis.
 Transcription Conventions: These should be computer-readable, often in plain text
format.
9. Interactive Features in Transcriptions
 Indicating Interactive Features: Pauses, overlaps, interruptions, and non-linguistic
features like laughter should be indicated in small spoken corpora.
 Example: Limerick Corpus of Irish English: This corpus shows interactive features but
does not transcribe prosodic features (intonation).
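Because transcription conventions are kept computer-readable, interactive features can be counted automatically. The sketch below uses an invented convention, with "(.)" for a pause, square brackets for overlapping speech, and "<laughs>" for laughter. It is not the convention of any corpus named in these notes, and the transcript is made up.

```python
import re

# Hypothetical markers; real projects define their own conventions.
MARKERS = {
    "pause": re.compile(r"\(\.\)"),
    "overlap": re.compile(r"\[[^\]]*\]"),
    "laughter": re.compile(r"<laughs>"),
}


def count_features(transcript: str) -> dict[str, dict[str, int]]:
    """Count each marked feature per speaker in a 'SPEAKER: utterance' transcript."""
    counts: dict[str, dict[str, int]] = {}
    for line in transcript.strip().splitlines():
        speaker, _, utterance = line.partition(":")
        tally = counts.setdefault(speaker.strip(), {name: 0 for name in MARKERS})
        for name, pattern in MARKERS.items():
            tally[name] += len(pattern.findall(utterance))
    return counts


# Invented three-turn exchange.
sample = """
S1: so (.) we could [move the deadline]
S2: [yeah that works] <laughs>
S1: great (.) (.) let's do that
"""
for speaker, tally in count_features(sample).items():
    print(speaker, tally)
```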
10. Detailed Transcriptions
 Prosodic Features: More detailed transcriptions include prosodic features, such as
syllable emphasis and intonation patterns.
 Example: VOICE and HKCSE: These corpora include detailed prosodic features using
specific transcription systems.
11. Customizing Transcription Conventions
 Adapting to Data: Transcription conventions should capture significant features of the
data and be suitable for the research purposes. They may need to be adapted for different
projects.
By following these steps, you can effectively compile and transcribe a small specialized spoken
corpus, ensuring that the data is accurately represented and useful for analysis.
Learning from a Small Specialized Corpus
A small specialized corpus is a collection of texts focused on a specific area, such as business
meetings or academic writing. Analyzing such a corpus provides unique insights into language
use within that specific context. The main things that can be learned from a small
specialized corpus are outlined below:
1. Understanding Contextual Language Use
Context and Language Connection
 Close Link: In a small corpus, the language is closely tied to its context, unlike large corpora
where the context might be diluted.
 Accessible Texts: The texts are more accessible, meaning you can easily understand why certain
words or phrases are used.
Informing Corpus-Based Analysis
 Background Information: When you compile a small specialized corpus, you often have
detailed background information, helping you interpret the data accurately.
 Examples: For instance, if you're studying business meetings, knowing the meeting's purpose can
explain why specific jargon or phrases are used.
2. Revealing Social and Cultural Context
Linguistic Patterns
 Patterns and Context: The patterns in language use can reveal much about the social and
cultural context of the data.
 Example: The phrase "going forward" in business settings marks in-group membership,
indicating it's part of the business culture.
Signature Uses of Language
 Specific Phrases: Certain phrases become "signatures" of specific contexts. For example, "going
forward" is used frequently in business meetings instead of "in the future."
 Localised Uses: These signature uses are often localised to specific situational conditions like
gender, power, or discourse goals.
3. Genre, Topic, and Participant Relationships
Influence of Genre
 Types of Interactions: Different genres (types of interactions) influence language use. For
instance, in collaborative genres, participants use language differently than in unidirectional
genres.
 Example: In the CANBEC corpus, "issue" and "problem" are used differently based on the
meeting type and participants' relationships.
Speaker Relationships
 Power Dynamics: The relationship between speakers (e.g., manager vs. subordinate) affects
word choice and language style.
 Example: "Issue" is more frequent in interactions between managers and subordinates, while
"problem" is more common among peers.
4. Detailed Examples from CANBEC Corpus
Lexical Items: "Issue" and "Problem"
 Topic and Purpose: The use of "issue" and "problem" varies based on the topic and purpose of
meetings.
o Issue: More frequent in human resources and marketing meetings.
o Problem: More common in technical and procedural meetings.
 Speaker Relationship:
o Issue: More common between managers and subordinates.
o Problem: More common in peer discussions.
Modals of Obligation
 Collaborative vs. Unidirectional: Modals like "have to," "need to," and "should" are used more
in collaborative settings where participants are equals.
o Collaborative Genres: Frequent use of "we" and "you" with modals.
o Unidirectional Genres: More use of "I" with modals, avoiding direct language to reduce
face-threatening acts.
5. Semantic Prosodies and Collocations
Negative Connotations
 Phrase Use: Phrases like "associated with" often carry a negative connotation in specialized
contexts, such as environmental reports.
 Example: "Difficulties associated with hydraulic dredging" implies problems without directly
stating them.
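Semantic prosody is usually examined by reading concordance lines, that is, every occurrence of a phrase shown with its surrounding words. The following is a minimal keyword-in-context (KWIC) sketch. The function name is invented, and the sample text is made up apart from the phrase quoted in the example above.

```python
def kwic(tokens: list[str], node: list[str], window: int = 4) -> list[str]:
    """Return one concordance line per occurrence of the multi-word `node`.

    Matching is case-insensitive; `window` is the number of context
    words shown on each side.
    """
    n = len(node)
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in node]
    lines = []
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            left = " ".join(tokens[max(0, i - window):i])
            hit = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + window])
            lines.append(f"{left} [{hit}] {right}".strip())
    return lines


# Invented sample text for illustration only.
text = ("difficulties associated with hydraulic dredging were noted "
        "and the risks associated with noise remain")
for line in kwic(text.split(), ["associated", "with"], window=2):
    print(line)
```

Reading down the left-hand context ("difficulties", "risks") is how an analyst would notice that the words preceding the phrase tend to be negative.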
6. Comparing Small and Large Corpora
Validating Findings
 Generalization: By comparing a small corpus with a larger one, you can check if the findings are
consistent across different contexts.
 Example: Checking if "associated with" has a negative connotation in both small specialized and
large general corpora.
Identifying Keywords
 Frequency Analysis: Comparing small and large corpora helps identify keywords that are
unusually frequent in the small corpus.
 Example: Words that are characteristic of business interactions can be highlighted by
comparing CANBEC with a general corpus.
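Keyword identification can be sketched as a frequency comparison between the small corpus and a reference corpus. The notes do not name a statistic, so the log-likelihood measure used below is an assumption (it is one common choice in corpus linguistics). The function names and the two toy corpora are invented.

```python
import math
from collections import Counter


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood keyness score.

    a, b: frequency of the word in the study and reference corpus.
    c, d: total size (in tokens) of the study and reference corpus.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in reference corpus
    score = 0.0
    if a:
        score += a * math.log(a / e1)
    if b:
        score += b * math.log(b / e2)
    return 2 * score


def keywords(study: list[str], reference: list[str], top: int = 5):
    """Rank words that are relatively more frequent in the study corpus."""
    s, r = Counter(study), Counter(reference)
    c, d = len(study), len(reference)
    scored = [
        (word, log_likelihood(freq, r[word], c, d))
        for word, freq in s.items()
        if freq / c > r[word] / d  # keep only over-represented words
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Toy corpora, far too small for real conclusions.
study = "going forward we review the budget going forward".split()
reference = "the cat sat on the mat and the dog sat too in the future".split()
for word, score in keywords(study, reference):
    print(f"{word}: {score:.2f}")
```

With real data the reference corpus would be a large general corpus, and words with the highest scores would be candidates for the "signature" vocabulary of the specialized setting.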
Conclusion
A small specialized corpus offers deep insights into specific language use, showing how closely
language patterns are tied to their contexts. It is particularly valuable for studying detailed and
contextual language use, revealing social and cultural nuances, and identifying signature phrases
unique to particular settings. This detailed examination of language use in specific contexts
provides a richer understanding than what might be achieved with larger, more generalized
corpora.
