Corpus Definitions. Last Year

The term The definition
Corpus-driven Studies Do not start with a previous hypothesis or theory.
Instead, there is a question that came out of the blue
and you need to find an answer for it.
Corpus-based Studies You start from a previous theory or hypothesis and try
to test it against a corpus.
A corpus Is a principled and large collection of authentic texts
that are stored in a computer and analyzed using
software designed for corpus analysis.
Corpus Linguistics (CL) Is the quantitative approach to linguistic analysis.
Authenticity Means that the texts in your collection are generated
by native speakers in natural communicative settings.
Corpus Is a collection of texts produced by native speakers.
That collection is used for linguistic analysis.
A learner corpus Is compiled from the writings of language learners for
pedagogical purposes.
Comparative corpora Are monolingual corpora. They comprise different
varieties of the same language.
Comparable corpora Comprise bilingual or multilingual texts that tackle the same
topics in multiple languages; however, the texts are not
exact translations of one another. One well-known
comparable corpus is the Wikipedia Corpus.
Parallel corpora Are corpora whose texts are direct translations of one
another. One good example is the UN Corpus. It is
multilingual, as it represents the six official UN
languages, and the texts are exact translations.
A multilingual corpus Comprises three or more different languages.

A bilingual corpus Comprises exactly two different languages.
A monolingual corpus Comprises one language, and possibly only one variety of that
language. Classic examples are the British National
Corpus and COCA.
Static corpora Are not updated once the researcher is done compiling them.
Most corpora are static. For
example, COCA stopped in 2017.
Dynamic corpora Grow over time as they are
continuously updated. In other words, new texts are
added to them regularly.
An annotated corpus Is a corpus that has undergone some sort of
linguistic analysis. An example of an annotated corpus
is the Quranic Arabic Corpus.
A raw corpus Is a corpus without any linguistic analysis; it contains only plain
text. One example is the Charles Dickens Corpus from
Project Gutenberg.
Lexicography Is the practice of making dictionaries.

A synchronic corpus Covers one period of time, which can be either
past or present.
A diachronic (or historical) corpus Comprises texts from more than one period of
time.
A specialized corpus Is a corpus that is specific to one language form and
one text genre.
A general corpus Is a corpus that covers:

• more than one form of language: spoken and written
• more than one text genre: conversations, song lyrics, novels, etc.
• more than one era: 1990s, 2000s, 2010s, etc.
Corpus Linguistics Is a quantitative, descriptive, and experimental field.
Corpus Is a collection of real-world texts.
Text archives Are Web-based repositories of fiction and non-fiction
texts that are scanned from the original
sources or typed in by volunteers.
Non-adjacent collocations • are separable collocations such as give it up.
Adjacent collocations • are inseparable collocations such as kick the bucket.
Threshold option • sets the minimum number of times that
two words must co-occur to be considered a
collocation.
Collocations • are words that frequently occur together, but
at least one of the words must be a content word.
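The collocation and threshold entries above can be illustrated with a minimal sketch. It assumes adjacent two-word collocations only. The `FUNCTION_WORDS` set, the function name `adjacent_collocations`, and the sample sentence are hypothetical stand-ins, not part of any corpus tool.

```python
from collections import Counter

# Hypothetical, very small list of function words; a real study would use
# a fuller stop list or part-of-speech tags to identify content words.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "it", "is"}


def adjacent_collocations(tokens, threshold=2):
    """Return adjacent word pairs that co-occur at least `threshold` times
    and contain at least one content word."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: count
        for pair, count in pair_counts.items()
        if count >= threshold
        and any(word not in FUNCTION_WORDS for word in pair)
    }


# Made-up sample text, for illustration only.
tokens = "strong tea and strong coffee but strong tea of the day of the week".split()
result = adjacent_collocations(tokens, threshold=2)
print(result)
```

In this sample, the pair "of the" also occurs twice, but it is filtered out because neither word is a content word.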
Concordance Plot Is a visualization function. It allows you to see the
distribution of your query in the different parts of the
file.
Concordance • gives you a list of all the contexts in which
the query occurs in the corpus.
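As a rough illustration of what a concordance produces, the sketch below lists every occurrence of a query word with a few words of context on each side. The function name `concordance`, the `window` parameter, and the sample sentence are assumptions made for this example; corpus tools offer many more options.

```python
def concordance(tokens, query, window=3):
    """List each occurrence of `query` as (left context, keyword, right context)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == query.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Made-up sample text, for illustration only.
tokens = "the cat sat on the mat and the dog sat by the door".split()
hits = concordance(tokens, "sat", window=2)
for left, keyword, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12} | {keyword} | {right}")
```

The token positions at which the query matches (the values of `i`) are the kind of information a concordance plot visualizes.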
Grams • mean words, and (n) stands for the number of
words: uni = one, bi = two, tri = three, quad =
four, etc.
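A minimal sketch of n-gram extraction, assuming the text has already been split into word tokens. The helper name `ngrams` and the sample sentence are illustrative only.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample text, for illustration only.
tokens = "corpus linguistics is fun".split()
bigrams = ngrams(tokens, 2)   # n = 2
trigrams = ngrams(tokens, 3)  # n = 3
print(bigrams)
print(trigrams)
```

A text of k tokens yields k - n + 1 n-grams, so the four-word sample gives three bigrams and two trigrams.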
The Wordlist function • gives you a list of word frequencies that can be
sorted in ascending or descending order.
The type count • represents the number of distinct words, that is,
the total number of words without duplicates.
The token count • represents the total number of words
in the corpus, including duplicates.
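The wordlist, type count, and token count can be sketched together. This example assumes a deliberately naive tokenizer (lowercasing and splitting on whitespace), and the sample sentence is made up; real tools apply their own token definitions.

```python
from collections import Counter

# Made-up sample text, for illustration only.
text = "the cat saw the dog and the dog saw the cat"
tokens = text.lower().split()  # naive tokenization on whitespace

word_freq = Counter(tokens)
token_count = len(tokens)      # all words, duplicates included
type_count = len(word_freq)    # distinct words only

# Wordlist sorted by descending frequency, ties broken alphabetically.
wordlist_desc = sorted(word_freq.items(), key=lambda item: (-item[1], item[0]))

print(token_count, type_count)
print(wordlist_desc)
```

Reversing the sort key gives the ascending order instead.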
AntConc Is a free, offline, and lightweight corpus processor
that we will use for this course.
POS taggers • take raw corpora as input and give the
grammatical category of each word as output.
Accuracy rate • $\text{Accuracy rate} = \dfrac{\text{number of correctly annotated words or sentences}}{\text{number of tokens}} \times 100$
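As a worked illustration, the accuracy rate is the number of correctly annotated items divided by the number of tokens, multiplied by 100. The sketch below assumes word-level annotation checked against a gold standard. The tag labels and both tag sequences are invented for this example.

```python
def accuracy_rate(gold_tags, predicted_tags):
    """Percentage of tokens whose predicted tag matches the gold-standard tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("Both sequences must cover the same tokens.")
    if not gold_tags:
        raise ValueError("At least one token is required.")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100 * correct / len(gold_tags)


# Hypothetical tags for a five-token sentence; the last tag is wrong.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted = ["DET", "NOUN", "VERB", "DET", "VERB"]
rate = accuracy_rate(gold, predicted)
print(rate)
```

Four of the five tokens are tagged correctly, so the rate is 4 / 5 × 100 = 80.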
The tagset • is the set of labels used by the part-of-speech
tagger to mark the grammatical class of the
word.
Deep Parsers • analyze the entire sentence to identify the
Subject and the Predicate.
Shallow Parsers • also work on the syntactic level, but they try to
analyze phrase structure; that is to say, they try
to identify noun, verb, adjectival, adverbial,
and prepositional phrases. They work at
the phrase level rather than the word level.
Core Frame elements • Frame elements that are essential to the
meaning of a frame are called "core" FEs (e.g.,
Speaker in frames connected with
communication); expressions of time, place, and
manner are generally not core FEs.
Frame element (FE) • a frame-specific semantic role that is the
basic unit of a frame
Frame (semantic frame) • A schematic representation of a situation
involving various participants, props, and other
conceptual roles, each of which is a frame
element
Frame semantics A descriptive framework for characterizing lexical
meaning in terms of semantic frames
Annotation The assignment of semantic role tags to syntactic
constituents
Lexical unit (LU) • a pairing of a lemma and a frame, i.e. a "word"
taken in one of its senses.
FrameNet • is an online lexical database that documents
the semantic and syntactic information of each
lexical unit (Baker et al., 1998).
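The relationship between frames, frame elements, and lexical units can be pictured as a small data structure. This is only a toy sketch: the class names, the frame name, and every element other than Speaker are invented for illustration and are not copied from FrameNet or from any FrameNet software.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameElement:
    name: str
    core: bool  # True if the element is essential to the frame's meaning


@dataclass
class Frame:
    name: str
    elements: List[FrameElement] = field(default_factory=list)

    def core_elements(self):
        """Names of the core frame elements only."""
        return [fe.name for fe in self.elements if fe.core]


@dataclass
class LexicalUnit:
    lemma: str    # the word form
    frame: Frame  # the sense, expressed as the frame it evokes


# Hypothetical toy frame; time and place are marked as non-core.
toy_communication = Frame(
    "Toy_communication",
    [
        FrameElement("Speaker", core=True),
        FrameElement("Message", core=True),
        FrameElement("Time", core=False),
        FrameElement("Place", core=False),
    ],
)
say = LexicalUnit(lemma="say", frame=toy_communication)
print(say.lemma, say.frame.core_elements())
```

Pairing the same lemma with a different frame would give a different lexical unit, which is how one word can have several senses.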
Manual annotation Means hiring human annotators to do the linguistic
annotation according to a set of guidelines.
In-lab annotation means:

• you hire expert human subjects
• you give them enough training
• you provide them with a place to work
• you keep them under your close supervision
Corpus annotation Is adding linguistic information to the corpus.
Crowdsourcing Is creating an online survey that frames the
annotation task as a question-and-answer task.
COCA • The Corpus of Contemporary American English
(COCA) is a corpus representing American
English. It comprises 520 million words of text
(20 million words for each year from 1990 to 2015) and
is equally divided among spoken, fiction,
popular magazine, newspaper, and academic
texts.
The wildcard Matches anything that occurs in the position
of the wildcard.
• Wildcards can also be used to do morphological
searches.
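A wildcard-based morphological search can be sketched as follows. This example assumes `*` stands for any run of characters and uses Python's standard `fnmatch` module as a stand-in; the wildcard syntax of a given corpus tool may differ. The function name `wildcard_search` and the sample sentence are illustrative.

```python
from fnmatch import fnmatchcase


def wildcard_search(tokens, pattern):
    """Return the tokens that match a shell-style wildcard pattern."""
    return [token for token in tokens if fnmatchcase(token, pattern)]


# Made-up sample text, for illustration only.
tokens = "she walks while he walked and they were walking to the sidewalk".split()

# A stem followed by a wildcard retrieves the inflected forms of the stem.
matches = wildcard_search(tokens, "walk*")
print(matches)
```

Here "sidewalk" is not returned because the pattern requires the word to begin with the stem. A pattern with a wildcard on both sides would include it.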
Normalized Frequency (w) = $\dfrac{C(w)}{N} \times \text{common base}$
where C(w) is the raw frequency of the given word, N
is the total number of words in the corpus, and the
common base ranges from 10 to 1,000,000 depending
on the size of the corpus.
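The normalized frequency calculation can be sketched in a few lines: the raw frequency is divided by the corpus size and multiplied by the common base. The function name, the default base of one million, and all the counts below are made up for illustration.

```python
def normalized_frequency(raw_count, corpus_size, common_base=1_000_000):
    """Raw frequency scaled to occurrences per `common_base` words."""
    if corpus_size <= 0:
        raise ValueError("The corpus must contain at least one word.")
    # Multiply first so whole-number results stay exact.
    return raw_count * common_base / corpus_size


# Hypothetical counts: the same word in two corpora of different sizes.
small_corpus = normalized_frequency(150, 500_000)    # 150 hits in 500,000 words
large_corpus = normalized_frequency(200, 2_000_000)  # 200 hits in 2,000,000 words
print(small_corpus, large_corpus)
```

The larger corpus has the higher raw frequency (200 against 150), yet the word is three times more frequent per million words in the smaller corpus. This is why frequencies are normalized before corpora of different sizes are compared.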
Raw frequency • is the number of occurrences of a word in the corpus.
The Key Word In Context (KWIC) • is the concordance function, which displays up
to 1,000 random contexts of the query word.
