Corpus Definitions. Last Year


The term The definition

Corpus-driven Studies Do not start from a prior hypothesis or theory.
Instead, they start from a question that comes out of the blue,
and the corpus is used to find an answer to it.

Corpus-based Studies You start from a previous theory or hypothesis and try
to test it against a corpus.

A corpus Is a principled and large collection of authentic texts
that are stored in a computer and analyzed using
software designed for corpus analysis.

Corpus Linguistics (CL) Is the quantitative approach to linguistic analysis.

Authenticity Means that the texts in your collection are generated
by native speakers in natural communicative settings.

Corpus Is a collection of texts produced by native speakers.
That collection is used for linguistic analysis.
A Learner corpus Is compiled from the writings of language learners for
pedagogical purposes.

Comparative corpora Are monolingual corpora. They comprise different
varieties of the same language.

Comparable corpora Comprise bilingual or multilingual texts that cover the same
topics in multiple languages; however, the texts are not
exact translations of one another. One well-known
comparable corpus is the Wikipedia Corpus.

Parallel corpora Comprise texts that are direct translations of one
another. One good example is the UN Corpus. It is
multilingual, as it represents the six official UN
languages, and the texts are exact translations.

A multilingual corpus Comprises three or more different languages.

A bilingual corpus Comprises exactly two different languages.
A monolingual corpus Comprises one language and even one variety of that
language. Classic examples are the British National
Corpus or COCA.

Static corpora Are not updated once the researcher has finished
compiling them. Most corpora are static. For
example, COCA stopped in 2017.

Dynamic corpora Grow over time because they are
continuously updated. In other words, new texts are
added to them regularly.

An annotated corpus Is a corpus that has undergone some sort of
linguistic analysis. An example of an annotated corpus
is the Quranic Arabic Corpus.

A raw corpus Is a corpus without any linguistic analysis; it contains only plain
text. One example is the Charles Dickens Corpus from
Project Gutenberg.

Lexicography Is the industry of making dictionaries.


A synchronic corpus Covers one period of time, which can be either
past or present.

A diachronic (or historical) corpus In contrast, comprises texts from more than one period of
time.
A specialized corpus Is a corpus that is specific to one language form and
one text genre.

A general corpus Is a corpus that covers:
• more than one form of language: spoken
and written
• more than one text genre: conversations,
song lyrics, novels, etc.
• more than one era: 1990s, 2000s, 2010s, etc.
Corpus Linguistics Is a quantitative, descriptive, and experimental field.
Corpus Is a collection of real-world texts.
Text archives These are Web-based repositories of fiction and non-
fiction texts that are scanned from the original
sources or typed in by volunteers.
Non-adjacent collocations • These are separable collocations, such as
give it up.

Adjacent collocations • These are inseparable collocations, such
as kick the bucket.

Threshold option • This option sets the minimum number of times
two words must co-occur to be considered a
collocation.

Collocations • Are words that frequently occur together, but
at least one of the words must be a content word.
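
The threshold and content-word conditions can be sketched in a few lines of Python. This is only an illustration, not how AntConc or any other tool computes collocations: the function name `adjacent_collocations`, the tiny `FUNCTION_WORDS` list, and the sample sentence are all assumptions made for the example, and only adjacent pairs are counted.

```python
from collections import Counter

# Tiny illustrative list of function words (an assumption; real tools use longer lists).
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "it", "is"}

def adjacent_collocations(tokens, threshold=2):
    """Return adjacent word pairs that co-occur at least `threshold` times
    and contain at least one content word."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: count
        for pair, count in pair_counts.items()
        if count >= threshold
        and any(word not in FUNCTION_WORDS for word in pair)
    }

# Made-up sample text, already lowercased and split on spaces.
tokens = "strong tea is nice and strong tea is cheap in the shop in the end".split()
result = adjacent_collocations(tokens, threshold=2)
print(result)
```

In this sample, "in the" also occurs twice but is dropped because both words are function words; raising `threshold` to 3 would leave no pairs at all.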

Concordance Plot Is a visualization function. It allows you to see the
distribution of your query across the different parts of the
file.
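
The idea behind a concordance plot can be approximated as a one-line text chart. The sketch below is a rough illustration under stated assumptions, not a reproduction of any tool's plot: `plot_line` is a hypothetical helper that divides the file into `width` equal slots and marks each slot that contains at least one hit.

```python
def plot_line(tokens, query, width=20):
    """Draw a one-line text plot: '|' marks a slot with at least one hit,
    '-' marks a slot with none."""
    slots = ["-"] * width
    total = len(tokens)
    for i, tok in enumerate(tokens):
        if tok.lower() == query.lower():
            # Integer arithmetic maps token position i to a slot in 0..width-1.
            slots[i * width // total] = "|"
    return "".join(slots)

# Made-up ten-token "file" in which the query occurs at the start, near the start, and at the end.
sample = "a b a c d e f g h a".split()
plot = plot_line(sample, "a", width=10)
print(plot)  # |-|------|
```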
Concordance • Gives you a list of all the contexts in which
the query occurs in the corpus.
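
A concordance can be sketched as a function that collects the words to the left and right of every hit. This is a minimal illustration; the name `kwic`, the `window` parameter, and the sample sentence are assumptions for the example, not part of any particular tool.

```python
def kwic(tokens, query, window=3):
    """Return (left context, hit, right context) for every occurrence of `query`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == query.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Made-up sample text split on spaces.
text = "the cat sat on the mat and the dog sat down".split()
hits = kwic(text, "sat", window=2)
for left, hit, right in hits:
    print(f"{left:>10}  {hit}  {right}")
```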

Grams • Are words, and (n) stands for the number of
words: uni = one, bi = two, tri = three, quad =
four, etc.
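
Extracting n-grams amounts to sliding a window of n words over the text. The following is a minimal sketch; the function name `ngrams` and the sample sentence are chosen for illustration only.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample text split on spaces.
words = "to be or not to be".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)
print(trigrams)
```

A text of six words yields five bigrams and four trigrams; if n is larger than the text, the list is empty.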

The Wordlist function • gives you a list of word frequencies that can be
sorted in an ascending or a descending order.

The type count • Represents the number of distinct words,
that is, the total without duplicates.

The token count • Represents the total number of words
in the corpus, including duplicates.
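
The word list, type count, and token count can be illustrated on a six-word text. This is a sketch only: `word_list` is a hypothetical helper, tokens are assumed to be already lowercased and split, and ties in frequency are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

def word_list(tokens, descending=True):
    """Return (word, frequency) pairs sorted by frequency, then alphabetically."""
    counts = Counter(tokens)
    if descending:
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))

# Made-up sample text split on spaces.
tokens = "to be or not to be".split()
token_count = len(tokens)      # every word, duplicates included
type_count = len(set(tokens))  # distinct words only
frequencies = word_list(tokens)
print(token_count, type_count, frequencies)
```

Here the token count is 6 and the type count is 4, because "to" and "be" each occur twice.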
AntConc Is a free, offline, and lightweight corpus processor
that we will use for this course.
POS taggers • They take raw corpora as input and give the
grammatical category of each word as output.

Accuracy rate • \( \text{Accuracy rate} = \left( \dfrac{\text{number of correctly annotated words or sentences}}{\text{number of tokens}} \right) \times 100 \)
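
As a worked example of the accuracy rate, the sketch below compares a tagger's output with a reference annotation. The function name `accuracy_rate`, the tag labels, and the five-word sample are all invented for illustration.

```python
def accuracy_rate(predicted_tags, gold_tags):
    """Percentage of tokens whose predicted tag matches the reference (gold) tag."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("both sequences must cover the same tokens")
    if not gold_tags:
        raise ValueError("cannot compute accuracy for an empty sequence")
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return 100 * correct / len(gold_tags)

# Made-up tags for a five-token sentence; the last prediction is wrong.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted = ["DET", "NOUN", "VERB", "DET", "VERB"]
rate = accuracy_rate(predicted, gold)
print(rate)  # 80.0
```

Four of the five tokens are tagged correctly, so the accuracy rate is (4 / 5) × 100 = 80%.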

The tagset • Is the set of labels used by the part-of-speech
tagger to mark the grammatical class of the
word.

Deep Parsers • They analyze the entire sentence to identify the
Subject and the Predicate.

Shallow Parsers • Also work on the syntactic level, but they try to
analyze phrase structure; that is to say, they try
to identify noun, verb, adjectival, adverbial,
and prepositional phrases. They do not work at
the word level but at the phrase level.
Core Frame elements • Frame elements that are essential to the
meaning of a frame are called "core" FEs (e.g.,
Speaker in frames connected with
communication); expressions of time, place, and
manner are generally not core FEs.

Frame element (FE) • A frame-specific semantic role that is the
basic unit of a frame.

Frame (semantic frame) • A schematic representation of a situation
involving various participants, props, and other
conceptual roles, each of which is a frame
element.

Frame semantics • A descriptive framework for characterizing lexical
meaning in terms of semantic frames.
Annotation The assignment of semantic role tags to syntactic
constituents.
Lexical unit (LU) • A pairing of a lemma and a frame, i.e., a "word"
taken in one of its senses.

FrameNet • Is an online lexical database that documents
the semantic and syntactic information of each
lexical unit (Baker et al., 1998).

Manual annotation Means hiring human annotators to do the linguistic
annotation according to a set of guidelines.

In-lab annotation Means that:
• you hire expert human annotators
• you give them enough training
• you provide them with a place to work
• you keep them under your close supervision

Corpus annotation Is adding linguistic information to the corpus.

Crowdsourcing Is creating an online survey and framing the
annotation task as a question-and-answer task.
COCA • The Corpus of Contemporary American English
(COCA): It is a corpus representing American
English. It comprises 520 million words of text
(20 million words each year 1990 – 2015) and it
is equally divided among spoken, fiction,
popular magazines, newspapers, and academic
texts.

The wildcard Matches anything: any string that occupies the position
of the wildcard in the query.
• Wildcards can also be used to do morphological
searches, for example to find all forms of a word that share a stem.
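
A wildcard-based morphological search can be imitated with Python's standard `fnmatch` module, in which `*` matches any sequence of characters. This is a sketch under assumptions: the wildcard syntax of a given corpus interface may differ from Python's, and `wildcard_search` and the sample sentence are hypothetical.

```python
from fnmatch import fnmatchcase

def wildcard_search(tokens, pattern):
    """Return the tokens that match a pattern in which * stands for anything."""
    return [tok for tok in tokens if fnmatchcase(tok.lower(), pattern.lower())]

# Made-up sample text split on spaces.
sentence = "She walks while he walked and they are walking to the sidewalk".split()
stem_hits = wildcard_search(sentence, "walk*")   # all forms starting with "walk"
suffix_hits = wildcard_search(sentence, "*ing")  # all words ending in "-ing"
print(stem_hits)
print(suffix_hits)
```

The pattern `walk*` finds "walks", "walked", and "walking" but not "sidewalk", because the wildcard only stands in for whatever follows the stem.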

Normalized Frequency \( \mathit{NF}(w) = \dfrac{C(w)}{N} \times \text{common base} \)
where C(w) is the raw frequency of the given word, N
is the total number of words in the corpus, and the
common base ranges from 10 to 1,000,000 depending
on the size of the corpus.
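
As a worked example of normalized frequency, the sketch below uses invented figures: a word with a raw frequency of 1,040 in a corpus of 520 million words, normalized per million words. The function name `normalized_frequency` and its parameter names are assumptions for illustration.

```python
import math

def normalized_frequency(raw_count, corpus_size, common_base=1_000_000):
    """Occurrences per `common_base` words: C(w) / N * common base."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Multiplying before dividing keeps the arithmetic exact for whole numbers.
    return raw_count * common_base / corpus_size

# Invented figures: C(w) = 1,040 and N = 520,000,000.
per_million = normalized_frequency(1_040, 520_000_000)
print(per_million)  # 2.0
```

Normalizing makes frequencies comparable across corpora of different sizes: 1,040 hits in 520 million words and 2 hits in 1 million words both come to 2 per million.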
Raw frequency • is the number of occurrences in the corpus.

The Key Word In Context (KWIC) • Is the concordance function, which displays up
to 1,000 random contexts of the query word.
