De Inv
A corpus (plural corpora), or text corpus, is a large and structured set of texts (now usually
stored and processed electronically). Corpora are used for statistical analysis and hypothesis
testing, for checking occurrences, and for validating linguistic rules within a specific universe.
A corpus is an artifact!
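As a rough illustration of the "checking occurrences" use mentioned above, the following minimal sketch counts word frequencies over a tiny invented corpus. The sample sentences, the simple regular-expression tokenizer, and the function names are all assumptions made for illustration; real corpus work would use a far larger text collection and a proper tokenizer.

```python
import re
from collections import Counter

# Hypothetical toy corpus: a real corpus would be much larger.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A dog is not a cat.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


freqs = word_frequencies(corpus)
print(freqs.most_common(3))
```

Counts like these are the raw material for the statistical analysis and hypothesis testing described above, for example checking whether a word is attested at all, or how often.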
A corpus can be divided into subcorpora. A subcorpus has all the properties of a corpus but
happens to be part of a larger corpus. Corpora and subcorpora are divided into components. A
component is not necessarily an adequate sample of a language and in that way it is distinct
from a corpus and a subcorpus. It is a collection of pieces of language that are selected and
ordered according to a set of linguistic criteria that serve to characterize its linguistic
homogeneity. Whereas a corpus may illustrate heterogeneity, and also a subcorpus to some
extent, the component illustrates a particular type of language. What are called sublanguages
are components in this definition, but there are other restrictions on sublanguages which will be
dealt with later.
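The hierarchy described above (corpus, subcorpus, component) could be modeled along the following lines. This is only a sketch under stated assumptions: the class names, the `criteria` dictionary, and the example data are hypothetical and do not come from the source. A subcorpus is represented by the same class as a corpus, reflecting the statement that it has all the properties of a corpus.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A linguistically homogeneous collection of texts."""
    criteria: dict  # hypothetical selection criteria, e.g. {"genre": "weather report"}
    texts: list = field(default_factory=list)


@dataclass
class Corpus:
    """A corpus; a subcorpus is simply a Corpus nested inside another one."""
    name: str
    components: list = field(default_factory=list)
    subcorpora: list = field(default_factory=list)

    def all_texts(self):
        """Collect texts from this corpus's components and, recursively, its subcorpora."""
        texts = [t for c in self.components for t in c.texts]
        for sub in self.subcorpora:
            texts.extend(sub.all_texts())
        return texts


# Invented example data.
weather = Component({"genre": "weather report"}, ["Rain is expected tomorrow."])
recipes = Component({"genre": "recipe"}, ["Stir the sauce.", "Add salt."])
written = Corpus("written", components=[weather, recipes])
whole = Corpus("general", subcorpora=[written])

print(len(whole.all_texts()))
```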
A comparable corpus is one which selects similar texts in more than one language or variety.
A parallel corpus is a bi- or multilingual corpus that contains one set of texts in two or more
languages.
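One way to picture a parallel corpus is as a list of segments, each holding the same text in several languages. The sketch below assumes segment-level alignment, which the definition above does not require; the `ParallelSegment` class, the `aligned_pairs` helper, and the sample sentences are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ParallelSegment:
    """One unit of text together with its versions in several languages."""
    versions: dict  # language code -> text of this segment in that language


# Invented example data.
parallel_corpus = [
    ParallelSegment({"en": "Good evening.", "fr": "Bonsoir.", "de": "Guten Abend."}),
    ParallelSegment({"en": "Thank you.", "fr": "Merci.", "de": "Danke."}),
]


def aligned_pairs(corpus, source, target):
    """Return (source, target) text pairs for segments available in both languages."""
    return [
        (seg.versions[source], seg.versions[target])
        for seg in corpus
        if source in seg.versions and target in seg.versions
    ]


print(aligned_pairs(parallel_corpus, "en", "fr"))
```

A comparable corpus, by contrast, would hold separate texts per language or variety, chosen by similar criteria but with no such segment-by-segment correspondence.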