Corpus (Sinclair) – a collection of pieces of language that are selected and ordered according
to explicit linguistic criteria in order to be used as a sample of the language.

Pieces of language (Sinclair) - the term used to describe the components of a corpus.

A corpus (plural corpora) or text corpus is a large and structured set of texts (now usually
electronically stored and processed). Corpora are used for statistical analysis and hypothesis
testing, for checking occurrences, and for validating linguistic rules within a specific universe.
A corpus is an artifact!
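
As a rough illustration of "checking occurrences", the sketch below counts word forms in a tiny invented corpus. The three sentences, the `texts` list and the `tokenize` helper are assumptions made up for this example; a real corpus would be far larger and would need a more careful tokenizer.

```python
from collections import Counter
import re

# Hypothetical toy corpus: three short texts standing in for a real collection.
texts = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A cat is not a dog.",
]

def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Count how often each word form occurs across the whole corpus.
frequencies = Counter(token for text in texts for token in tokenize(text))

print(frequencies["cat"])          # occurrences of one word form
print(frequencies.most_common(3))  # the most frequent word forms
```

Counts of this kind are the raw material for the statistical analysis and hypothesis testing mentioned above.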

A corpus can be divided into subcorpora. A subcorpus has all the properties of a corpus but
happens to be part of a larger corpus. Corpora and subcorpora are divided into components. A
component is not necessarily an adequate sample of a language and in that way it is distinct
from a corpus and a subcorpus. A component is a collection of pieces of language that are selected and
ordered according to a set of linguistic criteria that serve to characterize its linguistic
homogeneity. Whereas a corpus, and to some extent a subcorpus, may illustrate heterogeneity,
a component illustrates a particular type of language. What are called sublanguages
are components in this definition, but there are other restrictions on sublanguages which will be
dealt with later.
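
The corpus / subcorpus / component hierarchy described above can be sketched as a small data structure. This is only one possible modelling, not Sinclair's own formalization: the class names `Component` and `Corpus`, the `all_texts` method and the sample texts are all hypothetical, and a subcorpus is represented simply as a `Corpus` nested inside another `Corpus`.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    """A linguistically homogeneous set of texts (one particular type of language)."""
    name: str
    texts: List[str] = field(default_factory=list)

@dataclass
class Corpus:
    """A corpus; a subcorpus is modelled as a Corpus nested inside another Corpus."""
    name: str
    components: List[Component] = field(default_factory=list)
    subcorpora: List["Corpus"] = field(default_factory=list)

    def all_texts(self) -> List[str]:
        """Collect the texts of this corpus's components and of all its subcorpora."""
        texts = [t for c in self.components for t in c.texts]
        for sub in self.subcorpora:
            texts.extend(sub.all_texts())
        return texts

# Hypothetical example data, invented for illustration only.
news = Component("news", ["Parliament voted today.", "Markets fell sharply."])
legal = Component("legal", ["The party of the first part agrees."])
written = Corpus("written", components=[news, legal])
spoken = Corpus("spoken", components=[Component("interviews", ["So, tell me more."])])
general = Corpus("general", subcorpora=[written, spoken])

print(len(general.all_texts()))
```

Here each component holds one type of language, while the enclosing corpora gather several types and so can be heterogeneous.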

A comparable corpus is one which selects similar texts in more than one language or variety.

A parallel corpus is a bi- or multilingual corpus that contains one set of texts in two or more

