What Is A Corpus

What is a Corpus?

The word "corpus", derived from the Latin word meaning "body", may be used to refer to any text in
written or spoken form. However, in modern linguistics the term refers to a large collection of
texts, held in machine-readable form, that represents a sample of a particular variety or use of
language(s). Other definitions, broader or stricter, exist. See, for example, the definition in
the book "Corpus Linguistics" by Tony McEnery and Andrew Wilson, or read more about different kinds
of corpora in the Systematic Dictionary of Corpus Linguistics.

Computer-readable corpora can consist of raw text only, i.e. plain text with no additional information.
Many corpora have been provided with some kind of linguistic information, here called mark-up or
annotation.
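
The difference between raw text and annotated text can be illustrated with a small sketch. The sentence, the "word/TAG" notation, and the tag names below are hand-written assumptions chosen for illustration; they are not taken from any particular corpus or annotation scheme, and the helper functions are hypothetical.

```python
# Raw corpus: plain text with no additional information.
raw_corpus = "The cat sat on the mat ."

# Annotated corpus: the same tokens, each followed by a hand-assigned
# part-of-speech label (illustrative tag names, not a real tag set).
annotated_corpus = "The/DET cat/NOUN sat/VERB on/ADP the/DET mat/NOUN ./PUNCT"


def parse_annotated(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last slash so a word containing '/' stays intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def strip_annotation(text):
    """Recover the raw text by discarding the tags."""
    return " ".join(word for word, _ in parse_annotated(text))


pairs = parse_annotated(annotated_corpus)
nouns = [word for word, tag in pairs if tag == "NOUN"]
```

Here the annotation lets a program select words by category (for example, all tokens labelled as nouns), which the raw text alone does not support. Stripping the labels gives back the raw text.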

Types of corpora

There are many different kinds of corpora. They can contain written or spoken (transcribed) language,
modern or old texts, and texts from one language or several languages. The texts can be whole books,
newspapers, journals, speeches, etc., or consist of extracts of varying length. The kinds of texts
included and the way they are combined vary between corpora and corpus types.

'General corpora' consist of general texts, that is, texts that do not belong to a single text type,
subject field, or register. An example of a general corpus is the British National Corpus. Some corpora
contain texts that are sampled (chosen) from a particular variety of a language, for example, from a
particular dialect or from a particular subject area. These corpora are sometimes called 'Sublanguage Corpora'.

Corpora can consist of texts in one language (or language variety) only or of texts in more than one
language. If the texts are the same in all languages, i.e. translations, the corpus is called a Parallel
Corpus. A Comparable Corpus is a collection of "similar" text
