Brown Corpus

The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) is an electronic collection of text
samples of American English, the first major structured corpus of varied genres. It set the bar for the scientific study
of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at
Brown University, in Rhode Island, it is a general language corpus of 500 samples of English, totaling roughly one million
words, drawn from works published in the United States in 1961.
History
In 1967, Kučera and Francis published their classic work Computational Analysis of Present-Day American English, which provided
basic statistics on what is known today simply as the Brown Corpus.[1]
The Brown Corpus was a carefully compiled selection of then-current American English, totaling about a million words drawn from a wide
variety of sources. Kučera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and
varied body of work combining elements of linguistics, psychology, statistics, and sociology. It has been very widely used in
computational linguistics, and was for many years among the most-cited resources in the field.[2]
Shortly after publication of the first lexicostatistical analysis, Boston publisher Houghton Mifflin approached Kučera to supply a
million-word, three-line citation base for its new American Heritage Dictionary. This ground-breaking dictionary, which first
appeared in 1969, was the first to be compiled using corpus linguistics for word frequency and other information.
The initial Brown Corpus had only the words themselves, plus a location identifier for each. Over the following several years part-of-
speech tags were applied. The Greene and Rubin tagging program (see under part of speech tagging) helped considerably in this, but
the high error rate meant that extensive manual proofreading was required.
The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms,
contractions, foreign words and a few other phenomena. It formed the model for many later corpora such as the
Lancaster-Oslo-Bergen Corpus (British English, also sampled from 1961) and the Freiburg-Brown Corpus of American English (FROWN)
(American English from the early 1990s).[3][4] Tagging the corpus enabled far more sophisticated statistical analysis, such as the work
programmed by Andrew Mackie, and documented in books on English grammar.[5]
One interesting result is that even for quite large samples, graphing words in order of decreasing frequency of occurrence shows a
hyperbola: the frequency of the n-th most frequent word is roughly proportional to 1/n. Thus "the" constitutes nearly 7% of the Brown
Corpus, "to" and "of" more than another 3% each; while about half the total vocabulary of about 50,000 words are hapax legomena:
words that occur only once in the corpus.[6] This simple rank-vs.-frequency relationship was noted for an extraordinary variety of
phenomena by George Kingsley Zipf (for example, see his The Psychobiology of Language), and is known as Zipf's law.
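The rank-versus-frequency relationship and the notion of a hapax legomenon can be illustrated with a minimal Python sketch. The function names and the toy token list are assumptions made for illustration; they are not part of the corpus or of any published tooling.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count, share) rows, most frequent word first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, count, count / total))
    return rows

def hapax_legomena(tokens):
    """Return the set of words that occur exactly once."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count == 1}

# Toy token list (hypothetical); the real corpus has about a million tokens.
tokens = "the cat and the dog and the bird saw the fox".split()
rows = rank_frequency(tokens)
hapaxes = hapax_legomena(tokens)

for rank, word, count, share in rows[:3]:
    # Under Zipf's law, count * rank stays roughly constant across ranks.
    print(rank, word, count, round(share, 3), count * rank)
```

On a sample this small the product of rank and count is only loosely constant; the relationship becomes visible on corpus-sized input.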
Although the Brown Corpus pioneered the field of corpus linguistics, typical corpora today (such as the Corpus of Contemporary
American English, the British National Corpus or the International Corpus of English) tend to be much larger, on the order of 100
million words.
Sample distribution
The Corpus consists of 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of
those genres. All works sampled were published in 1961; as far as could be determined they were first published then, and were
written by native speakers of American English.
Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence
boundary after 2,000 words. In a very few cases miscounts led to samples being just under 2,000 words.
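The sampling rule described above can be sketched in Python as follows. This is an illustration only: the function name draw_sample is hypothetical, words are counted by whitespace splitting, and the source text is assumed to be already split into sentences, none of which is stated in the original description.

```python
import random

def draw_sample(sentences, target_words=2000, rng=None):
    """Start at a random sentence boundary and take whole sentences until
    the running word count first reaches target_words."""
    rng = rng or random.Random()
    start = rng.randrange(len(sentences))  # index of the first sentence taken
    sample, words = [], 0
    for sentence in sentences[start:]:
        sample.append(sentence)
        words += len(sentence.split())  # assumption: whitespace-delimited words
        if words >= target_words:
            break  # first sentence boundary at or after the target
    return start, sample, words

# Hypothetical source text: 200 short sentences, 500 words in total.
doc = ["a b c.", "d e.", "f g h i.", "j."] * 50
start, sample, words = draw_sample(doc, target_words=20, rng=random.Random(1))
print(start, len(sample), words)
```

If the random start falls too close to the end of the source, this sketch returns a sample shorter than the target; how the compilers handled that case is not described here.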
The original data entry was done on upper-case only keypunch machines; capitals were indicated by a preceding asterisk, and
various special items such as formulae also had special codes.
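A minimal Python sketch of reading such text back might look like this. It assumes that an asterisk capitalises only the single letter that follows it, and it ignores the special codes for formulae and other items; the actual coding scheme may differ in detail, and the input line is constructed for illustration.

```python
import re

def decode_keypunch(text):
    """Lower-case upper-case-only text, then capitalise any letter that
    directly follows an asterisk and drop the asterisk."""
    lowered = text.lower()
    # Assumption: "*" marks exactly one capital letter.
    return re.sub(r"\*([a-z])", lambda m: m.group(1).upper(), lowered)

# Hypothetical encoded line, not copied from the corpus files.
print(decode_keypunch("*THE *FULTON *COUNTY *GRAND *JURY SAID *FRIDAY"))
# prints: The Fulton County Grand Jury said Friday
```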
The corpus originally (1961) contained 1,014,312 words sampled from 15 text categories:
A. PRESS: Reportage (44 texts)

Political
Sports
Society
Spot News
Financial
Cultural
B. PRESS: Editorial (27 texts)
Institutional Daily
Personal
Letters to the Editor
C. PRESS: Reviews (17 texts)
Theatre
Books
Music
Dance
D. RELIGION (17 texts)
Books
Periodicals
Tracts
E. SKILL AND HOBBIES (36 texts)
Books
Periodicals
F. POPULAR LORE (48 texts)
Books
Periodicals
G. BELLES-LETTRES - Biography, Memoirs, etc. (75 texts)
Books
Periodicals
H. MISCELLANEOUS: US Government & House Organs (30 texts)
Government Documents
Foundation Reports
Industry Reports
College Catalog
Industry House organ
J. LEARNED (80 texts)
Natural Sciences
Medicine
Mathematics
Social and Behavioral Sciences
Political Science, Law, Education
Humanities
Technology and Engineering
K. FICTION: General (29 texts)
Novels
Short Stories
L. FICTION: Mystery and Detective Fiction (24 texts)
Novels
Short Stories
M. FICTION: Science (6 texts)
Novels
Short Stories
N. FICTION: Adventure and Western (29 texts)
Novels
Short Stories
P. FICTION: Romance and Love Story (29 texts)
Novels
Short Stories
R. HUMOR (9 texts)
Novels
Essays, etc.
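The per-category text counts listed above can be checked against the stated totals with a short Python sketch. The counts and the word total are taken from the list and the sentence preceding it; the variable names are illustrative.

```python
# Number of texts per category, keyed by the category letter used above.
texts_per_category = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 36,
    "F": 48, "G": 75, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

total_texts = sum(texts_per_category.values())

# Share of the corpus contributed by each category.
shares = {cat: n / total_texts for cat, n in texts_per_category.items()}

# The five fiction categories (K, L, M, N, P) taken together.
fiction_texts = sum(texts_per_category[c] for c in "KLMNP")

# Mean sample length implied by the stated word total.
mean_words = 1_014_312 / total_texts

print(total_texts, len(texts_per_category), fiction_texts, round(mean_words, 1))
# prints: 500 15 117 2028.6
```

The mean of about 2,029 words per text is consistent with samples that run to the first sentence boundary after 2,000 words.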
Part-of-speech tags used

Tag Definition
. sentence closer (. ; ? *)
( left paren
) right paren
* not, n't
-- dash
, comma
: colon
ABL pre-qualifier (quite, rather)
ABN pre-quantifier (half, all)
ABX pre-quantifier (both)
AP post-determiner (many, several, next)
AT article (a, the, no)
BE be
BED were
BEDZ was
BEG being
BEM am
BEN been
BER are, art
BEZ is
CC coordinating conjunction (and, or)
CD cardinal numeral (one, two, 2, etc.)
CS subordinating conjunction (if, although)
DO do
DOD did
DOZ does
DT singular determiner/quantifier (this, that)
DTI singular or plural determiner/quantifier (some, any)
DTS plural determiner (these, those)
DTX determiner/double conjunction (either)
EX existential there
FW foreign word (hyphenated before regular tag)
HL word occurring in the headline (hyphenated after regular tag)
HV have
HVD had (past tense)
HVG having
HVN had (past participle)
HVZ has
IN preposition
JJ adjective
JJR comparative adjective
JJS semantically superlative adjective (chief, top)
JJT morphologically superlative adjective (biggest)
MD modal auxiliary (can, should, will)
NC cited word (hyphenated after regular tag)
NN singular or mass noun
NN$ possessive singular noun
NNS plural noun
NNS$ possessive plural noun
NP proper noun or part of name phrase
NP$ possessive proper noun
NPS plural proper noun
NPS$ possessive plural proper noun
NR adverbial noun (home, today, west)
NRS plural adverbial noun
OD ordinal numeral (first, 2nd)
PN nominal pronoun (everybody, nothing)
PN$ possessive nominal pronoun
PP$ possessive personal pronoun (my, our)
PP$$ second (nominal) possessive pronoun (mine, ours)
PPL singular reflexive/intensive personal pronoun (myself)
PPLS plural reflexive/intensive personal pronoun (ourselves)
PPO objective personal pronoun (me, him, it, them)
PPS 3rd. singular nominative pronoun (he, she, it, one)
PPSS other nominative personal pronoun (I, we, they, you)
QL qualifier (very, fairly)
QLP post-qualifier (enough, indeed)
RB adverb
RBR comparative adverb
RBT superlative adverb
RN nominal adverb (here, then, indoors)
RP adverb/particle (about, off, up)
TL word occurring in title (hyphenated after regular tag)
TO infinitive marker to
UH interjection, exclamation
VB verb, base form
VBD verb, past tense
VBG verb, present participle/gerund
VBN verb, past participle
VBP verb, non 3rd person, singular, present
VBZ verb, 3rd. singular present
WDT wh- determiner (what, which)
WP$ possessive wh- pronoun (whose)
WPO objective wh- pronoun (whom, which, that)
WPS nominative wh- pronoun (who, which, that)
WQL wh- qualifier (how)
WRB wh- adverb (how, where, when)
Note that some versions of the tagged Brown corpus contain combined tags. For instance, the word "wanna" is tagged VB+TO, since
it is a contracted form of the two words want/VB and to/TO. Some tags may also be negated: "aren't" would be tagged
"BER*", where * signifies the negation. Additionally, tags may have hyphenations: the tag -HL is hyphenated to the regular tags of
words in headlines, and the tag -TL is hyphenated to the regular tags of words in titles. The hyphenation -NC signifies an emphasized
word. A tag may also carry an FW- prefix, which marks a foreign word.
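The conventions above (combination with +, negation with *, the -HL, -TL and -NC hyphenations, and the FW- prefix) can be unpacked with a small Python sketch. The function parse_tag and its return format are hypothetical, and the sketch covers only the conventions described in this section, not every form found in the distributed files.

```python
SUFFIXES = {"-HL": "headline", "-TL": "title", "-NC": "cited"}

def parse_tag(tag):
    """Split a tag into a foreign flag, hyphenated markers, and its
    +-joined parts, each paired with a negation flag."""
    info = {"foreign": False, "markers": [], "parts": []}
    if tag.startswith("FW-"):
        info["foreign"] = True
        tag = tag[3:]
    stripped = True
    while stripped:  # a tag may carry more than one hyphenated marker
        stripped = False
        for suffix, label in SUFFIXES.items():
            if tag.endswith(suffix) and len(tag) > len(suffix):
                info["markers"].append(label)
                tag = tag[:-len(suffix)]
                stripped = True
    for part in tag.split("+"):
        # A lone "*" is itself the tag for "not"/"n't", so it is not negated.
        negated = part.endswith("*") and len(part) > 1
        info["parts"].append((part[:-1] if negated else part, negated))
    return info

for example in ["VB+TO", "BER*", "NN-HL", "FW-NN-TL", "--", "*"]:
    print(example, parse_tag(example))
```

Suffixes are matched explicitly rather than by splitting on every hyphen, so that the dash tag "--" is left intact.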
See also
LOB Corpus, a corpus of British English based on the same parameters as the Brown Corpus
British National Corpus
References
1. Francis, W. Nelson & Henry Kučera. 1967. Computational Analysis of Present-Day American English. Providence, RI: Brown
University Press.
2. Francis, W. Nelson & Henry Kučera. 1979. Brown Corpus Manual: Manual of Information to Accompany a Standard Corpus of
Present-Day Edited American English for Use with Digital Computers. http://icame.uib.no/brown/bcm.html
3. Hundt, Marianne, Andrea Sand & Rainer Siemund. 1998. Manual of Information to Accompany the Freiburg-Brown Corpus of
American English (FROWN). http://khnt.hit.uib.no/icame/manuals/frown/INDEX.HTM
4. Leech, Geoffrey & Nicholas Smith. 2005. Extending the possibilities of corpus-based research on English in the twentieth century: A
prequel to LOB and FLOB. ICAME Journal 29. 83–98.
5. Winthrop Nelson Francis and Henry Kučera. 1983. Frequency Analysis of English Usage: Lexicon and Grammar, Houghton Mifflin.
6. Kirsten Malmkjær, The Linguistics Encyclopedia, 2nd ed, Routledge, 2002, ISBN 0-415-22210-9, p. 87.
External links
Brown Corpus Manual
Download the Brown Corpus
Search in the Brown Corpus Annotated by the TreeTagger v2
More details on the Brown Corpus tagset
Python software for convenient access to the Brown Corpus
PHP (Part Of Speech Tagging)
Retrieved from "https://en.wikipedia.org/w/index.php?title=Brown_Corpus&oldid=974903320"
Content is available under CC BY-SA 3.0 unless otherwise noted.
