
Empirical Laws

Pawan Goyal

CSE, IITKGP

Week 1: Lecture 4



Function Words vs. Content Words

Function words have little lexical meaning but serve as important elements to
the structure of sentences.

Example
The winfy prunkilmonger from the glidgement mominkled and brangified
all his levensers vederously.
Glop angry investigator larm blonk government harassed gerfritz
infuriated sutbor pumrog listeners thoroughly.

In the first sentence the content words have been replaced by nonsense words,
while the function words (the, from, and, all, his) are kept; in the second,
the function words have been replaced, while the content words (angry,
investigator, government, harassed, infuriated, listeners, thoroughly) are kept.


Function words are closed-class words


prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles,
particles etc.



Most Common Words in Tom Sawyer

[Table: the most frequent word types in Tom Sawyer with their frequencies;
nearly all are function words]


The list is dominated by the little words of English, having important
grammatical roles.

These are usually referred to as function words, such as determiners,
prepositions, complementizers etc.
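As an illustration (not part of the original slides), such a frequency list can be produced with a few lines of Python; the file name tom_sawyer.txt and the crude regex tokenizer are assumptions made only for this sketch.

```python
# A minimal sketch (not from the slides): listing the most frequent word types
# in a text. The file name "tom_sawyer.txt" and the regex tokenizer are
# assumptions for illustration only.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z']+", f.read().lower())

for word, freq in Counter(tokens).most_common(10):
    print(f"{word:>6}  {freq}")   # the top of the list is dominated by function words
```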


The one really exceptional word is Tom, whose frequency reflects the text
chosen.

How many words are there in this text?


Type vs. Tokens


Type-Token distinction
The type-token distinction separates a concept from the objects that are
particular instances of the concept.


Type/Token Ratio
The type/token ratio (TTR) is the ratio of the number of different words
(types) to the number of running words (tokens) in a given text or corpus.
This index indicates how often, on average, a new ‘word form’ appears in
the text or corpus.

A low TTR indicates a tendency to reuse the same word forms; a high TTR
indicates a more varied vocabulary.
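A minimal sketch (not from the slides) of how TTR is computed, assuming the text has already been tokenized into a list of word strings:

```python
# A minimal sketch (not from the slides): the type/token ratio of a tokenized
# text, where `tokens` is assumed to be a list of word tokens.
def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by the number of running words (tokens)."""
    return len(set(tokens)) / len(tokens)

# e.g. the figures quoted below: 8018 / 71370 ≈ 0.112 (Tom Sawyer),
#      29066 / 884647 ≈ 0.032 (complete works of Shakespeare)
```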
Comparison Across Texts

Mark Twain’s Tom Sawyer


71,370 word tokens
8,018 word types
TTR = 0.112

Complete works of Shakespeare


884,647 word tokens
29,066 word types
TTR = 0.032



Empirical Observations on Various Texts

Comparing Conversation, academic prose, news, fiction


TTR scores the lowest value (tendency to use the same words) in
conversation.
TTR scores the highest value (tendency to use different words) in news.
Academic prose writing has the second lowest TTR.

Not a valid measure of ‘text complexity’ by itself


The value varies with the size of the text.
For a valid measure, a running average is computed on consecutive
1000-word chunks of the text.
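A minimal sketch (not from the slides) of this running average; the 1000-word chunk size follows the slide, while the tokenization is assumed to be done elsewhere:

```python
# A minimal sketch (not from the slides): average TTR over consecutive
# 1000-word chunks, so that texts of different lengths become comparable.
# `tokens` is assumed to be a pre-tokenized list of words.
def running_average_ttr(tokens, chunk_size=1000):
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / len(chunk))
    return sum(ratios) / len(ratios) if ratios else 0.0
```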


Word Distribution from Tom Sawyer

TTR = 0.11 ⇒ words occur on average 9 times each.

But words have a very uneven distribution.

Most words are rare


3993 (50%) word types appear only once
They are called hapax legomena (Greek for 'read only once')

But common words are very common


100 words account for 51% of all tokens in the text.
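A minimal sketch (not from the slides) of how these statistics can be reproduced for any plain-text file; the file name and the regex tokenizer are assumptions:

```python
# A minimal sketch (not from the slides): type/token counts, hapax legomena,
# and the coverage of the 100 most frequent types for a plain-text file.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(tokens)
hapax = sum(1 for c in counts.values() if c == 1)
top100 = sum(c for _, c in counts.most_common(100))

print(f"types: {len(counts)}, tokens: {len(tokens)}, TTR: {len(counts)/len(tokens):.3f}")
print(f"hapax legomena: {hapax} ({hapax/len(counts):.0%} of types)")
print(f"top 100 types cover {top100/len(tokens):.0%} of tokens")
```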


Zipf’s Law

Count the frequency of each word type in a large corpus


List the word types in decreasing order of their frequency
(e.g. 'the' with frequency 10,000 gets rank 1, 'of' with frequency 6,000 gets
rank 2, and so on)
Zipf’s Law
A relationship between the frequency of a word (f ) and its position in the list
(its rank r).

f ∝ 1/r
or, there is a constant k such that

f · r = k

i.e. the 50th most common word should occur with 3 times the frequency of
the 150th most common word.
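A minimal sketch (not from the slides) of how this can be checked, assuming `tokens` is a list of word tokens from a reasonably large corpus:

```python
# A minimal sketch (not from the slides): checking that f * r stays roughly
# constant across ranks. `tokens` is assumed to be a list of word tokens
# from a corpus with at least a few thousand word types.
from collections import Counter

freqs = [count for _, count in Counter(tokens).most_common()]
for rank in (50, 150, 500, 1500):
    f = freqs[rank - 1]
    print(f"rank {rank:>5}: f = {f:>6}, f*r = {f * rank}")
# If the law holds, the f*r column is roughly constant and the rank-50 word
# is about three times as frequent as the rank-150 word.
```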



Zipf’s Law

Let
p_r denote the probability of the word of rank r
N denote the total number of word occurrences

p_r = f / N = A / r
The value of A is found to be close to 0.1 for typical corpora.


✓ 1 5" f. ✗ =
I #tokens
corpora 10k

-
s

Pawan Goyal (IIT Kharagpur) Empirical Laws Week 1: Lecture 4 9 / 15
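A minimal sketch (not from the slides) of estimating A at a few ranks, again assuming a pre-tokenized corpus `tokens`:

```python
# A minimal sketch (not from the slides): estimating A from p_r = f/N ≈ A/r
# at a few ranks. `tokens` is assumed to be a list of word tokens from a
# corpus with at least ~1000 word types.
from collections import Counter

counts = Counter(tokens)
N = sum(counts.values())
ranked = counts.most_common()
for rank in (10, 100, 1000):
    f = ranked[rank - 1][1]
    print(f"rank {rank:>4}: A ≈ f*r/N = {f * rank / N:.3f}")   # often close to 0.1
```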


Empirical Evaluation from Tom Sawyer

[Empirical rank-frequency data from Tom Sawyer]
Zipf’s Other Laws

Correlation: Number of meanings and word frequency


The number of meanings m of a word obeys the law:

m ∝ √f

Given the First law

m ∝ 1/√r

Empirical Support
Rank ≈ 10000, average 2.1 meanings
Rank ≈ 5000, average 3 meanings
Rank ≈ 2000, average 4.6 meanings
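A rough sketch (not from the slides) of how such figures could be approximated, using WordNet sense counts as a stand-in for "number of meanings" and the Brown corpus for frequency ranks; it assumes the NLTK brown and wordnet data are installed, and sense counts only approximate dictionary meanings:

```python
# A rough sketch (not from the slides): average number of WordNet senses for
# words around a given frequency rank in the Brown corpus. Assumes the NLTK
# 'brown' and 'wordnet' data are downloaded.
from collections import Counter
from nltk.corpus import brown, wordnet as wn

ranked = [w for w, _ in Counter(w.lower() for w in brown.words()
                                if w.isalpha()).most_common()]
for rank in (2000, 5000, 10000):
    window = ranked[rank - 50:rank + 50]           # words near this rank
    senses = [len(wn.synsets(w)) for w in window]
    print(rank, sum(senses) / len(senses))
```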



Zipf’s Other Laws

Correlation: Word length and word frequency


Word frequency is inversely proportional to word length.
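A minimal sketch (not from the slides) of this tendency, computing the average word length in a few frequency bands of the Brown corpus (assumes the NLTK brown data is installed):

```python
# A minimal sketch (not from the slides): average word length per frequency
# band in the Brown corpus; longer words should appear in lower-frequency bands.
from collections import Counter
from nltk.corpus import brown

ranked = Counter(w.lower() for w in brown.words() if w.isalpha()).most_common()
for lo, hi in [(0, 100), (100, 1000), (1000, 10000)]:
    band = ranked[lo:hi]
    avg_len = sum(len(word) for word, _ in band) / len(band)
    print(f"ranks {lo + 1}-{hi}: average length {avg_len:.1f}")
```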



Impact of Zipf’s Law

The Good part


Stopwords account for a large fraction of text, thus eliminating them greatly
reduces the number of tokens in a text.
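A minimal sketch (not from the slides) of measuring this reduction with NLTK's English stopword list; `tokens` is assumed to be a lowercased token list and the stopword data must be downloaded:

```python
# A minimal sketch (not from the slides): the fraction of running tokens that
# would be removed by filtering NLTK's English stopword list.
from nltk.corpus import stopwords

def fraction_removed(tokens):
    stops = set(stopwords.words("english"))
    return sum(token in stops for token in tokens) / len(tokens)
```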

The Bad part


Most words are extremely rare, and thus gathering sufficient data for
meaningful statistical analysis is difficult for most words.

Vocabulary Growth

How does the size of the overall vocabulary (number of unique words) grow
with the size of the corpus?


Heaps’ Law
Let |V| be the size of vocabulary and N be the number of tokens.

|V| = K N^b

Typically
K ≈ 10-100
b ≈ 0.4-0.6 (roughly square root)
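A minimal sketch (not from the slides) of fitting K and b to a corpus by linear regression on the log-log curve; `tokens` is assumed to be the corpus token list:

```python
# A minimal sketch (not from the slides): estimating Heaps' law parameters
# K and b from a least-squares fit of log|V| against log N.
import numpy as np

def heaps_fit(tokens, step=1000):
    sizes, vocab, seen = [], [], set()
    for i, token in enumerate(tokens, 1):
        seen.add(token)
        if i % step == 0:       # record (N, |V|) every `step` tokens
            sizes.append(i)
            vocab.append(len(seen))
    b, log_k = np.polyfit(np.log(sizes), np.log(vocab), 1)
    return np.exp(log_k), b     # K and b; b usually comes out around 0.4-0.6
```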



Heaps’ Law: Empirical Evidence

