Empirical Laws
Pawan Goyal
CSE, IITKGP
Week 1: Lecture 4
Function words have little lexical meaning but serve as important elements in the structure of sentences.
Example
The winfy prunkilmonger from the glidgement mominkled and brangified
all his levensers vederously.
Glop angry investigator larm blonk government harassed gerfritz infuriated sutbor pumrog listeners thoroughly.
In the first sentence the content words have been replaced by nonsense forms while the function words (the, from, and, all, his) are kept; in the second, the function words have been replaced while the content words (angry, investigator, government, harassed, infuriated, listeners, thoroughly) are kept.
Function words include pronouns (he, his, her), particles, etc.
The one really exceptional word is Tom, whose frequency reflects the text
chosen.
Type-Token distinction
The type-token distinction separates a concept from the objects which are particular instances of that concept. For a given text one can ask: how many types? How many tokens?
Pawan Goyal (IIT Kharagpur) Empirical Laws Week 1: Lecture 4 4 / 15
Type vs. Tokens
Type/Token Ratio
The type/token ratio (TTR) is the ratio of the number of different words
(types) to the number of running words (tokens) in a given text or corpus.
This index indicates how often, on average, a new ‘word form’ appears in
the text or corpus.
A low TTR indicates heavy repetition of word forms; a high TTR indicates a varied vocabulary (e.g. generic text vs. literature).
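As a concrete illustration, the ratio can be computed directly. This is a minimal sketch using naive whitespace tokenization (an assumption; real corpora need a proper tokenizer), with a made-up sample sentence:

```python
def type_token_ratio(text: str) -> float:
    # Naive whitespace tokenization -- an assumption for illustration only
    tokens = text.lower().split()
    # types = distinct word forms; tokens = running words
    return len(set(tokens)) / len(tokens)

sample = "the cat sat on the mat and the dog sat on the rug"
# 13 running words (tokens), but only 8 distinct word forms (types)
print(round(type_token_ratio(sample), 2))  # → 0.62
```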
Comparison Across Texts
[Figure: running type/token ratio compared across texts; the counts are dominated by function words.]
Word Distribution from Tom Sawyer
Zipf’s Law
A relationship between the frequency of a word (f ) and its position in the list
(its rank r).
f ∝ 1/r
or, there is a constant k such that
f · r = k
i.e. the 50th most common word should occur with 3 times the frequency of
the 150th most common word.
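The 3:1 claim follows directly from f · r = k; the value of k cancels out. A quick check with a hypothetical constant:

```python
k = 90_000                 # hypothetical constant for some corpus; cancels out
freq = lambda r: k / r     # Zipf's law: f = k / r
print(freq(50) / freq(150))  # → 3.0
```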
Let
pr denote the probability of word of rank r
N denote the total number of word occurrences
Then
pr = f / N = A / r
i.e. f · r = A · N, for a constant A ≈ 0.1 empirically.
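With pr = A/r and A ≈ 0.1, the expected frequency of the rank-r word can be sketched for a hypothetical corpus size N (both values below are illustrative):

```python
A = 0.1          # empirical constant for English text, per the slide
N = 1_000_000    # hypothetical corpus size

# f = pr * N = A * N / r
expected_freq = lambda r: A * N / r

print(int(expected_freq(1)))    # → 100000 (most frequent word)
print(int(expected_freq(100)))  # → 1000
```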
Zipf’s Other Laws
Number of meanings
The number of meanings m of a content word is proportional to the square root of its frequency:
m ∝ √f
Given the First law (f ∝ 1/r), this gives
m ∝ 1/√r
Empirical Support
Rank ≈ 10000, average 2.1 meanings
Rank ≈ 5000, average 3 meanings
Rank ≈ 2000, average 4.6 meanings
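The empirical figures are consistent with m ∝ 1/√r. As a back-of-the-envelope check (not from the slide), fitting the constant from the rank-10000 point predicts the other two:

```python
import math

# Fit m = C / sqrt(r) at the rank-10000 data point (2.1 meanings)
C = 2.1 * math.sqrt(10000)
meanings = lambda r: C / math.sqrt(r)

print(round(meanings(5000), 2))  # → 2.97 (observed: ~3)
print(round(meanings(2000), 2))  # → 4.7  (observed: 4.6)
```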
The Good part
Stopwords account for a large fraction of text, so eliminating them greatly reduces the number of tokens in a text.
Vocabulary Growth
How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
Heaps’ Law
Let |V| be the size of the vocabulary and N be the number of tokens.
|V| = K · N^b
Typically
K ≈ 10-100
b ≈ 0.4-0.6 (roughly square root)
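A quick sketch of what these typical constants imply; K = 30 and b = 0.5 are hypothetical values chosen from inside the stated ranges:

```python
K, b = 30, 0.5               # hypothetical constants within the typical ranges
vocab = lambda N: K * N**b   # Heaps' law: |V| = K * N^b

print(int(vocab(10_000)))     # → 3000 types in a 10k-token text
print(int(vocab(1_000_000)))  # → 30000 types in a 1M-token corpus
```

Note the sub-linear growth: with b = 0.5, a 100× larger corpus yields only a 10× larger vocabulary.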