IB Mathematical Exploration 2018


Exploration

“Studying Zipf’s law in random and natural texts

with and without translation”

Subject: Mathematics HL

May 2018 session

Table of contents:

1. Introduction

2. Investigation

2.1 Exploring the distribution for a given text

2.2 The Least Squares method

2.3 Results

2.4 Zipf’s Law for random texts. Reasoning.

2.5 Addition. Zipf’s Law in Russian cities’ population distribution

3. Conclusion

4. Bibliography

Introduction

Zipf's law is an empirical regularity in the distribution of word frequencies in a natural language: if all the words of a language (or just of a sufficiently long text) are ordered in descending order of their frequency of use, then the frequency of the r-th word in such a list is approximately inversely proportional to its ordinal number r.

However, the American researcher Wentian Li tried to refute Zipf's law by proving that a random sequence of symbols also obeys it. He draws the hypothetical conclusion that Zipf's law appears to be a purely statistical phenomenon, not related to the semantics of the text. The law struck me as mysterious and fascinating once I found it. Since I am extremely interested in linguistics, I became very motivated to explore the manifestations of Zipf's law in literature and texts. Zipf's law predicts a dependence of a word's frequency of occurrence in a text on its rank. The formula of the law is presented below:


p(r) = c / r^α

where α ≈ 1 is the power exponent and c is a normalization constant; with α close to 1 the law is close to a reciprocal function. The graph below shows the distribution for different languages.1

1 https://en.wikipedia.org/wiki/Zipf%27s_law, 20.02.17

Figure 1. A plot of the rank versus frequency for the first 10 million words in 30
Wikipedias (dumps from October 2015) in a log-log scale.

Investigation

2.1 Exploring the distribution for a given text.

In my mathematical exploration, I intend to explore the distribution for a given text. I decided to take Jules Verne's French novel "Twenty Thousand Leagues Under the Sea" in two different versions: the original text and a professional English translation. The texts were downloaded from Project Gutenberg.2

First, it is necessary to observe the distribution. I implemented a Python script,

performing the following procedure:

1. Read the file line by line with a ‘for’ loop, into one string.

2. Remove all punctuation marks and convert all letters to lowercase before building the list. This is essential because otherwise words like “He” and “he” would be counted as two different words. Then split the string into words. Now we have the full list of the words. One problem remains, however: different forms of the same word – like ‘be’ and ‘was’ – are treated as different words, and capturing this would require a lemmatizer. Even without one, Zipf’s law clearly emerges.

3. Using a regular expression, find all unique words in the list.

4. Count the occurrences of each word and the total number of words, and sort in descending order.
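The procedure above can be sketched as a short Python script (a minimal version of the idea; the file name and the exact regular expression are illustrative choices, and no lemmatization is performed):

```python
import re
from collections import Counter

def word_frequencies(path):
    """Return (word, count) pairs sorted by descending frequency, plus the total word count."""
    # Step 1: read the file line by line into one string.
    with open(path, encoding="utf-8") as f:
        text = " ".join(line for line in f)
    # Step 2: lowercase everything and keep only runs of letters, so "He" and "he"
    # merge and punctuation disappears; this also splits the string into words.
    words = re.findall(r"[a-zà-öø-ÿ']+", text.lower())
    # Steps 3-4: count occurrences of each unique word and sort in descending order.
    counts = Counter(words)
    return counts.most_common(), len(words)

# Example: pairs, total = word_frequencies("20000_leagues.txt")
```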

Now we have a descending list of frequencies. However, if we simply plotted the experimental points (rankᵢ, frequencyᵢ), such a plot would be very difficult to understand and interpret, since the curve lies too close to its asymptotes. There is no simple way of doing regression in such non-linear cases, so one plots in log-log scale instead. What does that actually mean?

2 https://www.gutenberg.org/

Figure 2. The log-log scale plot of p(r) = c/r^α.

Let us take the logarithm of both sides of the equation:

ln p = ln c − α ln r,

y = −α x + ln c, where −α is the slope of the line and ln c is the vertical shift:

y = k x + b,  k = −α,  b = ln c,  c = e^b.

The theory thus predicts that we should get a straight line. Changing c merely shifts that line vertically, which is why we are not as interested in c as in determining α.

2.2 The Least Squares method

In our IB syllabus we have the Least Squares method, which finds the best line fitting a set of points. Let us construct the best-fit line using this method. The (xᵢ, yᵢ) pairs here are the experimentally observed points.

Figure 3. The Least Squares Method.

The green line shown in the figure is the candidate best fit. The idea is simply to take the difference (in the vertical direction) – the mismatch – between each experimental point and the line’s prediction at the given xᵢ, and sum the squares of these differences over all points:

S(k, b) = ∑i dᵢ² = ∑i (yᵢ − k xᵢ − b)²

This is now a function of k and b. Notice that, being a sum of squares, this function is non-negative, so its only extremum point is its minimum – and the minimum is what we’re looking for.

Let us take the derivatives of this function with respect to both of its arguments and set them to zero to obtain the condition for the minimum.

∂S/∂k = ∑i 2(yᵢ − k xᵢ − b)(−xᵢ) = 0

∂S/∂b = ∑i 2(yᵢ − k xᵢ − b)(−1) = 0

This leaves

∑i yᵢ xᵢ − k ∑i xᵢ² − b ∑i xᵢ = 0

∑i yᵢ − k ∑i xᵢ − N b = 0

which is a system of linear equations in (k, b). The power of the least squares method is that it provides explicit formulae for the best values of the line parameters. Denote

⟨x⟩ = ∑i xᵢ / N,  ⟨x²⟩ = ∑i xᵢ² / N,  ⟨y⟩ = ∑i yᵢ / N,  ⟨xy⟩ = ∑i yᵢ xᵢ / N

to rewrite the system of equations as

k⟨x²⟩ + b⟨x⟩ = ⟨xy⟩

k⟨x⟩ + b = ⟨y⟩

from which (k, b) are obtained.
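The explicit formulae can be turned into a few lines of Python (a sketch of the method, not my exact script; solving the 2×2 system above by substitution gives k = (⟨xy⟩ − ⟨x⟩⟨y⟩)/(⟨x²⟩ − ⟨x⟩²) and b = ⟨y⟩ − k⟨x⟩):

```python
def least_squares(xs, ys):
    """Best-fit line y = k*x + b through the points (xs[i], ys[i])."""
    N = len(xs)
    mean_x  = sum(xs) / N                             # <x>
    mean_x2 = sum(x * x for x in xs) / N              # <x^2>
    mean_y  = sum(ys) / N                             # <y>
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / N  # <xy>
    k = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
    b = mean_y - k * mean_x
    return k, b
```

For the Zipf fit, xs and ys are the logarithms of ranks and frequencies, and α = −k.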

2.3 Results

As I have already mentioned, I decided to examine the novel “Twenty Thousand Leagues Under the Sea” in two variants – the original and the translation. For each of these texts I calculated the frequency of occurrence of words as a function of their ranks and obtained the following log-log scale plots:

Fig. 4. Data for the English translation: log(word frequency) against log(word rank). The green part is the cut used for a better fit.

αuncut = 1.16594 and αcut = 0.84609

Fig. 5. Data for the French original: log(word frequency) against log(word rank). The green part is the cut used for a better fit.

αuncut = 1.13962 and αcut = 0.88697

To draw the best line, we cut the graphs and keep only the points with log-rank between 2 and 6, so that the fit is more accurate. Both graphs are thus fitted by the method of least squares, with green for the cut data and red for the uncut data.

As we can observe, both graphs match Zipf’s law closely. However, I decided to go further and investigate the procedure of translation in a slightly more profound way, to check whether the results change significantly. I took the English version of the text and translated it repeatedly to and from French using Google Translate. For each language three graphs thus emerged; I drew them all in one plot:

1) Original English text (translation of a human translator from French)

2) Translation through French

3) Double translation through French

Fig. 6. The multiple translations of the English version of the text via Google Translate: Original English, En→Fr→En, and En→Fr→En→Fr→En.

2.4 Zipf’s Law for random texts

The actual reason why Zipf’s law arises in natural languages is not yet fully understood. Many studies of the signals used by intelligent animals (dolphins’ whistling3 being an example) invoke this law to show that at least such communication systems statistically look the same (at large) as the numerous human languages for which this law is undoubtedly observed.

3 https://arxiv.org/abs/1205.0321

However, it turns out that surprisingly little needs to be assumed about a text for it to obey this law – in fact, as is shown in a paper4, even a totally random text exhibits Zipf’s statistics. Let us see why, following this work.

We will call a random text a collection of symbols picked randomly from some alphabet of size A, plus the space sign _ to separate the words:

this_is_my_lovely_mathematical_exploration

By picking a sign at random from this set of (A + 1) signs we mean that all signs are equally probable – each with probability p(sign) = 1/(A + 1) – so the signs are uniformly distributed over the text. This assumption simplifies the analysis but seems like total nonsense for natural languages: given the way written languages (at least alphabetic ones) form on the basis of spoken ones, it is extremely difficult to believe there are languages for which this could hold.
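Such a random text is easy to simulate (a quick sketch; the alphabet size, text length, and random seed are arbitrary choices of mine):

```python
import random
from collections import Counter

def random_text(n_signs, A=26, seed=0):
    """n_signs symbols drawn uniformly from A letters plus the space sign '_'."""
    rng = random.Random(seed)
    signs = "abcdefghijklmnopqrstuvwxyz"[:A] + "_"
    return "".join(rng.choice(signs) for _ in range(n_signs))

text = random_text(200_000)
# Words are the runs of letters between spaces; rank them by frequency.
ranked = Counter(w for w in text.split("_") if w).most_common()
# The top-ranked words turn out to be the shortest ones, as the analysis predicts.
```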

Now the probability of a given separate word of length L (separated by spaces, so L + 2 symbols in a row) equals

pword(L) = a / (A + 1)^(L+2)

for any of the A^L words of such length. The constant multiple a is there to make sure this is a proper probability expression, so that

∑_{L≥1} A^L pword(L) = a/(A + 1)² · ∑_{L≥1} (A/(A + 1))^L = 1

where the sum is an infinite convergent geometric series with ratio q = A/(A + 1) < 1:

4 Wentian Li, “Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution”, Santa Fe Institute, March 1991.

∑_{L≥1} q^L = q/(1 − q) = (A/(A + 1)) / (1/(A + 1)) = A

so that gives a = (A + 1)²/A, and hence pword(L) = 1/(A(A + 1)^L). Now if one is interested in the probability of any L-long word appearing, this is

pword(L) · (# of words of length L) = A^L · 1/(A(A + 1)^L) = A^(L−1)/(A + 1)^L
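We can check this normalization numerically (a small sanity check of my own; the infinite sum is truncated at a length where the tail is negligible):

```python
A = 26
a = (A + 1) ** 2 / A          # the normalization constant derived above
q = A / (A + 1)
# Total probability over all word lengths: sum over L of A^L * pword(L),
# where A^L * a / (A+1)^(L+2) = (a / (A+1)^2) * q^L.
total = sum(a / (A + 1) ** 2 * q ** L for L in range(1, 1001))
# total comes out equal to 1 up to floating-point error.
```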

We thus get quite naturally that, since in such a text all the words of the same length are interchangeable – they are all equally probable (with probabilities decaying exponentially in the word’s length) – the ranks of words are determined solely by their lengths.

Consider all words of length L. For any word of this set, all the words of smaller length rank higher, so

r(L) > ∑_{l=1}^{L−1} (# of words of length l) = ∑_{l=1}^{L−1} A^l = A · (A^(L−1) − 1)/(A − 1)

and within these words of length L we need to assign exactly another A^L ranks, so the rank of a given word is bounded:

A · (A^(L−1) − 1)/(A − 1) < r(L) ≤ A · (A^L − 1)/(A − 1)

To get Zipf’s law from this, three steps are left. First, take the logarithm base A of the bounds:

L − 1 < logA((A − 1)/A · r(L) + 1) ≤ L

which, exponentiated with base 1/(A + 1), multiplied by 1/A, and combined with the formula for the probability of a certain word, gives

pword(L) ≤ c/(r(L) + β)^α < pword(L − 1)

where

α = log(A + 1)/log A,  β = A/(A − 1),  c = A^(α−1)/(A − 1)^α

are the constants of interest. For English, using the Latin alphabet of A = 26 characters, this gives5 α = 1.01158 and c = 0.04, which is very close to the observed values.
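These values are easy to reproduce from the formulae above (a quick numerical check):

```python
import math

A = 26
alpha = math.log(A + 1) / math.log(A)     # the power exponent
beta = A / (A - 1)
c = A ** (alpha - 1) / (A - 1) ** alpha   # the constant of the power law
# alpha ≈ 1.01158 and c ≈ 0.040, matching the values quoted in the text.
```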

A possible explanation of the above phenomenon, as stated in the paper, is that it arises from our choice of rank as the independent variable rather than, for example, the word length: the exponential distribution of word frequencies as a function of length,

pword ∝ e^(−γL)

with γ being the scaling factor, turns into the power-law distribution of Zipf’s law when expressed as a function of the word’s rank.
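This change of variables can be made explicit in one line: since in this model the rank grows roughly exponentially with word length, r ∝ A^L, we have L ≈ log_A r up to a constant, so (a sketch of the step, in the notation above):

```latex
p_{word} \propto e^{-\gamma L} \approx e^{-\gamma \log_A r}
        = e^{-\gamma \ln r / \ln A}
        = r^{-\gamma/\ln A}
```

and with γ = ln(A + 1), as for the random text above, the exponent γ/ln A = ln(A + 1)/ln A is exactly the α obtained earlier.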

2.5 Addition. Zipf’s Law in Russian cities’ population distribution

Apart from the above, Zipf’s Law does not only work for texts. The question of economic activity in territorial space has been investigated by scientists for two hundred years, and contemporary works show economists’ interest in the manifestation of Zipf’s law in regional systems and in the distribution of cities according to the “rank-size” principle.


In addition to the main part of my exploration, I aimed to check this further implementation of Zipf’s law for Russian cities: to prove or refute the hypothesis that in Russia the Zipf coefficient depends on the size of the geographical territory of the Federal district. I achieved this by using the method of least squares that I explored in my investigation above. First I analyzed the implementation of Zipf’s law in Russia as a whole, and then separately for each Federal district. In total, 1,123 Russian cities with populations over 1,000 people entered the sample.

5 Wentian Li, “Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution”, Santa Fe Institute, March 1991.

Zipf’s law states that within a territory the size distribution of cities follows a Pareto distribution with an index equal to one.

Graphs reflecting the manifestation of the rank-size regularity (Zipf’s law) in cities at the regional and national levels of Russia are presented in Fig. 7.

Fig. 7. The rank-size dependence for cities of Russia in general and by Federal district: a – Central Federal district; b – North-Western Federal district; c – South Federal district; d – North-Caucasian Federal district.

Finally, in the process of checking Zipf’s law for the cities of Russia, I determined that the law holds for small cities (8,600–15,300 people) and for large ones (66,700–331,000 people). In the sample of cities with populations exceeding 100 thousand people, Zipf’s law does not hold for cities over 1 million people (the exception being St. Petersburg). The result of the research was the confirmation of the hypothesis that the Zipf coefficient depends on the size of the geographical territory of the Federal district.

Conclusion

Considering the mathematical exploration that I conducted, I would like to point out that Zipf’s law is formulated using mathematical statistics and refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power-law probability distributions. It has a surprisingly broad field of applications, including the word distributions of random texts and even the population statistics of cities. Through my research, I studied the least squares method in detail and learned to apply it, as well as to plot in log-log scale. Furthermore, I not only found that Zipf’s distribution law works for the literary work “Twenty Thousand Leagues Under the Sea”, but also showed that it continues to hold even after several stages of machine translation, despite the degradation of meaning. The graphs look almost identical, which supports the conclusion that the law does not depend on the semantics of a text and that even random texts are subject to Zipf’s law.

Moreover, as a second part of my exploration, I analyzed a sample of several Russian Federal districts and found that the Zipf coefficient varies from −0.64 (Far Eastern Federal district) to −0.9 (Ural and North Caucasus Federal districts). In the analysis of the sample of cities with populations over 100 thousand people, the coefficient came out as −1.13, which indicates the uniformity of the hierarchy of cities in this sample. The result of the research is the confirmation of the hypothesis that the Zipf coefficient depends on the size of the geographical territory of the Federal district.

Bibliography

1. Mansury Y., Gulyás L. The Emergence of Zipf’s Law in a System of Cities: An Agent-Based Simulation Approach. Journal of Economic Dynamics and Control, 2007, vol. 31, iss. 7, pp. 2438–2460.

2. Moura N.J., Ribeiro M.B. Zipf Law for Brazilian Cities. Physica A: Statistical Mechanics and its Applications, 2006, vol. 367(C), pp. 441–448.

3. Jiang B., Jia T. Zipf's Law for All the Natural Cities in the United States: A Geospatial Perspective. International Journal of Geographical Information Science, 2011, vol. 25, no. 8, pp. 1269–1281.

4. Xu Z., Harriss R. A Spatial and Temporal Autocorrelated Growth Model for City Rank–Size Distribution. Urban Studies, 2010, vol. 47, iss. 2, pp. 321–335.

5. Li W. Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution. Santa Fe Institute, March 1991.

6. Figures were made using GeoGebra software, https://www.geogebra.org

7. Plots for experiment results were made in Microsoft Excel.

8. The code for analyzing was written in Python, following https://code.tutsplus.com/tutorials/how-to-use-python-to-find-the-zipf-distribution-of-a-text-file--cms-26502

