IB Mathematical Exploration 2018


Exploration

“Studying Zipf’s law in random and natural texts

with and without translation”

Subject: Mathematics HL

May 2018 session

Table of contents:

1. Introduction

2. Investigation

2.1 Exploring the distribution for a given text

2.2 The Least Squares method

2.3 Results

2.4 Zipf’s Law for random texts. Reasoning.

2.5 Addition. Zipf’s Law in Russian cities’ population distribution

3. Conclusion

4. Bibliography

Introduction

Zipf's law is an empirical regularity in the distribution of word frequencies in a natural language: if all the words of a language (or just of a sufficiently long text) are ordered in descending order of their frequency of use, then the frequency of the r-th word in such a list is approximately inversely proportional to its ordinal number r.

However, the American researcher Wentian Li tried to refute Zipf's law by proving that a random sequence of symbols also obeys it. He draws the hypothetical conclusion that Zipf's law appears to be a purely statistical phenomenon, not related to the semantics of the text. The law struck me as mysterious and fascinating once I found it. Since I am extremely interested in linguistics, I became very motivated to explore the manifestations of Zipf's law in literature and texts. Zipf's law predicts a dependence of a word's frequency of occurrence in a text on its rank. The formula of the law is presented below:


p(r) = c / r^α

where α ≈ 1 is the power exponent and c is a normalization constant; with α close to 1 the law is close to a reciprocal function. The graph below shows the distribution for different languages.1

1 https://en.wikipedia.org/wiki/Zipf%27s_law, 20.02.17

Figure 1. A plot of the rank versus frequency for the first 10 million words in 30
Wikipedias (dumps from October 2015) in a log-log scale.

Investigation

2.1 Exploring the distribution for a given text.

In my mathematical exploration, I intend to explore the distribution for a given text. I decided to take Jules Verne's French novel "Twenty Thousand Leagues Under the Sea" in two different versions: the original text and a professional English translation. The texts were downloaded from Project Gutenberg.2

First, it is necessary to observe the distribution. I implemented a Python script,

performing the following procedure:

1. Read the file line by line with a ‘for’ loop, into one string.

2. Remove all punctuation marks and convert all letters to lowercase before building the list. This is essential because otherwise words like “He” and “he” would be counted as two different words. Then split the string into words. Now we have the full list of the words. One problem remains, however: different forms of the same word – like ‘be’ and ‘was’ – are treated as different words, and capturing this would require a lemmatizer. Even without one, Zipf’s law clearly emerges.

3. Using a regular expression, find all unique words in the list.

4. Count the occurrences of each word and the total number of words, and sort in descending order.
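The procedure above can be sketched as a short Python script (a minimal version of the idea; the file name and the exact regular expression are illustrative choices, and no lemmatization is performed):

```python
import re
from collections import Counter

def word_frequencies(path):
    """Return (word, count) pairs sorted by descending frequency, plus the total word count."""
    # Step 1: read the file line by line into one string.
    with open(path, encoding="utf-8") as f:
        text = " ".join(line for line in f)
    # Step 2: lowercase everything and keep only runs of letters, so "He" and "he"
    # merge and punctuation disappears; this also splits the string into words.
    words = re.findall(r"[a-zà-öø-ÿ']+", text.lower())
    # Steps 3-4: count occurrences of each unique word and sort in descending order.
    counts = Counter(words)
    return counts.most_common(), len(words)

# Example: pairs, total = word_frequencies("20000_leagues.txt")
```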

Now we have a descending list of frequencies. However, if we simply plotted the experimental points (rankᵢ, frequencyᵢ), such a plot would be very difficult to understand and interpret, since the curve lies too close to its asymptotes. There is no simple way of doing regression in such non-linear cases, so one plots in log-log scale instead. What does that actually mean?

2 https://www.gutenberg.org/

Figure 2. The log-log scale plot of p(r) = c/r^α.

Let us take the logarithm of both sides of the equation:

ln p = ln c − α ln r,

y = −α x + ln c, where −α is the slope of the line and ln c is the vertical shift:

y = k x + b,  k = −α,  b = ln c,  c = e^b.

The theory thus predicts that we should get a straight line. Changing c merely shifts that line vertically, which is why we are not as interested in c as in determining α.

2.2 The Least Squares method

In our IB syllabus we have the Least Squares method, which finds the best line fitting a set of points. Let us construct the best-fit line using this method. The (xᵢ, yᵢ) pairs here are the experimentally observed points.

Figure 3. The Least Squares Method.

The green line shown in the figure is the candidate best fit. The idea is simply to take the difference (in the vertical direction) – the mismatch – between each experimental point and the line’s prediction at the given xᵢ, and sum the squares of these differences over all points:

S(k, b) = ∑i dᵢ² = ∑i (yᵢ − k xᵢ − b)²

This is now a function of k and b. Notice that, being a sum of squares, this function is non-negative, so its only extremum point is its minimum – and the minimum is what we’re looking for.

Let us take the derivatives of this function with respect to both of its arguments and set them to zero to obtain the condition for the minimum.

∂S/∂k = ∑i 2(yᵢ − k xᵢ − b)(−xᵢ) = 0

∂S/∂b = ∑i 2(yᵢ − k xᵢ − b)(−1) = 0

This leaves

∑i yᵢ xᵢ − k ∑i xᵢ² − b ∑i xᵢ = 0

∑i yᵢ − k ∑i xᵢ − N b = 0

which is a system of linear equations in (k, b). The power of the least squares method is that it provides explicit formulae for the best values of the line parameters. Denote

⟨x⟩ = ∑i xᵢ / N,  ⟨x²⟩ = ∑i xᵢ² / N,  ⟨y⟩ = ∑i yᵢ / N,  ⟨xy⟩ = ∑i yᵢ xᵢ / N

to rewrite the system of equations as

k⟨x²⟩ + b⟨x⟩ = ⟨xy⟩

k⟨x⟩ + b = ⟨y⟩

from which (k, b) are obtained.
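The explicit formulae can be turned into a few lines of Python (a sketch of the method, not my exact script; solving the 2×2 system above by substitution gives k = (⟨xy⟩ − ⟨x⟩⟨y⟩)/(⟨x²⟩ − ⟨x⟩²) and b = ⟨y⟩ − k⟨x⟩):

```python
def least_squares(xs, ys):
    """Best-fit line y = k*x + b through the points (xs[i], ys[i])."""
    N = len(xs)
    mean_x  = sum(xs) / N                             # <x>
    mean_x2 = sum(x * x for x in xs) / N              # <x^2>
    mean_y  = sum(ys) / N                             # <y>
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / N  # <xy>
    k = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
    b = mean_y - k * mean_x
    return k, b
```

For the Zipf fit, xs and ys are the logarithms of ranks and frequencies, and α = −k.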

2.3 Results

As I have already mentioned, I decided to examine the novel “Twenty Thousand Leagues Under the Sea” in two variants – the original and the translation. For each of these texts I calculated the frequency of occurrence of words as a function of their ranks and obtained the following log-log scale plots:

Fig. 4. Data for the English translation: log(word frequency) against log(word rank). The green part is the cut used for a better fit.

αuncut = 1.16594 and αcut = 0.84609

Fig. 5. Data for the French original: log(word frequency) against log(word rank). The green part is the cut used for a better fit.

αuncut = 1.13962 and αcut = 0.88697

To draw the best line, we cut the graphs and keep only the points with log-rank between 2 and 6, so that the fit is more accurate. Both graphs are thus fitted by the method of least squares, with green for the cut data and red for the uncut data.

As we can observe, both graphs match Zipf’s law closely. However, I decided to go further and investigate the procedure of translation in a slightly more profound way, to check whether the results change significantly. I took the English version of the text and translated it repeatedly to and from French using Google Translate. For each language three graphs thus emerged; I drew them all in one plot:

1) Original English text (translation of a human translator from French)

2) Translation through French

3) Double translation through French

Fig. 6. The multiple translations of the English version of the text via Google Translate: Original English, En→Fr→En, and En→Fr→En→Fr→En.

2.4 Zipf’s Law for random texts

The actual reason why Zipf’s law arises in natural languages is not yet fully understood. Many studies of the signals used by intelligent animals (dolphins’ whistling3 being an example) invoke this law to show that at least such communication systems statistically look the same (at large) as the numerous human languages for which this law is undoubtedly observed.

3 https://arxiv.org/abs/1205.0321

However, it turns out that surprisingly little needs to be assumed about a text for it to obey this law – in fact, as is shown in a paper4, even a totally random text exhibits Zipf’s statistics. Let us see why, following this work.

We will call a random text a collection of symbols picked randomly from some alphabet of size A, plus the space sign _ to separate the words:

this_is_my_lovely_mathematical_exploration

By picking a sign at random from this set of (A + 1) signs we mean that all signs are equally probable – each with probability p(sign) = 1/(A + 1) – so the signs are uniformly distributed over the text. This assumption simplifies the analysis but seems like total nonsense for natural languages: given the way written languages (at least alphabetic ones) form on the basis of spoken ones, it is extremely difficult to believe there are languages for which this could hold.
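Such a random text is easy to simulate (a quick sketch; the alphabet size, text length, and random seed are arbitrary choices of mine):

```python
import random
from collections import Counter

def random_text(n_signs, A=26, seed=0):
    """n_signs symbols drawn uniformly from A letters plus the space sign '_'."""
    rng = random.Random(seed)
    signs = "abcdefghijklmnopqrstuvwxyz"[:A] + "_"
    return "".join(rng.choice(signs) for _ in range(n_signs))

text = random_text(200_000)
# Words are the runs of letters between spaces; rank them by frequency.
ranked = Counter(w for w in text.split("_") if w).most_common()
# The top-ranked words turn out to be the shortest ones, as the analysis predicts.
```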

Now the probability of a given separate word of length L (separated by spaces, so L + 2 symbols in a row) equals

pword(L) = a / (A + 1)^(L+2)

for any of the A^L words of such length. The constant multiple a is there to make sure this is a proper probability expression, so that

∑_{L≥1} A^L pword(L) = a/(A + 1)² · ∑_{L≥1} (A/(A + 1))^L = 1

where the sum is an infinite convergent geometric series with ratio q = A/(A + 1) < 1:

4 Wentian Li, “Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution”, Santa Fe Institute, March 1991.

∑_{L≥1} q^L = q/(1 − q) = (A/(A + 1)) / (1/(A + 1)) = A

so that gives a = (A + 1)²/A, and hence pword(L) = 1/(A(A + 1)^L). Now if one is interested in the probability of any L-long word appearing, this is

pword(L) · (# of words of length L) = A^L · 1/(A(A + 1)^L) = A^(L−1)/(A + 1)^L
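We can check this normalization numerically (a small sanity check of my own; the infinite sum is truncated at a length where the tail is negligible):

```python
A = 26
a = (A + 1) ** 2 / A          # the normalization constant derived above
q = A / (A + 1)
# Total probability over all word lengths: sum over L of A^L * pword(L),
# where A^L * a / (A+1)^(L+2) = (a / (A+1)^2) * q^L.
total = sum(a / (A + 1) ** 2 * q ** L for L in range(1, 1001))
# total comes out equal to 1 up to floating-point error.
```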

We thus get quite naturally that, since in such a text all the words of the same length are interchangeable – they are all equally probable (with probabilities decaying exponentially in the word’s length) – the ranks of words are determined solely by their lengths.

Consider all words of length L. For any word of this set, all the words of smaller length rank higher, so

r(L) > ∑_{l=1}^{L−1} (# of words of length l) = ∑_{l=1}^{L−1} A^l = A · (A^(L−1) − 1)/(A − 1)

and within these words of length L we need to assign exactly another A^L ranks, so the rank of a given word is bounded:

A · (A^(L−1) − 1)/(A − 1) < r(L) ≤ A · (A^L − 1)/(A − 1)

To get Zipf’s law from this, three steps are left. First, take the logarithm base A of the bounds:

L − 1 < logA((A − 1)/A · r(L) + 1) ≤ L

which, exponentiated with base 1/(A + 1), multiplied by 1/A, and combined with the formula for the probability of a certain word, gives

pword(L) ≤ c/(r(L) + β)^α < pword(L − 1)

where

α = log(A + 1)/log A,  β = A/(A − 1),  c = A^(α−1)/(A − 1)^α

are the constants of interest. For English, using the Latin alphabet of A = 26 characters, this gives5 α = 1.01158 and c = 0.04, which is very close to the observed values.
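These values are easy to reproduce from the formulae above (a quick numerical check):

```python
import math

A = 26
alpha = math.log(A + 1) / math.log(A)     # the power exponent
beta = A / (A - 1)
c = A ** (alpha - 1) / (A - 1) ** alpha   # the constant of the power law
# alpha ≈ 1.01158 and c ≈ 0.040, matching the values quoted in the text.
```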

A possible explanation of the above phenomenon, as stated in the paper, is that it arises from our choice of rank as the independent variable rather than, for example, the word length: the exponential distribution of word frequencies as a function of length,

pword ∝ e^(−γL)

with γ being the scaling factor, turns into the power-law distribution of Zipf’s law when expressed as a function of the word’s rank.
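This change of variables can be made explicit in one line: since in this model the rank grows roughly exponentially with word length, r ∝ A^L, we have L ≈ log_A r up to a constant, so (a sketch of the step, in the notation above):

```latex
p_{word} \propto e^{-\gamma L} \approx e^{-\gamma \log_A r}
        = e^{-\gamma \ln r / \ln A}
        = r^{-\gamma/\ln A}
```

and with γ = ln(A + 1), as for the random text above, the exponent γ/ln A = ln(A + 1)/ln A is exactly the α obtained earlier.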

2.5 Addition. Zipf’s Law in Russian cities’ population distribution

Apart from the above, Zipf’s Law does not only work for texts. The question of economic activity in territorial space has been investigated by scientists for two hundred years, and contemporary works show economists’ interest in the manifestation of Zipf’s law in regional systems and in the distribution of cities according to the “rank-size” principle.


In addition to the main part of my exploration, I aimed to check this further implementation of Zipf’s law for Russian cities: to prove or refute the hypothesis that in Russia the Zipf coefficient depends on the size of the geographical territory of the Federal district. I achieved this by using the method of least squares that I explored in my investigation above. First I analyzed the implementation of Zipf’s law in Russia as a whole, and then separately for each Federal district. In total, 1,123 Russian cities with populations over 1,000 people entered the sample.

5 Wentian Li, “Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution”, Santa Fe Institute, March 1991.

Zipf’s law states that within a territory the size distribution of cities follows a Pareto distribution with an index equal to one.

Graphs reflecting the manifestation of the rank-size regularity (Zipf’s law) in cities at the regional and national levels of Russia are presented in Fig. 7.

Fig. 7. The rank-size dependence for cities of Russia in general and by Federal district: a – Central Federal district; b – North-Western Federal district; c – South Federal district; d – North-Caucasian Federal district.

Finally, in the process of checking Zipf’s law for the cities of Russia, I determined that the law holds for small cities (8,600–15,300 people) and for large ones (66,700–331,000 people). In the sample of cities with populations exceeding 100 thousand people, Zipf’s law does not hold for cities over 1 million people (the exception being St. Petersburg). The result of the research was the confirmation of the hypothesis that the Zipf coefficient depends on the size of the geographical territory of the Federal district.

Conclusion

Considering the mathematical exploration that I conducted, I would like to point out that Zipf’s law is formulated using mathematical statistics and refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power-law probability distributions. It has a surprisingly broad field of applications, including the word distributions of random texts and even the population statistics of cities. Through my research, I studied the least squares method in detail and learned to apply it, as well as to plot in log-log scale. Furthermore, I not only found that Zipf’s distribution law works for the literary work “Twenty Thousand Leagues Under the Sea”, but also showed that it continues to hold even after several stages of machine translation, despite the degradation of meaning. The graphs look almost identical, which supports the conclusion that the law does not depend on the semantics of a text and that even random texts are subject to Zipf’s law.

Moreover, as a second part of my exploration, I analyzed a sample of several Russian Federal districts and found that the Zipf coefficient varies from −0.64 (Far Eastern Federal district) to −0.9 (Ural and North Caucasus Federal districts). In the analysis of the sample of cities with populations over 100 thousand people, the coefficient came out as −1.13, which indicates the uniformity of the hierarchy of cities in this sample. The result of the research is the confirmation of the hypothesis that the Zipf coefficient depends on the size of the geographical territory of the Federal district.

Bibliography

1. Mansury Y., Gulyás L. The Emergence of Zipf’s Law in a System of Cities: An Agent-Based Simulation Approach. Journal of Economic Dynamics and Control, 2007, vol. 31, iss. 7, pp. 2438–2460.

2. Moura N.J., Ribeiro M.B. Zipf Law for Brazilian Cities. Physica A: Statistical Mechanics and its Applications, 2006, vol. 367(C), pp. 441–448.

3. Jiang B., Jia T. Zipf's Law for All the Natural Cities in the United States: A Geospatial Perspective. International Journal of Geographical Information Science, 2011, vol. 25, no. 8, pp. 1269–1281.

4. Xu Z., Harriss R. A Spatial and Temporal Autocorrelated Growth Model for City Rank–Size Distribution. Urban Studies, 2010, vol. 47, iss. 2, pp. 321–335.

5. Li W. Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution. Santa Fe Institute, March 1991.

6. Figures were made using GeoGebra software, https://www.geogebra.org

7. Plots for experiment results were made in Microsoft Excel.

8. The code for analyzing was written in Python, following https://code.tutsplus.com/tutorials/how-to-use-python-to-find-the-zipf-distribution-of-a-text-file--cms-26502

