IB Mathematical Exploration 2018
Subject: Mathematics HL
Table of contents:
1. Introduction
2. Investigation
   2.1 Results
3. Addition
4. Conclusion
5. Bibliography
Introduction
Zipf's law is an empirical regularity of natural language: if all words of a language (or just a sufficiently long text) are ordered in descending order of their usage frequency, then the frequency of the n-th word in such a list is approximately inversely proportional to n.
However, the American biologist Wentian Li tried to refute Zipf's law, rigorously showing that a random sequence of symbols also obeys it. The author draws the hypothetical conclusion that Zipf's law appears to be a purely statistical phenomenon, not related to the semantics of the text. The law fascinated me as soon as I encountered it. Since I am extremely interested in linguistics, I was very motivated to explore how Zipf's law manifests itself in literature and texts. Zipf's law essentially predicts that word frequency depends on word rank roughly as a reciprocal function. The graph below shows the distribution for different languages.1
1 https://en.wikipedia.org/wiki/Zipf%27s_law, 20.02.17
Figure 1. A plot of the rank versus frequency for the first 10 million words in 30
Wikipedias (dumps from October 2015) in a log-log scale.
Investigation
1. Read the file line by line with a ‘for’ loop and join the lines into one string.
2. Remove all punctuation marks and convert all letters to lowercase. This is essential: unless this step is made, words like “He” and “he” will be read as two different words.
3. Divide the string into words. Now we have the full list of the words. However, a problem emerges: we count different forms of the same word – like ‘be’ and ‘was’ – as different words, and to capture this one would need a lemmatizer. Even without one, Zipf’s law is still clearly observed.
4. Count the occurrences of each word and the total number of words, then sort in descending order of frequency.
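The steps above can be sketched in Python. This is a minimal version under my own assumptions: it skips the lemmatizer, and the file path and function name are placeholders, not the author's actual script.

```python
import re
from collections import Counter

def word_frequencies(path):
    # 1. Read the file line by line and join the lines into one string.
    with open(path, encoding="utf-8") as f:
        text = " ".join(f)
    # 2. Lowercase everything and drop punctuation, so "He" and "he" coincide,
    #    then split the string into words.
    words = re.findall(r"[a-z']+", text.lower())
    # (A lemmatizer would additionally merge forms like "be"/"was"; skipped here.)
    # 3. Count occurrences and sort in descending order of frequency.
    return Counter(words).most_common()

# Tiny in-memory example instead of a real file:
sample = "He said he would be there, and he was there."
ranked = Counter(re.findall(r"[a-z']+", sample.lower())).most_common()
# ranked[0] is ("he", 3): the most frequent word and its count
```

The `(word, count)` pairs returned by `most_common()` are already the (rank, frequency) data needed for the plots below.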
If one plotted the experimental points (rank_i, frequency_i) directly, such a plot would be very difficult to understand and interpret, since the curve hugs its asymptotes. There is no simple way of doing regression in such non-linear cases, so one plots in a log-log scale instead.
2 https://www.gutenberg.org/
Figure 2. The Log-Log scale plot.
Taking the logarithm of Zipf’s law

    p(r) = c / r^α

gives

    log p = log c − α log r,

i.e. y = −α x + log c with y = log p and x = log r, where α represents the slope of the line and log c its intercept. Writing the line as y = k x + b, we identify

    k = −α,  b = log c,  c = e^b.
The theory predicts that the log-log data should lie close to a straight line. Changing c only shifts that line vertically, which is why we are not as interested in c as in the slope α. In our IB syllabus we have the Least Squares Method, which is a solution for finding the best line fitting a set of points. Let us construct the best-fit line using this method.
Figure 3. The Least Squares Method.
The green line shown on the figure is the candidate best fit. The idea is simple: take the difference (in the vertical direction) – the mismatch – between each experimental point and the line’s prediction for the given x_i, and sum the squares of these differences over all points:
    S(k, b) = ∑ᵢ dᵢ² = ∑ᵢ (yᵢ − k xᵢ − b)²
This is now a function of k and b. Notice that, being a sum of squares, this function is a non-negative quadratic in (k, b), so its only stationary point is its minimum – and the minimum is what we’re looking for.
Let us take the derivatives of this function with respect to both of its arguments and set them to zero:

    ∂S/∂k = ∑ᵢ 2(yᵢ − k xᵢ − b)(−xᵢ) = 0
    ∂S/∂b = ∑ᵢ 2(yᵢ − k xᵢ − b)(−1) = 0
This leaves

    ∑ᵢ (yᵢ − k xᵢ − b) xᵢ = ∑ᵢ yᵢxᵢ − k ∑ᵢ xᵢ² − b ∑ᵢ xᵢ = 0
    ∑ᵢ (yᵢ − k xᵢ − b)      = ∑ᵢ yᵢ − k ∑ᵢ xᵢ − N b = 0
which is a system of linear equations in (k, b). The power of the least squares method is that it provides explicit formulae for the best values of the line parameters. Denote

    ⟨x⟩ = ∑ᵢ xᵢ / N,  ⟨x²⟩ = ∑ᵢ xᵢ² / N,  ⟨y⟩ = ∑ᵢ yᵢ / N,  ⟨xy⟩ = ∑ᵢ yᵢxᵢ / N.

Dividing both equations by N, the system becomes

    k ⟨x²⟩ + b ⟨x⟩ = ⟨xy⟩
    k ⟨x⟩ + b = ⟨y⟩

which solves to

    k = (⟨xy⟩ − ⟨x⟩⟨y⟩) / (⟨x²⟩ − ⟨x⟩²),  b = ⟨y⟩ − k ⟨x⟩.
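These explicit mean formulas translate directly into code. This is my own sketch, not the author's original program; the function name and test points are illustrative.

```python
def least_squares(xs, ys):
    """Best-fit line y = k*x + b via the explicit mean formulas."""
    n = len(xs)
    mx = sum(xs) / n                                # <x>
    my = sum(ys) / n                                # <y>
    mx2 = sum(x * x for x in xs) / n                # <x^2>
    mxy = sum(x * y for x, y in zip(xs, ys)) / n    # <xy>
    k = (mxy - mx * my) / (mx2 - mx * mx)
    b = my - k * mx
    return k, b

# Points lying exactly on y = -1.2*x + 3 should be recovered exactly:
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [-1.2 * x + 3.0 for x in xs]
k, b = least_squares(xs, ys)
```

Feeding in the (log rank, log frequency) points gives k = −α, so the Zipf exponent is read off as α = −k.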
Results
As I have already mentioned, I decided to check the novel “Twenty Thousand Leagues Under the Sea” in two variants – the French original and the English translation. For each of these I plotted log(word frequency) against log(word rank).
[Plot of log(word frequency) against log(word rank), axes 0–10 and 0–12]
Fig. 4 Data for the English translation. The green part is a cut for a better fit.
[Plot of log(word frequency) against log(word rank), axes 0–10 and 0–12]
Fig. 5 Data for the French original. The green part is a cut for a better fit.
To draw the best line, we cut the graphs and keep only the points with logarithm of rank between 2 and 6, so that the line is more accurate. Both graphs are thus constructed by the method of least squares, with green for the cut data and red for the uncut data.
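The cut-and-fit procedure can be sketched as follows. The data here are synthetic Zipf-like points I generate myself (exponent 1, with noise) standing in for the novel's counts; only the cut range, log rank 2 to 6, is from the text above.

```python
import math
import random

# Synthetic Zipf data: frequency ~ C / rank (alpha = 1), with small noise.
random.seed(0)
pts = [(math.log(r), math.log(1e6 / r) + random.gauss(0, 0.05))
       for r in range(1, 3001)]

# Keep only points with log-rank between 2 and 6, as described above.
cut = [(x, y) for x, y in pts if 2.0 <= x <= 6.0]

# Least-squares slope over the cut (the same mean formulas as before).
n = len(cut)
mx = sum(x for x, _ in cut) / n
my = sum(y for _, y in cut) / n
mx2 = sum(x * x for x, _ in cut) / n
mxy = sum(x * y for x, y in cut) / n
k = (mxy - mx * my) / (mx2 - mx * mx)   # should come out close to -1
```

Restricting to the middle of the rank range avoids the very frequent words and the long noisy tail, both of which visibly bend away from the line.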
As we can observe, both graphs match Zipf’s law very well. However, I wanted a more profound way to check whether the results change significantly. I thus took the English version of the text and translated it multiple times to and from French using Google Translate. Therefore, for each language there are three versions of the text to compare.
[Log-log plot comparing three versions of the text: Original English, En->Fr->En, and En->Fr->En->Fr->En]
The actual reason why Zipf’s law arises in natural languages is not yet clear to the full extent. Many studies of the signals used by intelligent animals (the dolphins’ whistling language3 being an example) rely on this law to show that, at least statistically, such systems look the same at large scale as numerous human languages, for which this law is undoubtedly observed.
3 https://arxiv.org/abs/1205.0321
However, it turns out that not much needs to be assumed about a text for it to obey this law – in fact, as is shown in a paper4, even a totally random text exhibits it.
We’ll call a random text a collection of symbols picked randomly from some alphabet of A letters plus the space sign, for example:
this_is_my_lovely_mathematical_exploration
By picking a sign at random from this set of (A + 1) signs we mean that all signs are equally probable – with probability p(sign) = 1/(A + 1) – so the signs are uniformly distributed over the text. This assumption simplifies the analysis, yet seems like total nonsense for natural languages: given the way written languages (at least alphabetic ones) form on the basis of spoken ones, it is extremely difficult to believe that letters would appear independently and uniformly.
Now the probability of a given separate (space-delimited) word of length L (so L + 2 signs, including the two surrounding spaces) is

    p_word(L) = a / (A + 1)^(L+2)

for any of the A^L words of such length. The constant multiple a is there to make sure this is a normalized probability distribution:

    ∑_{L=1}^{∞} A^L p_word(L) = (a / (A + 1)²) ∑_{L=1}^{∞} (A / (A + 1))^L = 1,

which contains a converging infinite geometric series with ratio q = A/(A + 1) < 1:
4 Wentian Li, “Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution”
    ∑_{L=1}^{∞} q^L = lim_{N→∞} q (q^N − 1)/(q − 1) = q / (1 − q) = (A/(A + 1)) / (1/(A + 1)) = A,

so that gives a = (A + 1)² / A. Now if one is interested in the total probability of all L-long words:
    p_word(L) · (# of words of length L) = A^L · 1/(A (A + 1)^L) = A^(L−1) / (A + 1)^L
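These formulas are easy to sanity-check numerically. The snippet below is my own check, for a toy alphabet of A = 3 letters: the length probabilities above should sum to 1.

```python
# Total probability of all words, by length:
# P(all words of length L) = A**(L-1) / (A+1)**L, summed over L = 1, 2, ...
A = 3
total = sum(A ** (L - 1) / (A + 1) ** L for L in range(1, 200))
# The geometric tail beyond L = 200 is negligible, so total should be ~1.
```

The same check works for any alphabet size, since the series is geometric with ratio A/(A + 1) < 1.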
We thus get quite naturally that, since in such a text all words of the same length are equally probable (with probabilities decaying exponentially in the word’s length), the ranks of words are determined solely by their lengths.
Consider all words of length L. For any word of this set all the words of smaller length
rank higher, so
    r(L) > ∑_{l=1}^{L−1} (# of words of length l) = ∑_{l=1}^{L−1} A^l = A (A^(L−1) − 1)/(A − 1)
and within these words of length L we need to assign exactly another A^L ranks, so the rank satisfies

    A (A^(L−1) − 1)/(A − 1) < r(L) ≤ A (A^L − 1)/(A − 1)
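These rank bounds can be verified by direct counting. The check below, for a toy alphabet of A = 3, is my own illustration: the lower bound is exactly the number of shorter words, and the upper bound adds the A^L words of length L itself.

```python
# Verify the rank bounds for words of length L by direct counting (A = 3).
A = 3
ok = True
for L in range(1, 8):
    shorter = sum(A ** l for l in range(1, L))   # ranks taken by shorter words
    lower = A * (A ** (L - 1) - 1) // (A - 1)    # A(A^(L-1) - 1)/(A - 1)
    upper = A * (A ** L - 1) // (A - 1)          # A(A^L - 1)/(A - 1)
    ok = ok and shorter == lower and shorter + A ** L == upper
```

Both divisions are exact because A − 1 always divides A^k − 1.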
To get Zipf’s law from this there are three steps left. First, rearranging the inequality and taking logarithms base A gives

    L − 1 < log_A((A − 1)/A · r(L) + 1) ≤ L,

so L ≈ log_A((A − 1)/A · r(L) + 1). Then, exponentiating base 1/(A + 1), multiplying by 1/A, and using the formula p_word(L) = 1/(A (A + 1)^L), one obtains a power law in the rank:

    p(r) ≈ c / (r + β)^α,

where

    α = log(A + 1)/log A,  β = A/(A − 1),  c = A^(α−1)/(A − 1)^α
are the constants of interest. For English, using the Latin alphabet of A = 26 characters, this gives α = log 27 / log 26 ≈ 1.01, remarkably close to the near-reciprocal exponent observed in real texts.
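Plugging A = 26 into these formulas is a one-liner check (my own snippet; the variable names simply mirror the symbols above):

```python
import math

# Constants of the random-text power law for the Latin alphabet, A = 26.
A = 26
alpha = math.log(A + 1) / math.log(A)     # exponent: log(A+1)/log(A)
beta = A / (A - 1)                        # shift in the rank
c = A ** (alpha - 1) / (A - 1) ** alpha   # overall constant
# alpha comes out just above 1, i.e. nearly the reciprocal law p ~ 1/r.
```

For large alphabets α → 1 and β → 1, so the random-text law becomes indistinguishable from the classical p(r) ∝ 1/r.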
A possible explanation of the above phenomenon, as stated in the paper, is that it arises from our choice of the rank as the independent variable rather than the word length: the exponential distribution over lengths,

    p_word ∝ e^(−γL),

with γ being the scaling factor, turns into the power-law distribution of Zipf’s law once rewritten in terms of the rank, because the rank grows exponentially with the length.
Addition
Apart from the above, Zipf’s law does not only work for texts. The question of the spatial distribution of economic activity has been investigated by scientists for two hundred years, including the application of Zipf’s law to regional systems and the distribution of cities according to the “rank-size” principle.
I decided to check the implementation of Zipf’s law for Russian cities, to prove or refute the hypothesis that in Russia the Zipf coefficient depends on the size of the geographical territory of the Federal district. I achieved this by using the method of least squares that I explored in my investigation above. First, I analyzed the implementation of Zipf’s law in Russia as a whole and then separately for each Federal district. In total, 1,123 cities of Russia were considered.
5 Wentian Li, “Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution”
Zipf’s law states that within a territory the size distribution of cities follows a power law in the rank.
[Figure: panels a)–d) – graphs reflecting the manifestation of the rank-size regularity (Zipf’s law) in the cities of the Federal districts]
Finally, in the process of checking Zipf’s law for the cities of Russia, I determined that the law holds for small (8,600–15,300 people) and large cities (66,700–331,000 people). In the sample of cities with population exceeding 100 thousand people, Zipf’s law does not hold for cities over 1 million people (the exception is St. Petersburg). The result of the research was the confirmation of the hypothesis about the dependence of the Zipf coefficient on the size of the geographical territory of the Federal district.
Conclusion
Considering the mathematical exploration that I conducted, I would like to point out that Zipf’s law is formulated using mathematical statistics and refers to the fact that many ranked data sets follow the same pattern, including word frequency distributions and even city population statistics. Through my research I studied the least squares method in a detailed way and learned to apply it, as well as to plot in a log-log scale. Furthermore, I not only found out that Zipf’s distribution law works for the literary work “Twenty Thousand Leagues Under the Sea”, but also showed that it continues to apply even after several stages of translation, despite the fact that the meaning deteriorates. The graphs look almost identical, which supports the view that the law does not depend on the semantics of the text. I also analyzed the Federal districts and found out that the Zipf coefficient varies from −0.64 (Far Eastern Federal district) to −0.9 (Ural and North Caucasus Federal districts). In the analysis of the sample of cities with population over 100 thousand people, the coefficient came out to −1.13, which indicates the uniformity of the hierarchy of cities in this sample. The result of the research is the confirmation of the hypothesis about the dependence of the Zipf coefficient on the size of the territory of the Federal district.
Bibliography
1. … Agent-Based Simulation Approach. Journal of Economic Dynamics and Control, 2007, vol. 31, iss. 7, pp. 2438–2460.
2. Moura N.J., Ribeiro M.B. Zipf Law for Brazilian Cities. Physica A: Statistical Mechanics and its Applications.
3. Jiang B., Jia T. Zipf's Law for All the Natural Cities in the United States: A Geospatial Perspective.
4. Xu Z., Harriss R. A Spatial and Temporal Autocorrelated Growth Model for City Rank–Size Distribution.
5. How to Use Python to Find the Zipf Distribution of a Text File. code.tutsplus.com/tutorials/how-to-use-python-to-find-the-zipf-distribution-of-a-text-file--cms-26502